
emcsween

In the JavaScript version, when using diff_cleanupSemantic(), some diffs cause the semantic alignment loop to run many times. This happens when comparing a file containing a long chunk of characters with a similar file containing the same long chunk twice in succession, i.e.:

File 1: <chunk A><chunk B><chunk C>
File 2: <chunk A><chunk B><chunk B><chunk C>

When this happens, the loop runs as many times as there are characters in chunk B. This can get quite expensive because three new strings are created in every iteration. This PR replaces those string operations with faster index manipulations.
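
To make the scenario concrete, here is a minimal illustration (a sketch, not code from this PR; it assumes the same local diff_match_patch_uncompressed build used in the benchmark below):

const DiffMatchPatch = require('./diff_match_patch_uncompressed');
const dmp = new DiffMatchPatch();

const text1 = 'aaa' + 'bbb' + 'ccc';          // <chunk A><chunk B><chunk C>
const text2 = 'aaa' + 'bbb' + 'bbb' + 'ccc';  // <chunk A><chunk B><chunk B><chunk C>

const diffs = dmp.diff_main(text1, text2);
// diffs is typically [[0, 'aaabbb'], [1, 'bbb'], [0, 'ccc']]: a single
// insertion surrounded by equalities, exactly the shape the semantic
// alignment loop slides back and forth to find the best split.
dmp.diff_cleanupSemantic(diffs);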

I tried to translate the algorithm line by line with the objective of changing nothing in its behaviour. Basically, instead of tracking 3 strings (equality1, edit, equality2), I track 1 string (buffer) and 2 indices (editStart, editEnd). They are related this way:

buffer: | equality1 | edit | equality2 |
                    ^      ^
        editStart --+      +-- editEnd

The other change I made was to the loop condition. The original code shifts the edit left as far as possible (using the common suffix between equality1 and edit), then shifts it right one character at a time until the first characters of edit and equality2 differ. I changed that to counting the common prefix between edit and equality2 (the number of right shifts possible) and adding it to the left-shift amount, which gives the total number of shifts required up front.
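
Putting the two changes together, here is a minimal sketch of the index-based loop (illustrative only: the function and helper names are mine, and the per-split scoring that the real loop does with diff_cleanupSemanticScore_ is elided):

function commonPrefixLength(a, b) {
  let n = 0;
  while (n < a.length && n < b.length && a[n] === b[n]) n++;
  return n;
}

function commonSuffixLength(a, b) {
  let n = 0;
  while (n < a.length && n < b.length &&
         a[a.length - 1 - n] === b[b.length - 1 - n]) n++;
  return n;
}

function alignmentCandidates(equality1, edit, equality2) {
  // Invariant: equality1 === buffer.slice(0, editStart),
  //            edit      === buffer.slice(editStart, editEnd),
  //            equality2 === buffer.slice(editEnd).
  const buffer = equality1 + edit + equality2;
  let editStart = equality1.length;
  let editEnd = editStart + edit.length;

  // First, shift the edit as far left as possible: two index decrements
  // instead of three string rebuilds.
  const leftShift = commonSuffixLength(equality1, edit);
  editStart -= leftShift;
  editEnd -= leftShift;

  // Total shifts = the left shift just applied, plus the number of right
  // shifts possible. Comparing the buffer from editStart (rather than just
  // the edit substring) matches the original loop's sliding semantics when
  // the repeated run is longer than the edit itself.
  const shifts = leftShift +
      commonPrefixLength(buffer.slice(editStart), buffer.slice(editEnd));

  // Each candidate split is just a pair of indices into the same buffer;
  // no new strings are created per iteration.
  const candidates = [];
  for (let shift = 0; shift <= shifts; shift++) {
    candidates.push([editStart + shift, editEnd + shift]);
  }
  return { buffer, candidates };
}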

I used the following benchmark to force the loop to run an arbitrary number of times:

const DiffMatchPatch = require('./diff_match_patch_uncompressed');
const dmp = new DiffMatchPatch();
const SIZE = 10000;  // varied per row in the table below
const text1 = 'a'.repeat(50) + 'b'.repeat(SIZE) + 'c'.repeat(50);
const text2 = 'a'.repeat(50) + 'b'.repeat(SIZE * 2) + 'c'.repeat(50);
const diffs = dmp.diff_main(text1, text2);
dmp.diff_cleanupSemantic(diffs);
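
To reproduce the timings, the final call above can be replaced with a wrapped version using Node's built-in console timers:

console.time('diff_cleanupSemantic');
dmp.diff_cleanupSemantic(diffs);
console.timeEnd('diff_cleanupSemantic');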

Here are the timings I got for diff_cleanupSemantic() before and after this PR.

SIZE        before     after
10          0.061 ms   0.019 ms
100         0.48 ms    0.03 ms
1,000       5.2 ms     0.13 ms
10,000      65 ms      1.2 ms
100,000     2.2 s      12.35 ms
1,000,000   too long   111 ms

Some diffs result in the semantic alignment loop being run many times.
This happens when comparing a file containing a long chunk of characters
with a similar file containing the same long chunk of characters twice
in succession.

Manipulating indexes rather than creating new strings at each iteration
makes the loop run much more quickly.