
emcsween

In the JavaScript version, when using diff_cleanupSemantic(), some diffs cause the semantic alignment loop to run many times. This happens when comparing a file containing a long chunk of characters with a similar file containing the same long chunk twice in succession, i.e.:

File 1: <chunk A><chunk B><chunk C>
File 2: <chunk A><chunk B><chunk B><chunk C>

When this happens, the loop runs as many times as there are characters in chunk B. This can get quite expensive because three new strings are created in every iteration. This PR replaces those string operations with faster index manipulations.
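
To make the scenario concrete, here is a minimal illustration (a sketch, not code from this PR; it assumes the same local diff_match_patch_uncompressed build used in the benchmark below):

const DiffMatchPatch = require('./diff_match_patch_uncompressed');
const dmp = new DiffMatchPatch();

const text1 = 'aaa' + 'bbb' + 'ccc';          // <chunk A><chunk B><chunk C>
const text2 = 'aaa' + 'bbb' + 'bbb' + 'ccc';  // <chunk A><chunk B><chunk B><chunk C>

const diffs = dmp.diff_main(text1, text2);
// diffs is typically [[0, 'aaabbb'], [1, 'bbb'], [0, 'ccc']]: a single
// insertion surrounded by equalities, exactly the shape the semantic
// alignment loop slides back and forth to find the best split.
dmp.diff_cleanupSemantic(diffs);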

I tried to translate the algorithm line by line with the objective of changing nothing in its behaviour. Basically, instead of tracking 3 strings (equality1, edit, equality2), I track 1 string (buffer) and 2 indices (editStart, editEnd). They are related this way:

buffer: | equality1 | edit | equality2 |
                    ^      ^
        editStart --+      +-- editEnd

The other change I made was to the loop condition. The original code shifts the edit left as far as possible (using the common suffix between equality1 and edit), then shifts it right one character at a time until the first characters of edit and equality2 differ. I changed that to counting the common prefix between edit and equality2 (the number of right shifts possible) and adding it to the left-shift amount, which gives the total number of shifts required up front.
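
Putting the two changes together, here is a minimal sketch of the index-based loop (illustrative only: the function and helper names are mine, and the per-split scoring that the real loop does with diff_cleanupSemanticScore_ is elided):

function commonPrefixLength(a, b) {
  let n = 0;
  while (n < a.length && n < b.length && a[n] === b[n]) n++;
  return n;
}

function commonSuffixLength(a, b) {
  let n = 0;
  while (n < a.length && n < b.length &&
         a[a.length - 1 - n] === b[b.length - 1 - n]) n++;
  return n;
}

function alignmentCandidates(equality1, edit, equality2) {
  // Invariant: equality1 === buffer.slice(0, editStart),
  //            edit      === buffer.slice(editStart, editEnd),
  //            equality2 === buffer.slice(editEnd).
  const buffer = equality1 + edit + equality2;
  let editStart = equality1.length;
  let editEnd = editStart + edit.length;

  // First, shift the edit as far left as possible: two index decrements
  // instead of three string rebuilds.
  const leftShift = commonSuffixLength(equality1, edit);
  editStart -= leftShift;
  editEnd -= leftShift;

  // Total shifts = the left shift just applied, plus the number of right
  // shifts possible. Comparing the buffer from editStart (rather than just
  // the edit substring) matches the original loop's sliding semantics when
  // the repeated run is longer than the edit itself.
  const shifts = leftShift +
      commonPrefixLength(buffer.slice(editStart), buffer.slice(editEnd));

  // Each candidate split is just a pair of indices into the same buffer;
  // no new strings are created per iteration.
  const candidates = [];
  for (let shift = 0; shift <= shifts; shift++) {
    candidates.push([editStart + shift, editEnd + shift]);
  }
  return { buffer, candidates };
}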

I used the following benchmark to force the loop to run an arbitrary number of times:

const DiffMatchPatch = require('./diff_match_patch_uncompressed');
const dmp = new DiffMatchPatch();
const SIZE = 10000;  // varied per row in the table below
const text1 = 'a'.repeat(50) + 'b'.repeat(SIZE) + 'c'.repeat(50);
const text2 = 'a'.repeat(50) + 'b'.repeat(SIZE * 2) + 'c'.repeat(50);
const diffs = dmp.diff_main(text1, text2);
dmp.diff_cleanupSemantic(diffs);
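
To reproduce the timings, the final call above can be replaced with a wrapped version using Node's built-in console timers:

console.time('diff_cleanupSemantic');
dmp.diff_cleanupSemantic(diffs);
console.timeEnd('diff_cleanupSemantic');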

Here are the timings I got for diff_cleanupSemantic() before and after this PR.

SIZE        before     after
10          0.061 ms   0.019 ms
100         0.48 ms    0.03 ms
1,000       5.2 ms     0.13 ms
10,000      65 ms      1.2 ms
100,000     2.2 s      12.35 ms
1,000,000   too long   111 ms

Some diffs result in the semantic alignment loop being run many times.
This happens when comparing a file containing a long chunk of characters
with a similar file containing the same long chunk of characters twice
in succession.

Manipulating indexes rather than creating new strings at each iteration
makes the loop run much more quickly.