This repository was archived by the owner on Aug 5, 2024. It is now read-only.

Conversation

dmsnell

Fixes #10, #59, #68
Alternate to #69

Fixes for Java, JavaScript, Objective C, Python2, Python3

Status

  • Please scrutinize the code and think of any test cases not covered already in this .
  • We've deployed these changes in Simplenote and are uncovering any issues we can find. We've been running the code in this for years now and no issues or regressions have appeared.
  • Please let us know your thoughts on the direction here. This is a pragmatic in that it's addressing the problem where it's at and tries to impose no new constraints on the library where possible.
  • Since we haven't received any direction it's my intention to merge this into my fork when done and start referencing dmsnell/diff-match- instead of google/diff-match-. I'd prefer merging these fixes upstream and would happily do the work required if @NeilFraser provides guidance or opinions or a simple 👍.
  • Everything in this improves current problems. Even though this doesn't resolve all of the problems breaking surrogate pairs everywhere it's only a net win for many use cases. I'm working on fixing this everywhere though, but don't see any reason to

Todo:

  • Fix CPP or verify not a problem
  • Fix CSharp or verify not a problem
  • Fix Dart or verify not a problem
  • Fix Lua or verify not a problem
  • Refactor to use cleanupSplitSurrogates in JavaScript
  • Refactor to use cleanupSplitSurrogates in Java
  • Refactor to use cleanupSplitSurrogates in Objective C
  • Refactor to use cleanupSplitSurrogates in Python2
  • Refactor to use cleanupSplitSurrogates in Python3
  • Refactor to use cleanupSplitSurrogates in CPP
  • Refactor to use cleanupSplitSurrogates in CSharp
  • Refactor to use cleanupSplitSurrogates in Dart
  • Refactor to use cleanupSplitSurrogates in Lua
  • Fix _toText in JavaScript
  • Fix _toText in Java
  • Fix _toText in Objective C
  • Fix _toText in Python2
  • Fix _toText in Python3
  • Fix _toText in CPP
  • Fix _toText in CSharp
  • Fix _toText in Dart
  • Fix _toText in Lua
  • Figure out a "minimal" set of unit tests so we can get rid of the big chunk currently in the PR, then carry it around to all the libraries. The triggers are well understood, so we can write targeted tests instead of broad ones.

Major changes

  • in toDelta() we no longer attempt to create URI-encoded strings with split surrogates. this was either crashing client applications or corrupting the data, depending on which platform was in use.
  • in fromDelta() we are recovering when possible from corrupted delta strings produced by uned versions of this library. there are two fixes:
    • sometimes we encode surrogate-pair halves as if they were valid Unicode (they aren't). we'll now decode those anyway because after applying a diff with these invalid code points we should re-join the surrogate halves and get valid Unicode out
    • in uned ObjectiveC libraries we might have created invalid diffs where we send (null) in between two surrogate halves. this is caused by the original bug and we might receive these sequences in a delta string: (high surrogate)(null)(low surrogate). If we leave these in here then diff-match- will operate fine but it will send invalid Unicode that it created onto the consuming application. in this I've added guards against this very specific pattern so that we can remove the unintentionally injected (null) and re-join the surrogate halves. this means that we're mangling the input but that input would have already been mangled anyway and presumably it'd be "impossible" to send the same sequence of code units to diff-match- since something would have had to send invalid Unicode in the first place.

Working on this bug has surfaced a bigger question about what diff-match- expects. All of the Simplenote applications work at the level of UTF-16 code units and a large part of that is because this is how most of the diff-match- platforms work.

Something seems ideal about referencing Unicode code points in diff-match- instead of any particular encoding, but to change now seems like it would break way more than it would fix. I'd like to propose that we think about creating a new output version which could be selected to specify that diffs do that.

diff_toDelta(
  diff_main( "🅰🅲", "🅰🅱🅲" )
  // no specification means existing behavior
) // "=2	+%F0%9F%85%B1	=2";

diff_toDelta(
  diff_main( "🅰🅲", "🅰🅱🅲" ),
  { unit: 'native' } // existing behavior - whatever the unit of a string is
) // "=?	+%F0%9F%85%B1	=?";

diff_toDelta(
  diff_main( "🅰🅲", "🅰🅱🅲" ),
  { unit: 'ucs2' } // make UCS2 the "default" since most libraries used it already, or something
) // "=2	+%F0%9F%85%B1	=2";

diff_toDelta(
  diff_main( "🅰🅲", "🅰🅱🅲" ),
  { unit: 'codePoint' } // empty addition group at the beginning to denote Unicode indices
) // "+	=1	+%F0%9F%85%B1	=1"; // empty group ignored by old clients

Sometimes we can find a common prefix that runs into the middle of a
surrogate pair and we split that pair when building our diff groups.

This is fine as long as we are operating on UTF-16 code units. It
becomes problematic when we start trying to treat those substrings as
valid Unicode (or UTF-8) sequences.

When we pass these split groups into toDelta() we do just that and the
library crashes. In this we're post-processing the diff groups
before encoding them to make sure that we un-split the surrogate pairs.

The post-processed diffs should produce the same output when applying
the diffs. The diff string itself will be different but shouldn't change
that much - only by a single character at surrogate boundaries.

Notes

  • This is my first contribution to this repo and I'm obviously not familiar with the style and this is obviously incomplete. My purpose is to propose code with working tests to demonstrate the change and I expect to clean up any remaining details that need resolving.
  • I've pulled in the surrogate tests that @josephrocca added and they are all passing. In JS: Handle surrogate pairs correctly #69 the issue causing these failed tests was using isSurrogate() for all cases when we should have been using isHighSurrogate() and isLowSurragate() depending on the context.

Thanks for any and all help and feedback.

@dmsnelldmsnell changed the title JavaScript: Stop breaking surrogate pairs in toDelta() Stop breaking surrogate pairs in toDelta() Nov 9, 2019
@dmsnelldmsnell force-pushed the issues/69-broken-surrogate-pairs branch from 01426e2 to 71646fb Compare November 9, 2019 21:07
dmsnell added a commit to Simperium/node-simperium that referenced this pull request Nov 12, 2019
See google/diff-match-#80

There have been certain cases where consecutive surrogate pairs crash
`diff-match-` when it's building the _delta string_. This is
because the diffing algorithm finds a common prefix up to and including
half of the surrogate pair match when three consecutive code points
share the same high surrogate.

```
// before - \ud83d\ude4b\ud83d\ue4b
// after  - \ud83d\ude4b\ud83d\ue4c\ud83d\ude4b
// deltas - eq \ud83d\ude4b\ud83d -> add \ude4c -> eq \ud83d \ude4b
```

This crashes when trying to encode the invalid sequence of UTF-16 code
units into `URIEncode`, which expects a valid Unicode string.

After the fix the delta groups are normalized to prevent this situation
and the `node-simperium` library should be able to process those
problematic changesets.
dmsnell added a commit to Simperium/node-simperium that referenced this pull request Nov 12, 2019
See google/diff-match-#80

There have been certain cases where consecutive surrogate pairs crash
`diff-match-` when it's building the _delta string_. This is
because the diffing algorithm finds a common prefix up to and including
half of the surrogate pair match when three consecutive code points
share the same high surrogate.

```
// before - \ud83d\ude4b\ud83d\ue4b
// after  - \ud83d\ude4b\ud83d\ue4c\ud83d\ude4b
// deltas - eq \ud83d\ude4b\ud83d -> add \ude4c -> eq \ud83d \ude4b
```

This crashes when trying to encode the invalid sequence of UTF-16 code
units into `URIEncode`, which expects a valid Unicode string.

After the fix the delta groups are normalized to prevent this situation
and the `node-simperium` library should be able to process those
problematic changesets.
dmsnell added a commit to Automattic/simplenote-electron that referenced this pull request Nov 13, 2019
See: Simperium/node-simperium#91
See: google/diff-match-#80

There have been certain cases where consecutive surrogate pairs crash
`diff-match-` when it's building the _delta string_. This is
because the diffing algorithm finds a common prefix up to and including
half of the surrogate pair match when three consecutive code points
share the same high surrogate.

```
// before - \ud83d\ude4b\ud83d\ue4b
// after  - \ud83d\ude4b\ud83d\ue4c\ud83d\ude4b
// deltas - eq \ud83d\ude4b\ud83d -> add \ude4c -> eq \ud83d \ude4b
```

This crashes when trying to encode the invalid sequence of UTF-16 code
units into `URIEncode`, which expects a valid Unicode string.

After the fix the delta groups are normalized to prevent this situation
and the `node-simperium` library should be able to process those
problematic changesets.

This  updates the dependency on the ed version of
`node-simperium`.
dmsnell added a commit to Simperium/node-simperium that referenced this pull request Nov 14, 2019
See google/diff-match-#80

There have been certain cases where consecutive surrogate pairs crash
`diff-match-` when it's building the _delta string_. This is
because the diffing algorithm finds a common prefix up to and including
half of the surrogate pair match when three consecutive code points
share the same high surrogate.

```
// before - \ud83d\ude4b\ud83d\ue4b
// after  - \ud83d\ude4b\ud83d\ue4c\ud83d\ude4b
// deltas - eq \ud83d\ude4b\ud83d -> add \ude4c -> eq \ud83d \ude4b
```

This crashes when trying to encode the invalid sequence of UTF-16 code
units into `URIEncode`, which expects a valid Unicode string.

After the fix the delta groups are normalized to prevent this situation
and the `node-simperium` library should be able to process those
problematic changesets.
dmsnell added a commit to Automattic/simplenote-electron that referenced this pull request Nov 14, 2019
* Fix: No more crashing/skipped changes for certain changes

See: Simperium/node-simperium#91
See: google/diff-match-#80

There have been certain cases where consecutive surrogate pairs crash
`diff-match-` when it's building the _delta string_. This is
because the diffing algorithm finds a common prefix up to and including
half of the surrogate pair match when three consecutive code points
share the same high surrogate.

```
// before - \ud83d\ude4b\ud83d\ue4b
// after  - \ud83d\ude4b\ud83d\ue4c\ud83d\ude4b
// deltas - eq \ud83d\ude4b\ud83d -> add \ude4c -> eq \ud83d \ude4b
```

This crashes when trying to encode the invalid sequence of UTF-16 code
units into `URIEncode`, which expects a valid Unicode string.

After the fix the delta groups are normalized to prevent this situation
and the `node-simperium` library should be able to process those
problematic changesets.

This  updates the dependency on the ed version of
`node-simperium`.
@dmsnell

@NeilFraser is there anything I can do to help this /issue move along?

@dmsnelldmsnell force-pushed the issues/69-broken-surrogate-pairs branch 2 times, most recently from fbb8c73 to 55a8779 Compare December 13, 2019 06:47
@dmsnell

Quick update for those following: we identified issues when the surrogate pairs were in DIFF_DELETE groups which led to crashing behaviors.

I've updated the PR with an improved to the JS/Java libraries and hope to port the changes to the other libraries tomorrow.

In this update I've included a custom decodeURI. The Python library happily URI-encodes a string that isn't valid UTF-8 and that causes other libraries to crash. Well, it's fine by us if we treat half of a surrogate pair as its own code point. The illegitimate part comes because those code points are reserved for surrogate pairs, not because there's no way to decode the value as if it were a real code point.

This custom decodeURI makes it possible for clients to handle diff deltas produced by libraries that don't have the fixes in this PR.

@dmsnelldmsnell changed the title Stop breaking surrogate pairs in toDelta() Stop breaking surrogate pairs in toDelta()/fromDelta() Dec 13, 2019
@NeilFraser

Q: What happens in the different languages when diffing "人" vs "中"? JavaScript should just say delete one character, insert another. But what about Python?

@dmsnell

Thank you so much @NeilFraser for the response!

Q: What happens in the different languages when diffing "人" vs "中"? JavaScript should just say delete one character, insert another. But what about Python?

Right now it appears like diff-match- leaves this unspecified. The Lua library of course makes this plain and clear.

It doesn't even need to be a question about language though. Python2 is internally inconsistent because if Python is compiled in narrow mode it returns different numbers than if Python is compiled in wide mode.

In my I'm normalizing doDelta on UTF-16 code units because that seems most consistent with most libraries in most situations. Fundamentally I don't see a way to provide full backwards compatibility on this change because there was no backwards behavior - or the behavior was already backwards 🙃

In other words before my it's more different across libraries and platforms than after it. After this the languages which have been implemented here should all return the same delta - delete one character, insert another.

In tests across JavaScript, Java, Objective-C, Python2, Python3

@dmsnell

@NeilFraser is there a way to run the suite of relative performance diffs or is that simply generated from running speedtest in each language/platform?

I've been running speedtest on this PR and I haven't seen anything significant in terms of performance. I think I've buttoned up about as much as we can with the bug.

the only real major issue is the question of what the indices should be, and I've tried to not change that other than in Python2 compiled in wide mode and in Python 3. we can back out the Python 3 change if we need to preserve the behavior. the other languages/platforms in this all previously returned UCS-2 code units.

@dmsnelldmsnell mentioned this pull request Jan 7, 2020
@dmsnelldmsnell force-pushed the issues/69-broken-surrogate-pairs branch from bdc4740 to fa122d3 Compare January 14, 2020 01:49
@salzhrani

We also faced an issue with surrogate pairs here where the substring breaking doesn't account for surrogate pairs.

dmsnell added a commit to Simperium/node-simperium that referenced this pull request Feb 5, 2020
In 1.0.2 we applied an update from google/diff-match-#80 to resolve the
surrogate pair encoding issue.  Since that time we found a bug in the fix and
resolved that and therefore this  brings those additional updates to
Simeprium.

There previously remained issues when decoding broken es that already
existed or which came from uned versions of the iOS library and also
remained issues when unexpectedly receiving empty diff groups in `toDelta`
dmsnell added a commit to Simperium/node-simperium that referenced this pull request Feb 17, 2020
In 1.0.2 we applied an update from google/diff-match-#80 to resolve the
surrogate pair encoding issue.  Since that time we found a bug in the fix and
resolved that and therefore this  brings those additional updates to
Simeprium.

There previously remained issues when decoding broken es that already
existed or which came from uned versions of the iOS library and also
remained issues when unexpectedly receiving empty diff groups in `toDelta`
@dmsnell

@NeilFraser what can we do to be helpful here? We've been running with these changes in Simplenote for more than a month and it eliminated all of the encoding issues we were experiencing.

I know that there is still a "better option" of adjusting the way that we break strings up so that we never get broken surrogate halves, but from a practicality and simplicity standpoint I think this is solid (maybe there are bugs we've silently missed).

@dmsnell

When reading through a Rust implementation I found that the author created a separate cleanup_char_boundary(diffs) function to adjust the output when the diff groups would otherwise split. that's a method very close to this one except it occurs in the diffing stage instead of the delta production.

https://.com/dtolnay/dissimilar/blob/master/src/lib.rs#L402

@mustafamunawar

@dmsnell What's the status of this? Does Unicode work correctly yet in master?

@dmsnell

@mustafamunawar we are waiting feedback or direction from @NeilFraser. We have been using this branch successfully in Simplenote for the past nine months without issue. You are welcome to try it out and watch this PR for an eventual merge.

valr added a commit to valr/website that referenced this pull request May 3, 2021
@dmsnell

@dmak

first doing the diff on byte level and then mutating (extending / shrinking) a bit the resulting diffs so that they do not eventually stop in the middle of multi-byte sequence - @dmak

this is the basic procedure in this PR as well as has been taken in the dissimilar library linked above. it's a good idea and one that solves the problem of crashing when splitting Unicode boundaries.

linking here to avoid clouding your Maven PR (#2) with noise about Unicode boundaries

@dmsnell

@NeilFraser I ran into this problem again today through a dependency-of-a-dependency pulling in diff-match- (via jsondiff) and wondering what I can do to help move this forward.

For Simplenote we're able to pull in this branch and that works, but with the transitive dependencies in other projects we don't get as much of a luxury. Would rather avoid forking the repo especially since everything in the world basically refers to this one (and you've put so much work into it), but I feel like if we don't fix master we're going to continue to spot this Unicode violation over and over again and I don't know how to get there without your help.

Again, I'm happy to put more energy into the fix. We've been going for more than two years with this branch in Simplenote and as far as we can infer the issue is entirely eliminated and no side-effects have crept in. Our users write some very long texts too and we've had no indication of performance issues.

The approach in the Rust library I think is more elegant than my solution but I think mine here is the least intrusive on the overall library, addressing the problem of splitting surrogate pairs only where it becomes a problem, that is, when URL-encoding the .

Really appreciate all you've built here and I assume you're busy; just trying to find a way to keep this from being a constant headache.

@gamedevsam

I've tried all 3 solutions (#13, #69, #80) to this problem and #13 is the only one that solved the encodingURI exception for me. I created a minimal repo which clearly reproduces the issue: https://.com/gamedevsam/diff-match--emoji-issue

Hope it's helpful to make this solution more robust.

@dmsnell

@gamedevsam I think it's possible you confused diff-match- with jsondiff. This has been working fine with Simplenote for coming up on three years now and has been fully reliable.

It does limit itself to toDelta, as noted, and jsondiff is using the toText functions. If you'd like you can propose a commit on this PR to add support there as the process should be the same as with toDelta. I'm busy on other things so I probably won't do this any time soon without other contributions and since @NeilFraser has been silent on the issue (if @NeilFraser could give us some direction or indication that this would merge I'd go ahead and clean up the ).

Unfortunately #13 has a massive performance problem particularly for long texts and likely that approach (converting strings to arrays of integers) won't be viable for a generic library like this, especially since this (#80) has the fix to get to the same result.

You can check with the author of jsondiff and look at these lines. Get that library to use this and then apply the same updates to toText as we have with toDelta and it should work for you 👍

@michal-kurz

@dmsnell Thank you so much for all the work you've put into this accross all the threads! I initially tried fixing this myself, and you helped me realize my approach was way too naive! :)

We use jsondiff, and are running into this problem. Since this repo seems to be dead, I would love to make a public jsondiff fork that fixes this issue. I came up with a POC fix of toText() function which builds on top of your PR, Here it is. Would you please care to review it? It seems to fix our use-case, but it's very difficult for me predict if it could introduce any new problems (the tests pass, but that doesn't give me much confidence) - maybe you will see something I don't :)

About my solution:

  • I pretty much just separated your surrogate-pair-joining logic into a separate function, as it seems to me like a concern separate from any other - it may make sense to use it in toDelta() also, but I didn't do that for now to keep the diff minimal
  • I'm also wondering whether it's optimal to be fixing this problem at point of toDelta()/toText() conversion - have you maybe considered fixing at creation, so this problem would never occur in the first place? Maybe inside _make, or something. Although it may very well be that other functions like diff_cleanupSemantic could re-break the ... I haven't tried this, just wondering if there could be a central way to fix this once for good :)
  • It's quite clear that dpm prioritizes performance above all else, and it's hard for me to reason about such micro-optimizations.
    • Do you think there is any value in not creating a separate newDiffs array, but writing into the existing one instead (and popping the difference in length in the end, since your algo skips empty es)?
    • Likewise, it might be preferable not to use my function inside toDelta(), as it would require an additional iteration of all diffs for each

🙏

@dmsnell

Nice work @michal-kurz. If you want to propose a PR with that code against this branch we can merge it in. I'd like to explore using your function in toDelta, and I'd like to look at removing the extra iteration, as I don't believe it offers any value above the cloning method you took (though I'm happy to be shown something I'm overlooking!)

@michal-kurz

@dmsnell Thank you! I made a PR to your branch. Unfortunately there are some unwanted changes in whitespaces comming with it - I spent the last 25 minutes failing to undo them, and the diff keeps looking different from my local, no idea why. I'm way too frustrated to continue with that at this point 😠

As for incorporating the splitSurrotagePairs into toDelta: yes, I don't think there is anything to gain other than proper separation of that concern, which I tend to generally value much higher than preventing an additional iteration - I really don't think there is a clean way to retain both. But I guess you could find some middle ground (i.e. sending a callback inside splitSurrogatePairs to do the bidding of the invoker rather than returning corrected diffs).

@dmsnell

Unfortunately there are some unwanted changes in whitespaces comming with it - I spent the last 25 minutes failing to undo them, and the diff keeps looking different from my local

oh the pain; I feel you: been there, done that. typically I find that some IDE is adding or removing the extra whitespace. did you try using nano or vim or pico or some editor that won't muck around with anthing?

might also verify you don't have custom pre-commit hooks that are running formatting.


by the way I suspect you meant to poke me, @dmsnell, and not dmaclach

@michal-kurz

@dmsnell Solved, I ended up soft-resetting, re-commiting only the desired lines and force-pushing the result.

And yes, I did mess up the mention tag, also fixed.

@michal-kurz

So unfortunately I found my solution for toText to be incomplete.

I sometimes end up with a single surrogate on the very edge of the whole (first char of first diff, or last char of last diff). For example:

str1 = '. 🔧🚗\n\--X'
str2 = '. 🔧🚗\n\--Y'

dmp._make(str1, str2)
// Results in three diffs:
// [
//    {0: 0, 1: '\uDE97\n--'},
//    {0: -1, 1: 'X'},
//    {0: 1, 1: 'Y'}
// ]

Such case cannot be fully "losslessly" shifted with @dmsnell 's approach - I have to decide to either expand or shrink the whole diff scope.

In my case, the broken diffs were created as prefix and suffix generated inside _addContext_ - those happen to get sampled at a fixed distance from the actual change.

I ed those by extending them one character outward in case it's an unpaired surrogate, and this seems to have solved the problem (here is a PR targeted at our fork of dmp). I decided for this approach, because I'm worried that shrinking the scope of the diff could cause some unforseen problems (mainly when coverting back to via fromText and then using inside other functions). And I unfortunately couldn't do this inside toText, because I don't have access to the full text there (and thus don't know what character to expand my diff with).

I hope this is the only potential source of such broken . If so, the toDelta solution in this PR should be safe, as prefix and suffix are both DIFF_EQUAL, and toDelta is only concerned about diff's length in that case, not its contents.

@dmsnell

@michal-kurz I added a that's almost identical to yours here (should have read yours first 🙃) but I make a couple extra bounds checks to avoid extending the context before or after the document boundaries.

had to apply both your fix though and the _toDelta fix at the same time else we were still getting split surrogates in the _ family of functions. by moving my fix into a new cleanupSplitSurrogates function though that seems to cover the bases.

one thing I wish were better is that when I moved cleanupSplitSurrogates into the same chain of fixup commands through the stages of diffing, it seemed to break. as is, it appears to be necessary as the last cleanup in the chain, reserved for output functions that encode URIs.

that being said, we're still without regression in Simplenote and other places at Automattic where we've deployed this and it's been a few years now, so I think whatever practical concerns there may be with this approach, it's clearly better than master and clearly not introducing surprise gotchas that will amount to anything near the existing problems. in other words, we haven't found any problems with it even though there could be something in this I've missed.

if there are problems with the approach in this PR, they are so rare that it's not comparable to the wildly-common bugs present in master.

Resolves google#69 for JavaScript

Sometimes we can find a common prefix that runs into the middle of a
surrogate pair and we split that pair when building our diff groups.

This is fine as long as we are operating on UTF-16 code units. It
becomes problematic when we start trying to treat those substrings as
valid Unicode (or UTF-8) sequences.

When we pass these split groups into `toDelta()` we do just that and the
library crashes. In this  we're post-processing the diff groups
before encoding them to make sure that we un-split the surrogate pairs.

The post-processed diffs should produce the same output when applying
the diffs. The diff string itself will be different but should change
that much - only by a single character at surrogate boundaries.
Resolves google#69 for Java

Sometimes we can find a common prefix that runs into the middle of a
surrogate pair and we split that pair when building our diff groups.

This is fine as long as we are operating on UTF-16 code units. It
becomes problematic when we start trying to treat those substrings as
valid Unicode (or UTF-8) sequences.

When we pass these split groups into `toDelta()` we do just that and the
library crashes. In this  we're post-processing the diff groups
before encoding them to make sure that we un-split the surrogate pairs.

The post-processed diffs should produce the same output when applying
the diffs. The diff string itself will be different but should change
that much - only by a single character at surrogate boundaries.
Resolves google#69 for Objective-C

Sometimes we can find a common prefix that runs into the middle of a
surrogate pair and we split that pair when building our diff groups.

This is fine as long as we are operating on UTF-16 code units. It
becomes problematic when we start trying to treat those substrings as
valid Unicode (or UTF-8) sequences.

When we pass these split groups into `toDelta()` we do just that and the
library crashes. In this  we're post-processing the diff groups
before encoding them to make sure that we un-split the surrogate pairs.

The post-processed diffs should produce the same output when applying
the diffs. The diff string itself will be different but should change
that much - only by a single character at surrogate boundaries.
Resolves google#69 for Python2

Sometimes we can find a common prefix that runs into the middle of a
surrogate pair and we split that pair when building our diff groups.

This is fine as long as we are operating on UTF-16 code units. It
becomes problematic when we start trying to treat those substrings as
valid Unicode (or UTF-8) sequences.

When we pass these split groups into `toDelta()` we do just that and the
library crashes. In this  we're post-processing the diff groups
before encoding them to make sure that we un-split the surrogate pairs.

The post-processed diffs should produce the same output when applying
the diffs. The diff string itself will be different but should change
that much - only by a single character at surrogate boundaries.
Resolves google#69 for Python3

Sometimes we can find a common prefix that runs into the middle of a
surrogate pair and we split that pair when building our diff groups.

This is fine as long as we are operating on UTF-16 code units. It
becomes problematic when we start trying to treat those substrings as
valid Unicode (or UTF-8) sequences.

When we pass these split groups into `toDelta()` we do just that and the
library crashes. In this  we're post-processing the diff groups
before encoding them to make sure that we un-split the surrogate pairs.

The post-processed diffs should produce the same output when applying
the diffs. The diff string itself will be different but should change
that much - only by a single character at surrogate boundaries.
@dmsnelldmsnell force-pushed the issues/69-broken-surrogate-pairs branch from d681ef6 to 50f1542 Compare January 30, 2024 23:54
@dmsnell

All: I have merged this into my fork and it's now in the master branch.
I'd welcome help in porting the fix to the other libraries, or in rewriting the fix to use the approach taken by dissimilar.

Unfortunately I won't have a lot of time to actively maintain the fork, but I will do my best to be responsive and help merge in updates.

Sign up for free to subscribe to this conversation on . Already have an account? Sign in.
None yet
None yet

Successfully merging this pull request may close these issues.

JavaScript implementation crashes on Unicode code points