Conversation

thiagotnunes

The Spanner server can return an internal error closing the connection stream. When this happens an InternalException is thrown with a specific EOS message. We are seeing this specially in long running PDML tasks.

This case is now retried by the client, either by resuming the existing transaction (when there is a resume token) or by starting a new transaction.

@google-clagoogle-cla bot added the cla: yesThis human has signed the Contributor License Agreement.label Jul 20, 2020
It is possible to have the stream closed with an EOS internal error.
This should be retried by the client. On this PR we add this retry logic.
@@ -55,23 +56,6 @@
this.rpc = rpc;
}

private ByteString initTransaction() {
Copy link
Contributor Author

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

This method was just moved down (after the public methods).

@@ -127,20 +111,24 @@ long executePartitionedUpdate(final Statement statement, final Duration
}
}
break;
} catch (UnavailableException e) {
} catch (UnavailableException | InternalException e) {
Copy link
Contributor Author

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

In this catch block we retry on EOS exception.

private boolean shouldResumeOrRestartTransaction(Exception e) {
return e instanceof UnavailableException
|| (e instanceof InternalException
&& e.getMessage().contains("Received unexpected EOS on DATA frame from server"));
Copy link
Contributor Author

@thiagotnunes thiagotnunes Jul 21, 2020

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

This is hacky, but unfortunately we do not get a specific exception for this error, so we have to proxy through the exception message. This follows the approach taken in the bigquery client: https://.com/googleapis/java-bigquerystorage/pull/263/files.

Copy link
Collaborator

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

The same error is also retried for queries, and this check could probably use the existing check in SpannerExceptionFactory:

private static boolean isRetryable(ErrorCode code, @Nullable Throwable cause) {
. It's still just as hacky, but it would keep it confined to one method.

@thiagotnunesthiagotnunes changed the title Retry PDML Transaction on EOS Internal Error fix: retry pdml transaction on EOS internal error Jul 21, 2020
@codecov

Codecov Report

❗ No coverage uploaded for pull request base (master@87f4f13). Click here to learn what that means.
The diff coverage is n/a.

@skuruppu

@thiagotnunes I'm going to hold off on reviewing this until I better understand why this error happens. Hope it's ok to wait a couple of days until the Spanner team has had a chance to look at the issue.

@thiagotnunesthiagotnunes added the do not mergeIndicates a pull request not ready for merge, due to either quality or timing.label Jul 21, 2020

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

Thanks for making this change @thiagotnunes. I would also like @olavloite to take a look before merging.

@thiagotnunesthiagotnunes removed the do not mergeIndicates a pull request not ready for merge, due to either quality or timing.label Jul 21, 2020

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

This looks generally good to me, but could you have a look and see if it's possible to reuse the existing isRetryable method for the internal exception?

private boolean shouldResumeOrRestartTransaction(Exception e) {
return e instanceof UnavailableException
|| (e instanceof InternalException
&& e.getMessage().contains("Received unexpected EOS on DATA frame from server"));
Copy link
Collaborator

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

The same error is also retried for queries, and this check could probably use the existing check in SpannerExceptionFactory:

private static boolean isRetryable(ErrorCode code, @Nullable Throwable cause) {
. It's still just as hacky, but it would keep it confined to one method.

@thiagotnunes

Thanks for the feedback @olavloite, I did not know about that functionality! I will make a change in this PR to use that.

When the exception is an EOS, we should retry the exception.
Re-uses the retry logic applied by the spanner exception factory for
retrying EOS internal exceptions in pdml transactions.
@@ -257,35 +256,8 @@ private static boolean hasCauseMatching(
}

private static class Matchers {
static final Predicate<Throwable> isRetryableInternalError =
Copy link
Contributor Author

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

I extracted these predicates into their own classes that are now exposed to the PartitionedDmlTransaction. This was the cleanest / lowest impact way I could find to de-duplicate the EOS retryable logic.

@thiagotnunes

@olavloite @skuruppu do you mind giving the PR another look? I did the following changes:

  • Extracted predicates from the SpannerExceptionFactory in order to use them in the PartitionedDmlTransaction, thus removing the EOS logic duplication.
  • Changed the PartitionedDmlTransaction class to use the extracted predicate.
  • Added tests for the extracted classes.

public boolean apply(Throwable cause) {
if (isInternalError(cause)) {
if (cause.getMessage().contains(HTTP2_ERROR_MESSAGE)) {
// See b/25451313.
Copy link
Contributor

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

Here and below, if we could remove the references to internal issues that would be great. I don't know why they were there before. Folks working on this repo are unlikely to have access to those so no point exposing them.

Copy link
Contributor Author

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

No worries, will tackle this in a following PR.

@olavloite

@thiagotnunes Thanks for extracting the predicates for the retryable internal errors.
I have no further comments, this looks good to me.

gcf-merge-on-green bot pushed a commit that referenced this pull request Aug 20, 2020
🤖 I have created a release \*beep\* \*boop\* 
---
## [1.60.0](https://www..com/googleapis/java-spanner/compare/v1.59.0...v1.60.0) (2020-08-18)


### Features

* adds clirr check on pre-commit hook ([#388](https://www..com/googleapis/java-spanner/issues/388)) ([bd5c93f](https://www..com/googleapis/java-spanner/commit/bd5c93f045e06372b2235f3d350bade93bff2c24))
* include SQL statement in error message ([#355](https://www..com/googleapis/java-spanner/issues/355)) ([cc5ac48](https://www..com/googleapis/java-spanner/commit/cc5ac48232b6e4550b98d213c5877d6ec37b293f))


### Bug Fixes

* enables emulator tests ([#380](https://www..com/googleapis/java-spanner/issues/380)) ([f61c6d0](https://www..com/googleapis/java-spanner/commit/f61c6d0d332f15826499996a292acc7cbab267a7))
* remove custom timeout and retry settings ([#365](https://www..com/googleapis/java-spanner/issues/365)) ([f6afd21](https://www..com/googleapis/java-spanner/commit/f6afd213430d3f06d9a72c64a5c37172840fed0e))
* remove unused kokoro files ([#367](https://www..com/googleapis/java-spanner/issues/367)) ([6125c7d](https://www..com/googleapis/java-spanner/commit/6125c7d221c77f4c42497b72107627ee09312813))
* retry pdml transaction on EOS internal error ([#360](https://www..com/googleapis/java-spanner/issues/360)) ([a53d736](https://www..com/googleapis/java-spanner/commit/a53d7369bb2a8640ab42e409632b352decbdbf5e))
* sets the project for the integration tests ([#386](https://www..com/googleapis/java-spanner/issues/386)) ([c8fa458](https://www..com/googleapis/java-spanner/commit/c8fa458f5369a09c780ee38ecc09bd2562e8f987))


### Dependencies

* stop auto updates of commons-lang3 ([#362](https://www..com/googleapis/java-spanner/issues/362)) ([8f07ed6](https://www..com/googleapis/java-spanner/commit/8f07ed6b44f9c70f56b9ee2e4505c40385337ca7))
* update dependency com.google.cloud:google-cloud-shared-dependencies to v0.8.6 ([#374](https://www..com/googleapis/java-spanner/issues/374)) ([6f47b8a](https://www..com/googleapis/java-spanner/commit/6f47b8a759643f772230df0c2e153338d44f70ce))
* update dependency org.openjdk.jmh:jmh-core to v1.24 ([#375](https://www..com/googleapis/java-spanner/issues/375)) ([94f568c](https://www..com/googleapis/java-spanner/commit/94f568cf731ba22cac7f0d898d7776a3cc2c178f))
* update dependency org.openjdk.jmh:jmh-core to v1.25 ([#382](https://www..com/googleapis/java-spanner/issues/382)) ([ec7888e](https://www..com/googleapis/java-spanner/commit/ec7888e1d62cf800bf6ad166d242e89443ddc7aa))
* update dependency org.openjdk.jmh:jmh-generator-annprocess to v1.25 ([#376](https://www..com/googleapis/java-spanner/issues/376)) ([8ffdc48](https://www..com/googleapis/java-spanner/commit/8ffdc481e15901f78eac592bd8d4bef33ac3378a))
---


This PR was generated with [Release Please](https://.com/googleapis/release-please).
thiagotnunes pushed a commit to thiagotnunes/java-spanner that referenced this pull request Jun 5, 2021
🤖 I have created a release \*beep\* \*boop\* 
---
## [1.60.0](https://www..com/googleapis/java-spanner/compare/v1.59.0...v1.60.0) (2020-08-18)


### Features

* adds clirr check on pre-commit hook ([googleapis#388](https://www..com/googleapis/java-spanner/issues/388)) ([bd5c93f](https://www..com/googleapis/java-spanner/commit/bd5c93f045e06372b2235f3d350bade93bff2c24))
* include SQL statement in error message ([googleapis#355](https://www..com/googleapis/java-spanner/issues/355)) ([cc5ac48](https://www..com/googleapis/java-spanner/commit/cc5ac48232b6e4550b98d213c5877d6ec37b293f))


### Bug Fixes

* enables emulator tests ([googleapis#380](https://www..com/googleapis/java-spanner/issues/380)) ([f61c6d0](https://www..com/googleapis/java-spanner/commit/f61c6d0d332f15826499996a292acc7cbab267a7))
* remove custom timeout and retry settings ([googleapis#365](https://www..com/googleapis/java-spanner/issues/365)) ([f6afd21](https://www..com/googleapis/java-spanner/commit/f6afd213430d3f06d9a72c64a5c37172840fed0e))
* remove unused kokoro files ([googleapis#367](https://www..com/googleapis/java-spanner/issues/367)) ([6125c7d](https://www..com/googleapis/java-spanner/commit/6125c7d221c77f4c42497b72107627ee09312813))
* retry pdml transaction on EOS internal error ([googleapis#360](https://www..com/googleapis/java-spanner/issues/360)) ([a53d736](https://www..com/googleapis/java-spanner/commit/a53d7369bb2a8640ab42e409632b352decbdbf5e))
* sets the project for the integration tests ([googleapis#386](https://www..com/googleapis/java-spanner/issues/386)) ([c8fa458](https://www..com/googleapis/java-spanner/commit/c8fa458f5369a09c780ee38ecc09bd2562e8f987))


### Dependencies

* stop auto updates of commons-lang3 ([googleapis#362](https://www..com/googleapis/java-spanner/issues/362)) ([8f07ed6](https://www..com/googleapis/java-spanner/commit/8f07ed6b44f9c70f56b9ee2e4505c40385337ca7))
* update dependency com.google.cloud:google-cloud-shared-dependencies to v0.8.6 ([googleapis#374](https://www..com/googleapis/java-spanner/issues/374)) ([6f47b8a](https://www..com/googleapis/java-spanner/commit/6f47b8a759643f772230df0c2e153338d44f70ce))
* update dependency org.openjdk.jmh:jmh-core to v1.24 ([googleapis#375](https://www..com/googleapis/java-spanner/issues/375)) ([94f568c](https://www..com/googleapis/java-spanner/commit/94f568cf731ba22cac7f0d898d7776a3cc2c178f))
* update dependency org.openjdk.jmh:jmh-core to v1.25 ([googleapis#382](https://www..com/googleapis/java-spanner/issues/382)) ([ec7888e](https://www..com/googleapis/java-spanner/commit/ec7888e1d62cf800bf6ad166d242e89443ddc7aa))
* update dependency org.openjdk.jmh:jmh-generator-annprocess to v1.25 ([googleapis#376](https://www..com/googleapis/java-spanner/issues/376)) ([8ffdc48](https://www..com/googleapis/java-spanner/commit/8ffdc481e15901f78eac592bd8d4bef33ac3378a))
---


This PR was generated with [Release Please](https://.com/googleapis/release-please).
ansh0l pushed a commit to ansh0l/java-spanner that referenced this pull request Nov 10, 2022
This is an auto-generated regeneration of the .pb.go files by
cloud.google.com/go/internal/gapicgen. Once this PR is submitted, genmgr will
update the corresponding CL at gocloud to depend on the newer version of
go-genproto, and assign reviewers. Whilst this or any regen PR is open in
go-genproto, gapicgen will not create any more regeneration PRs or CLs. If all
regen PRs are closed, gapicgen will create a new set of regeneration PRs and
CLs once per night.

If you have been assigned to review this CL, please:

- Ensure that CI is passing. If it's failing, it requires your manual attention.
- Approve and submit this PR if you believe it's ready to ship. That will prompt
  genmgr to assign reviewers to the gocloud CL.

Corresponding gocloud CL: https://code-review.googlesource.com/c/gocloud/+/56190
rajatbhatta pushed a commit to rajatbhatta/java-spanner that referenced this pull request Nov 17, 2022
This PR was generated using Autosynth. 🌈

Synth log will be available here:
https://source.cloud.google.com/results/invocations/0a285685-944f-4f44-84fb-761c5c361020/targets

- [ ] To automatically regenerate this PR, check this box.

Source-Link: googleapis/synthtool@b416a7b
Sign up for free to join this conversation on . Already have an account? Sign in to comment
cla: yesThis human has signed the Contributor License Agreement.
None yet

Successfully merging this pull request may close these issues.