Closed
@msk-midas

Description

This has been communicated with google cloud support, case 26471342, and they have requested I open an issue here

Environment details

  1. API: Spanner
  2. OS type and version: Debian 8
  3. Java version: 1.8.0_162
  4. google-cloud-java version(s): 3.3.0 (also reproduced on 3.0.5)

Steps to reproduce

We have a loop running in one of our coorporate test VMs that updates a row for a "last checkin made by machine" every ten seconds. Let it run and randomly encounter non-recoverable freezes. (the longest one we let run was over the holidays, 20 days frozen thread). Reproduced on two separate machines multiple times, average executions before encountering freeze is around 3000 (sometimes 800, sometimes 6000). On one third machine we have not encountered it at all (14000 successful attempts and counting). The data written is minimal, essentially just name and timestamp. Normal communication time is a question of milliseconds, so ten seconds is ample time between calls.

Code example

rough pseudo (this is part of a very large framework, but this is the item it happens on - regular scheduled executor service at 10 second ticks to run, fixed rate not fixed delay)

TransactionRunner transactionRunner = databaseClient.readWriteTransaction();
transactionRunner.run(transaction -> {
                transaction.readRow(tableName, Key.of(getHashForId(id), getTableId(), id, sid), Collections.singleton("json"));
                }

Thread dump

"com.king.spanner.Availability_monitor-1" googleapis/google-cloud-java#91 prio=5 os_prio=0 tid=0x00007f3a00d42800 nid=0x57d4 waiting on condition [0x00007f39984a1000]
   java.lang.Thread.State: WAITING (parking)
	at sun.misc.Unsafe.park(Native Method)
	- parking to wait for  <0x00000000f272eff0> (a com.google.api.core.AbstractApiFuture$InternalSettableFuture)
	at java.util.concurrent.locks.LockSupport.park(LockSupport.java:175)
	at com.google.common.util.concurrent.AbstractFuture.get(AbstractFuture.java:537)
	at com.google.api.core.AbstractApiFuture.get(AbstractApiFuture.java:56)
	at com.google.cloud.spanner.TransactionRunnerImpl$TransactionContextImpl.getTransactionSelector(TransactionRunnerImpl.java:437)
	at com.google.cloud.spanner.AbstractReadContext$2.startStream(AbstractReadContext.java:739)
	at com.google.cloud.spanner.AbstractResultSet$ResumableStreamIterator.computeNext(AbstractResultSet.java:1020)
	at com.google.cloud.spanner.AbstractResultSet$ResumableStreamIterator.computeNext(AbstractResultSet.java:898)
	at com.google.common.collect.AbstractIterator.tryToComputeNext(AbstractIterator.java:141)
	at com.google.common.collect.AbstractIterator.hasNext(AbstractIterator.java:136)
	at com.google.cloud.spanner.AbstractResultSet$GrpcValueIterator.ensureReady(AbstractResultSet.java:260)
	at com.google.cloud.spanner.AbstractResultSet$GrpcValueIterator.getMetadata(AbstractResultSet.java:236)
	at com.google.cloud.spanner.AbstractResultSet$GrpcResultSet.next(AbstractResultSet.java:118)
	at com.google.cloud.spanner.AbstractReadContext.consumeSingleRow(AbstractReadContext.java:755)
	at com.google.cloud.spanner.AbstractReadContext.readRow(AbstractReadContext.java:457)

Any additional information below

We have repeatedly reproduced this over the regular public APIs, but we have the impression that it happens more often (for reproduction) if you add

Builder builder = SpannerOptions.newBuilder();
builder.setHost("https://restricted.googleapis.com");
builder.setInterceptorProvider(() -> Collections.singletonList(new ClientInterceptor() {
                public <ReqT, RespT> ClientCall<ReqT, RespT> interceptCall(MethodDescriptor<ReqT, RespT> method, CallOptions callOptions, Channel next) {
                    return next.newCall(method, callOptions.withAuthority("spanner.googleapis.com"));
                }
            }));