Conversation

sylvain-8422

Add .batch_size(batch_size) to #__find_in_batches (Mongoid).

Fixes #1037.

Although .each_slice(batch_size) is useful to limit how many documents are sent to Elasticsearch at a time, it does not limit the batch size of MongoDB's getMore commands.

By default, iterating over a MongoDB collection first returns a batch of 101 documents, then subsequent batches of up to 16 MiB:

https://www.mongodb.com/docs/manual/tutorial/iterate-a-cursor/#cursor-batches

For example, for a collection whose documents average 1 KiB, a single batch can contain roughly 16,384 documents (16 MiB / 1 KiB).

Although Mongoid's documentation claims a default batch size of 1,000 documents, this does not appear to be the case in practice.
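One way to verify this is to subscribe to the Ruby driver's command monitoring and log the batchSize that find and getMore commands actually carry. A minimal sketch, not part of this PR, assuming a standard Mongoid setup (Article is a hypothetical model):

```ruby
# Logs the batchSize carried by find/getMore commands. Global subscribers
# only apply to clients created after subscribing, so run this before
# Mongoid opens its connection.
class BatchSizeLogger
  def started(event)
    return unless %w[find getMore].include?(event.command_name)
    puts "#{event.command_name} batchSize=#{event.command['batchSize'].inspect}"
  end

  def succeeded(_event); end

  def failed(_event); end
end

Mongo::Monitoring::Global.subscribe(Mongo::Monitoring::COMMAND, BatchSizeLogger.new)

Article.all.each { |_doc| } # prints "find batchSize=nil" if no default is applied
```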

Also, Mongoid's .no_timeout is currently broken and does nothing:

mongodb/mongo-ruby-driver#2557

As a result, more than 10 minutes can now elapse between two getMore commands, causing the MongoDB cursor to expire.

Adding .batch_size(batch_size) to the query makes sure that MongoDB documents are retrieved at the same rate as they are processed and indexed in Elasticsearch, and allows applications affected by the .no_timeout issue to reduce the batch size to avoid cursor timeouts.
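For reference, a simplified sketch of the patched method; this paraphrases the idea rather than quoting the exact diff, and the handling of the other import options is omitted:

```ruby
# Sketch of the patched adapter method (option handling omitted).
def __find_in_batches(options = {}, &block)
  batch_size = options[:batch_size] || 1_000

  # .batch_size caps each getMore reply at batch_size documents on the
  # server side; .each_slice still controls how many documents are sent
  # to Elasticsearch per bulk request.
  all.batch_size(batch_size).no_timeout.each_slice(batch_size, &block)
end
```

An application affected by the .no_timeout issue can then lower the batch size when importing, e.g. Article.import(batch_size: 200) (hypothetical model), trading bulk-request size for shorter intervals between getMore commands.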

sylvain-8422

@shashankjo

Same simple change as before, but I fixed the conflict created by whitespace changes in main.

Force-pushed from ef8985e to aa38a1b.


Successfully merging this pull request may close these issues.

Batch size is ignored when fetching documents from MongoDB