Conversation
Is this compatible with the minimal pyarrow version we are supporting?
Good point, I hadn't considered this. No - afaict this code requires pyarrow 3.0. I understand this performance issue alone may not be sufficient reason to bump a version, but, in general, what would the requirements for that be? There's just 5 months between the two releases.
I think it's possible to implement something that's already a little better than what's on 1.4 without requiring pyarrow 3.
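For illustration, a minimal sketch of one such pre-pyarrow-3 fallback, assuming only per-chunk `Array.to_numpy` (the helper name `chunks_to_numpy` is hypothetical, not from this PR): convert each chunk to numpy individually and let numpy do the concatenation.

```python
# Hypothetical fallback sketch, not the PR's final code: avoids newer
# pyarrow APIs by converting chunk by chunk and concatenating in numpy.
import numpy as np
import pyarrow as pa


def chunks_to_numpy(chunked: pa.ChunkedArray) -> np.ndarray:
    if chunked.num_chunks == 0:
        return np.array([], dtype=object)
    # zero_copy_only=False is required for string chunks, which cannot
    # be viewed as numpy arrays without copying.
    return np.concatenate(
        [chunk.to_numpy(zero_copy_only=False) for chunk in chunked.chunks]
    )
```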
This pull request is stale because it has been open for thirty days with no activity. Please update and respond to this comment if you're still interested in working on this.
Excellent, I'll revisit this soon! Edit: I recently found that pyarrow's …
I think this is ready then - I just changed the whatsnew entry; the code change stays the same. I also briefly checked changes I thought might improve performance across the other dtypes, but they didn't. There may be some parts where one could switch to pyarrow concatenation, but those that I checked (integer/numeric) didn't yield performance improvements (and may incur some memory overhead).
@timlod there is a merge conflict here, but since the rc is now cut, this would probably need the release note moved to 2.1.
Thanks for sticking with this @timlod
Instead of converting each chunk to a `StringArray` after casting to array and then concatenating, use pyarrow to concatenate the chunks and convert to numpy.
Then bypass the validation logic (unneeded, as the data was validated on parquet write) by initializing `NDArrayBacked` instead of `StringArray`, as in the sketch below.
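A minimal sketch of the idea, assuming pandas' private `NDArrayBacked` API and a python-storage `StringArray` (internal details that may change between versions; `chunked_to_string_array` is a hypothetical helper, not the actual diff):

```python
# Sketch only: relies on pandas-internal NDArrayBacked, which is
# private API and may change between pandas versions.
import pandas as pd
import pyarrow as pa
from pandas._libs.arrays import NDArrayBacked


def chunked_to_string_array(chunked: pa.ChunkedArray) -> pd.arrays.StringArray:
    # Let pyarrow concatenate the chunks and convert to numpy in one go
    # (object dtype; nulls come out as None).
    np_arr = chunked.to_numpy()
    # None -> NA conversion, matching StringArray's missing-value sentinel.
    np_arr[pd.isna(np_arr)] = pd.NA
    # Bypass StringArray's validation: the values are already known to
    # be strings or NA, so initialize the backing ndarray directly.
    result = pd.arrays.StringArray.__new__(pd.arrays.StringArray)
    NDArrayBacked.__init__(result, np_arr, pd.StringDtype(storage="python"))
    return result
```

For example, `chunked_to_string_array(pa.chunked_array([["a", None], ["b"]]))` yields a `StringArray` of `['a', <NA>, 'b']`; the direct `NDArrayBacked.__init__` call is what skips the element-wise validation the `StringArray` constructor would otherwise run.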
This removes most of the performance overhead seen in #47345. There is still a slight overhead compared to `object` string arrays because of the None -> NA conversion. I found that leaving that out still results in NA types in the example I gave (and would actually improve performance over the `object` case), but this is not consistent, and thus the conversion is left in.

- closes #47345 (`use_nullable_dtypes=True` in `read_parquet` slows performance on large dataframes)
- Added an entry in the latest `doc/source/whatsnew/vX.X.X.rst` file if fixing a bug or adding a new feature.
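For reference, a rough way to reproduce the comparison from #47345 (sizes and file name are illustrative, not taken from the issue; `use_nullable_dtypes` was the relevant `read_parquet` flag at the time, later superseded by `dtype_backend`):

```python
# Illustrative benchmark sketch; data sizes and file name are made up.
import timeit

import pandas as pd

df = pd.DataFrame({"col": ["a", "bb", None] * 1_000_000})
df.to_parquet("strings.parquet")

# Default path: strings come back as object dtype.
t_object = timeit.timeit(
    lambda: pd.read_parquet("strings.parquet"), number=5
)
# Nullable path: strings come back as StringDtype -- the code path
# this PR optimizes.
t_nullable = timeit.timeit(
    lambda: pd.read_parquet("strings.parquet", use_nullable_dtypes=True),
    number=5,
)
print(f"object: {t_object:.2f}s, nullable: {t_nullable:.2f}s")
```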