Add this suggestion to a batch that can be applied as a single commit. This suggestion is invalid because no changes were made to the code. Suggestions cannot be applied while the pull request is closed. Suggestions cannot be applied while viewing a subset of changes. Only one suggestion per line can be applied in a batch. Add this suggestion to a batch that can be applied as a single commit. Applying suggestions on deleted lines is not supported. You must change the existing code in this line in order to create a valid suggestion. Outdated suggestions cannot be applied. This suggestion has been applied or marked resolved. Suggestions cannot be applied from pending reviews. Suggestions cannot be applied on multi-line comments. Suggestions cannot be applied while the pull request is queued to merge. Suggestion cannot be applied right now. Please check back later.
doc/source/whatsnew/vX.X.X.rst
file if fixing a bug or adding a new feature.The existing implementation does a lexsort over (values, ids) for each column which gets very expensive. By using grouper._get_splitter()._sorted_data, we only sort by ids once, then we cheaply iterate over groups and do group-by-group argsorts. This is roughly equivalent to
I tried an implementation that used _python_apply_general and ripped out the cython group_quantile entirely and found it had some rough edges with a) axis=1, b) dtypes where GroupBy.quantile behaves different from DataFrame.quantile (xref #51424), and it performed poorly as ngroups becomes big. With a large number of rows though, that performs better than this. I speculate that np.percentile is doing something more efficiently than our cython group_quantile, but haven't figured it out yet.
Some timings!
Did this for each of main, this PR, and the pure-python implementation described above. The results:
on main I had to interrupt (with ctl-z, not ctl-c!) the largest cases.
This out-performs main in all cases. The pure-python version out-performs this in small-ngroups cases and large-nrows cases, but suffers pretty dramatically in the opposite cases.
Potential caveats:
Potential downsides vs main:
We could plausibly keep multiple implementations and dis based on sizes, but im not sure groupby.quantile is important enough to really merit that. So which version to use really comes down to what sized cases we think are the most common.