REGR: GroupBy.indices no longer includes unobserved categories · Issue #38642 · pandas-dev/pandas

@mroeschke

Does anybody know if this was an intentional change? (I can't find anything about it in the whatsnew.)

In [9]: pd.__version__ 
Out[9]: '1.0.5'

In [10]: df = pd.DataFrame({"key": pd.Categorical(["b"]*5, categories=["a", "b", "c", "d"]), "col": range(5)}) 

In [11]: gb = df.groupby("key")   

In [12]: list(gb.indices)  
Out[12]: ['a', 'b', 'c', 'd']

vs

In [1]: pd.__version__
Out[1]: '1.3.0.dev0+92.ga2d10ba88a'

In [2]: df = pd.DataFrame({"key": pd.Categorical(["b"]*5, categories=["a", "b", "c", "d"]), "col": range(5)})

In [3]: gb = df.groupby("key")

In [4]: list(gb.indices)
Out[4]: ['b']

This already changed in pandas 1.1, so not a recent change.

The consequence of this is that iterating over gb vs iterating over gb.indices is not consistent anymore.

cc @mroeschke @rhshadrach


So for example, those two APIs still return all values (both for pandas 1.0 and master):

In [6]: gb.groups
Out[6]: {'a': [], 'b': [0, 1, 2, 3, 4], 'c': [], 'd': []}

In [7]: [key for key, group in gb]
Out[7]: ['a', 'b', 'c', 'd']

So it seems .indices should be consistent with it?
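To make the comparison above easy to repeat on any pandas version, here is a minimal sketch that collects the group keys reported by `.groups`, by iteration, and by `.indices` side by side. The helper name `group_key_views` is hypothetical (not part of pandas), and `observed` is passed explicitly because its default differs between versions. What gets printed depends on the installed pandas version, so no output is shown.

```python
import pandas as pd


def group_key_views(df, key, observed):
    """Collect the group keys reported by three GroupBy access paths.

    Hypothetical helper for illustration; not a pandas API.
    """
    gb = df.groupby(key, observed=observed)
    return {
        "groups": list(gb.groups),            # keys of the .groups mapping
        "iter": [name for name, _ in gb],     # keys seen when iterating
        "indices": list(gb.indices),          # keys of the .indices mapping
    }


df = pd.DataFrame(
    {
        "key": pd.Categorical(["b"] * 5, categories=["a", "b", "c", "d"]),
        "col": range(5),
    }
)

# Print each view so any disagreement between them is visible at a glance.
for name, keys in group_key_views(df, "key", observed=False).items():
    print(name, keys)
```

If the three lists differ for `observed=False`, that is the inconsistency described in this issue.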

This may have been unintentionally changed by me in https://.com/pandas-dev/pandas/pull/36911/files

I looked into this, I think the new case is more consistent maybe?
On both 1.0.5 and master, this multi-key example:

df = pd.DataFrame({"key": pd.Categorical(["b"]*5 + ["c"], categories=["a", "b", "c", "d"]), "key1": [1,1,1,2,2,3], "col": range(6)})

gb = df.groupby(["key", "key1"])
gb.groups
gb.indices

returned

{('b', 1): Int64Index([0, 1, 2], dtype='int64'), ('b', 2): Int64Index([3, 4], dtype='int64'), ('c', 3): Int64Index([5], dtype='int64')}
{('b', 1): array([0, 1, 2]), ('b', 2): array([3, 4]), ('c', 3): array([5])}

A one-dimensional group key, by contrast, returned what you showed above. The missing-categories case would be tricky to handle with multidimensional keys. Maybe it would be better to remove unused categories from groups too? Or should the one-dimensional case be special here?

@mroeschke

@mroeschke no that was not the reason. I think this was caused by c4226d4

@phofl

Thanks for confirming @phofl

Addition: that code path is no longer exercised since #36842.

@phofl

@phofl thanks for looking at it!

I looked into this, I think the new case is more consistent maybe?

Indeed, for multiple keys we do not seem to include unobserved categories. But in that case .indices, .groups, and GroupBy.__iter__ all behave the same way, so they are consistent with each other.
For a single key, whether or not that matches the multiple-key case, .indices is now inconsistent with .groups.

So to make it fully consistent, .groups and GroupBy.__iter__ would also have to change for the single-key case. But that would be a breaking change as well.
And I am not sure that would necessarily be a good change: we do include the unobserved categories in the output of, for example, an aggregation on the groupby object, so I would expect those empty groups to be included when iterating over the groupby object too.
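The aggregation behavior mentioned here can be checked with a short sketch using the data from the original report. `observed=False` is passed explicitly as an assumption, so the result does not depend on the version's default.

```python
import pandas as pd

df = pd.DataFrame(
    {
        "key": pd.Categorical(["b"] * 5, categories=["a", "b", "c", "d"]),
        "col": range(5),
    }
)

# With observed=False, the aggregation result has one row per category,
# including the categories that never occur in the data.
totals = df.groupby("key", observed=False)["col"].sum()
print(totals)
```

The unobserved categories 'a', 'c', and 'd' appear in the result with a sum of 0, which is the argument for also yielding them as empty groups during iteration.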

Passing observed has no impact on .indices, but maybe it should? I think this behavior would not be surprising:

>>> df = pd.DataFrame({"key": pd.Categorical(["b"]*5, categories=["a", "b", "c", "d"]), "col": range(5)})
>>> gb = df.groupby("key", observed=True)
>>> list(gb.indices)
['b']
>>> gb = df.groupby("key", observed=False)
>>> list(gb.indices)
['a', 'b', 'c', 'd']
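Until `observed` affects `.indices` in this way, code that needs an entry for every category regardless of pandas version could pad the result itself. The following is only a sketch of that idea: `indices_with_all_categories` is a hypothetical helper, not a pandas function, and it assumes a single categorical key column.

```python
import numpy as np
import pandas as pd


def indices_with_all_categories(df, key):
    """Return {category: positions} with empty arrays for unobserved categories.

    Hypothetical helper; assumes df[key] has a categorical dtype.
    """
    # observed=True yields only the categories that occur in the data.
    observed = df.groupby(key, observed=True).indices
    empty = np.array([], dtype=np.intp)
    # Walk the declared categories in order and fill in the missing ones.
    return {cat: observed.get(cat, empty) for cat in df[key].cat.categories}


df = pd.DataFrame(
    {
        "key": pd.Categorical(["b"] * 5, categories=["a", "b", "c", "d"]),
        "col": range(5),
    }
)

padded = indices_with_all_categories(df, "key")
print(list(padded))
```

Starting from the `observed=True` result keeps the helper independent of how a given version treats unobserved categories in `.indices`.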

@jorisvandenbossche

Have to correct myself, this was changed by #36842

@jorisvandenbossche When testing this on 1.1.0 and 1.1.5 I get

{'a': [], 'b': [0, 1, 2, 3, 4], 'c': [5], 'd': []}
{'b': array([0, 1, 2, 3, 4]), 'c': array([5])}

both times.

Edit: Changed the example a bit.

df = pd.DataFrame({"key": pd.Categorical(["b"]*5 + ["c"], categories=["a", "b", "c", "d"]), "col": range(6)})

@mroeschke

I think the pointer of @mroeschke to #36911 might be more correct, since that was a PR for 1.1.4, while #36842 only for 1.2.0.

And unlike what I said earlier (I thought it was only working on 1.0, and not in 1.1.x), this actually only changed from 1.1.3 to 1.1.4.

@mroeschke

I think the pointer of @mroeschke to #36911 might be more correct, since that was a PR for 1.1.4, while #36842 only for 1.2.0.

can confirm, first bad commit: [345efdd] BUG: RollingGroupby not respecting sort=False (#36911)

Hm, I just looked at the PR numbers, not at when they were merged.

Nevertheless, we have to change both commits to get the original result, because the code path from #36911 is currently not used on master.

jorisvandenbossche added the Groupby and Regression (functionality that used to work in a prior pandas version) labels Dec 22, 2020

jorisvandenbossche mentioned this issue Dec 22, 2020
Test pandas 1.1.x / 1.2.0 releases and pandas nightly dask/dask#6996
Merged

phofl mentioned this issue Dec 22, 2020
BUG: Fix regression for groupby.indices in case of unused categories #38649
Merged
4 tasks

simonjayhawkins added a commit to simonjayhawkins/pandas that referenced this issue Dec 23, 2020
code sample for pandas-dev#38642

bfc2da2

jorisvandenbossche added this to the 1.2.1 milestone Dec 28, 2020

jreback closed this as completed in #38649 Dec 29, 2020

ggold7046 mentioned this issue Aug 10, 2023
Modified doc/make.py to run sphinx-build -b linkcheck #54265
Merged
5 tasks
