Labels
GroupbyFunctionality that used to work in a prior pandas version
Milestone

Comments

@jorisvandenbossche

Does anybody know if this was an intentional change? (I don't directly find something about it in the whatsnew)

In [9]: pd.__version__ 
Out[9]: '1.0.5'

In [10]: df = pd.DataFrame({"key": pd.Categorical(["b"]*5, categories=["a", "b", "c", "d"]), "col": range(5)}) 

In [11]: gb = df.groupby("key")   

In [12]: list(gb.indices)  
Out[12]: ['a', 'b', 'c', 'd']

vs

In [1]: pd.__version__
Out[1]: '1.3.0.dev0+92.ga2d10ba88a'

In [2]: df = pd.DataFrame({"key": pd.Categorical(["b"]*5, categories=["a", "b", "c", "d"]), "col": range(5)})

In [3]: gb = df.groupby("key")

In [4]: list(gb.indices)
Out[4]: ['b']

This already changed in pandas 1.1, so not a recent change.

The consequence of this is that iterating over gb vs iterating over gb.indices is not consistent anymore.

cc @mroeschke @rhshadrach

@jorisvandenbosschejorisvandenbossche added Groupby RegressionFunctionality that used to work in a prior pandas versionlabels Dec 22, 2020
@jorisvandenbossche

So for example, those two APIs still return all values (both for pandas 1.0 and master):

In [6]: gb.groups
Out[6]: {'a': [], 'b': [0, 1, 2, 3, 4], 'c': [], 'd': []}

In [7]: [key for key, group in gb]
Out[7]: ['a', 'b', 'c', 'd']

So it seems .indices should be consistent with it?

@mroeschke

This may have been unintentionally changed by me in https://.com/pandas-dev/pandas/pull/36911/files

@phofl

I looked into this, I think the new case is more consistent maybe?
on 1.0.5 and master we get:

df = pd.DataFrame({"key": pd.Categorical(["b"]*5 + ["c"], categories=["a", "b", "c", "d"]), "key1": [1,1,1,2,2,3], "col": range(6)})

gb = df.groupby(["key", "key1"])
gb.groups
gb.indices

returned

{('b', 1): Int64Index([0, 1, 2], dtype='int64'), ('b', 2): Int64Index([3, 4], dtype='int64'), ('c', 3): Int64Index([5], dtype='int64')}
{('b', 1): array([0, 1, 2]), ('b', 2): array([3, 4]), ('c', 3): array([5])}

While a one dimensional group key returned what you showed above. The missing categories case would be tricky to handle with multidimensional keys. Maybe it would be better to remove unused categories from groups too? Or should the one-dimensional case be special here?

@phofl

@mroeschke no that was not the reason. I think this was caused by c4226d4

@mroeschke

Thanks for confirming @phofl

@phofl

Addition: We are no longer running through there since #36842

@jorisvandenbossche

@phofl thanks for looking at it!

I looked into this, I think the new case is more consistent maybe?

Indeed for multiple keys, we seem to not include unobserved categories. But, here both .indices as .groups (and GroupBy.__iter__ do that), so they are also consistent with each other.
While for a single key, while maybe inconsistent or not with the multiple key case, it is now inconsistent between .indices and .groups.

So to fully make it consistent, then for example also .groups and GroupBy.__iter__ should change for the single key case. But that would also be a breaking change ..
And I am not fully sure that is necessarily the good change, since we actually include the unobserved categories in the output if you do eg an aggregation on the groupby object, so then I would expect to include those empty groups as well when iterating over the groupby object?

@jsignell

Passing observed has no impact on .indices, but maybe it should? I think this behavior would not be surprising:

>>>df = pd.DataFrame({"key": pd.Categorical(["b"]*5, categories=["a", "b", "c", "d"]), "col": range(5)})
>>> gb = df.groupby("key", observed=True)
>>> list(gb.indices)
['b']
>>> gb = df.groupby("key", observed=False)
>>> list(gb.indices)
['a', b', 'c', 'd']

@phofl

Have to correct myself, this was changed by #36842

@jorisvandenbossche When testing this on 1.1.0 and 1.1.5 I get

{'a': [], 'b': [0, 1, 2, 3, 4], 'c': [5], 'd': []}
{'b': array([0, 1, 2, 3, 4]), 'c': array([5])}

both times.

Edit: Changed the example a bit.

df = pd.DataFrame({"key": pd.Categorical(["b"]*5 + ["c"], categories=["a", "b", "c", "d"]), "col": range(6)})

@jorisvandenbossche

I think the pointer of @mroeschke to #36911 might be more correct, since that was a PR for 1.1.4, while #36842 only for 1.2.0.

And unlike what I said earlier (I thought it was only working on 1.0, and not in 1.1.x), this actually only changed from 1.1.3 to 1.1.4.

simonjayhawkins added a commit to simonjayhawkins/pandas that referenced this issue Dec 23, 2020
@simonjayhawkins

I think the pointer of @mroeschke to #36911 might be more correct, since that was a PR for 1.1.4, while #36842 only for 1.2.0.

can confirm, first bad commit: [345efdd] BUG: RollingGroupby not respecting sort=False (#36911)

@phofl

Hm just looked at the pr numbers, not when they were merged.

Nevertheless, we have to change both commits to get the original result, because the code path from #36911 is currently not used on master.

Sign up for free to join this conversation on . Already have an account? Sign in to comment
RegressionFunctionality that used to work in a prior pandas version
None yet

Successfully merging a pull request may close this issue.