Conversation

SanjithChockan

DataFrame.attrs does not seem to be attached to the pyarrow schema metadata, since it is experimental. I added it to the pyarrow table (schema.metadata) so that it persists when the parquet file is read back. One issue I am facing: the DataFrame.attrs dictionary can't have int keys, because encoding it for pyarrow converts them to strings. I'm not sure whether this is a problem.

@SanjithChockan changed the title from "Parquet metadata persistence" to "Parquet metadata persistence of DataFrame.attrs" Aug 1, 2023
@xiki-tempula

On one hand, this is not ideal, as the round trip is not complete. On the other hand, in the current implementation, a pandas DataFrame column name that is not a string is converted to a string when writing to a parquet file, so this is still consistent with other behaviour.

Similar to the existing caveat in the docs ("Duplicate column names and non-string columns names are not supported."), we just need to add a note saying that non-string attrs keys are not supported and will be converted to strings.

@mroeschke added the IO Parquet (parquet, feather) and metadata (_metadata, .attrs) labels Aug 1, 2023
@SanjithChockan

"On one hand, this is not ideal as the round trip is not complete."
I'm not sure I understand. Could you explain this?

@xiki-tempula

"One issue I am facing is DataFrame.attrs dictionary can't have int as keys as encoding it for pyarrow converts it to string, but not sure if this is a problem."

So I would imagine that if I do

import pandas as pd

df = pd.DataFrame({"a": [1]})
df.attrs = {1: 1}
df.to_parquet('test.p')
new_df = pd.read_parquet('test.p')

new_df.attrs ({'1': 1}) would not be the same as df.attrs ({1: 1}). This is what I mean by round trip: you write to a parquet file and then read it back.
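
The key coercion can be reproduced without parquet at all. The sketch below assumes the attrs are serialized as JSON before being stored, which would explain the behaviour described; JSON object keys are always strings.

```python
import json

original = {1: 1}

# JSON object keys must be strings, so the int key is coerced on encode.
encoded = json.dumps(original)
restored = json.loads(encoded)

print(encoded)               # {"1": 1}
print(restored)              # {'1': 1}
print(restored == original)  # False
```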

@SanjithChockan

Oh yeah. I can't really think of another approach other than casting keys back to int where possible, but that doesn't seem feasible.
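
A short sketch of why casting keys back is unreliable: once the keys have been stringified, an original int key and an original digit-only string key are indistinguishable. ``restore_int_keys`` is a hypothetical helper written for this illustration, and the JSON round trip is an assumed stand-in for the parquet write and read.

```python
import json


def restore_int_keys(attrs):
    """Hypothetical helper: cast keys that look like integers back to int."""
    restored = {}
    for key, value in attrs.items():
        try:
            restored[int(key)] = value
        except ValueError:
            # Not an integer-like string, keep the key unchanged.
            restored[key] = value
    return restored


int_keyed = {1: "a"}
str_keyed = {"1": "a"}

# Both dictionaries serialize to the same JSON text...
from_int = restore_int_keys(json.loads(json.dumps(int_keyed)))
from_str = restore_int_keys(json.loads(json.dumps(str_keyed)))

# ...so the cast repairs one case and silently corrupts the other.
print(from_int == int_keyed)  # True
print(from_str == str_keyed)  # False
```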

@xiki-tempula

@SanjithChockan I think it would be fine. The attrs keys are not the only thing that gets converted to strings when saving to a parquet file.

@SanjithChockan

I'll fix the failing checks and wait for someone else to review, to see if this approach is okay.

@@ -176,8 +176,8 @@ Other enhancements
- Performance improvement in :func:`concat` with homogeneous ``np.float64`` or ``np.float32`` dtypes (:issue:`52685`)
- Performance improvement in :meth:`DataFrame.filter` when ``items`` is given (:issue:`52941`)
- Reductions :meth:`Series.argmax`, :meth:`Series.argmin`, :meth:`Series.idxmax`, :meth:`Series.idxmin`, :meth:`Index.argmax`, :meth:`Index.argmin`, :meth:`DataFrame.idxmax`, :meth:`DataFrame.idxmin` are now supported for object-dtype objects (:issue:`4279`, :issue:`18021`, :issue:`40685`, :issue:`43697`)
- Added ``PANDAS_ATTRS`` to :attr:`Schema.metadata` in :class:`PyArrowImpl` for parquet metadata persistence using pyarrow engine (:issue:`54346`)
Member

Suggested change
- Added ``PANDAS_ATTRS`` to :attr:`Schema.metadata` in :class:`PyArrowImpl` for parquet metadata persistence using pyarrow engine (:issue:`54346`)
- :meth:`DataFrame.to_parquet` and :func:`read_parquet` will now write and read ``attrs`` respectively (:issue:`54346`)

Contributor Author

done!

@mroeschke added this to the 2.1 milestone Aug 3, 2023
@mroeschke merged commit 152595c into pandas-dev:main Aug 3, 2023
@mroeschke

Thanks @SanjithChockan

@martindurant

fastparquet here - would have appreciated at least some notification of this.

dask/fastparquet#900

@mroeschke

Ah sorry @martindurant. The original request mentioned pyarrow so fastparquet slipped the review. Happy to have PRs to add this to the fastparquet engine too!

@aufdenkampe

@mroeschke, does this PR include the ability to read/write column-level metadata from pandas.Series.attrs? This feature is particularly important to environmental data scientists who need to keep track of column/variable metadata fields such as:

  • long name
  • description
  • units

It's also a data sharing requirement of FAIR Data Principles.

@mroeschke

I believe not, no. This PR only supports attrs from DataFrame.
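
Since only DataFrame-level attrs are covered, one possible workaround is to nest per-column metadata inside ``DataFrame.attrs``. The sketch below is an assumption, not something this PR provides: the ``"columns"`` key and the field names are a hypothetical convention, and a JSON round trip stands in for the parquet write and read.

```python
import json

import pandas as pd

df = pd.DataFrame({"temp": [11.5, 12.1], "flow": [0.8, 0.9]})

# Hypothetical convention: keep per-column metadata under a "columns" key,
# using string keys only so nothing is coerced on serialization.
df.attrs = {
    "columns": {
        "temp": {"long_name": "Water temperature", "units": "degC"},
        "flow": {"long_name": "Stream discharge", "units": "m3/s"},
    }
}

# Assumed stand-in for the parquet round trip: encode and decode as JSON.
restored_attrs = json.loads(json.dumps(df.attrs))

# Look up the units for each column from the restored metadata.
units = {name: meta["units"] for name, meta in restored_attrs["columns"].items()}
print(units)
```

This keeps the metadata with the file, but it does not populate ``Series.attrs`` on the individual columns when the data is read back.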
