Conversation

mrastgoo

added filters as a new parameter in pd.read_parquet

@jorisvandenbosschejorisvandenbossche changed the title ENH:filters parameters in pd.read_parqeut ENH: explicit filters parameter in pd.read_parqeut May 13, 2023
@jorisvandenbosschejorisvandenbossche added Docs IO Parquetparquet, featherlabels May 13, 2023
@mrastgoomrastgoo changed the title ENH: explicit filters parameter in pd.read_parqeut ENH: explicit filters parameter in pd.read_parquet May 14, 2023

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

Instead since we still support both engines, I think linking to the docs for both fastparquet and pyarrow in **kwargs is sufficient #52238 (comment)

@jorisvandenbossche

Given that the filters keyword is supported by both engines and works the same, I think it would certainly be useful to explicitly document this keyword, and not hide it in the description of the kwargs. We actually already do the same for the columns keyword.

@jorisvandenbossche

@mroeschke OK with adding this?

@mroeschke

Okay yeah that makes sense to add this then

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

Could you add a test for filters for both engines?

@github-actions

This pull request is stale because it has been open for thirty days with no activity. Please update and respond to this comment if you're still interested in working on this.

@github-actions-actions bot added the Stale label Jun 30, 2023
@jorisvandenbossche

@mrastgoo would you have time to add a test?

We have an existing test for the columns keyword (test_read_columns), so could add an equivalent test_read_filters test case next to it. Something like:

    def test_read_filters(self, engine, tmp_path):
        df = pd.DataFrame({"int": list(range(4)), "part": list("aabb"),})

        expected = pd.DataFrame({"int": [0, 1]})
        check_round_trip(
            df,
            engine,
            path=tmp_path,
            expected=expected,
            write_kwargs={"partition_cols": ["part"]},
            read_kwargs={"filters": [("string", "==", "a")], "columns":["int"]},
            repeat=1,
        )

I used partition cols to actually see the effect of the filter also for fastparquet (to ensure it is correctly passed through). And with that, the repeat=1 is needed to not add additional files to the directory when writing a second time (with partition_cols, it does not just overwrite the single file).

@mrastgoo

@jorisvandenbossche , yes I would do it asap, sorry for the delay in this issue and thanks for the info I will take that into account.

@mrastgoo

I just pushed for the test.
I changed the following line for the test

read_kwargs={"filters": [("part", "==", "a")], "columns":["int"]},

the error which was raised was not very clear for missing columns, which is not a pandas issue, but thought to mention it.

pyarrow.lib.ArrowInvalid: No match for FieldRef.Name(string) in int: int64
part: dictionary<values=string, indices=int32, ordered=0>

@mroeschke

Could you merge in main once more?

@@ -483,6 +489,7 @@ def read_parquet(
path: FilePath | ReadBuffer[bytes],
engine: str = "auto",
columns: list[str] | None = None,
filters: list[tuple] | list[list[tuple]] | None = None,
Copy link
Member

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

Could you put this as the last argument? (before **kwargs)?

Using this argument will NOT result in row-wise filtering of the final
partitions unless ``engine="pyarrow"`` is also specified. For
other engines, filtering is only performed at the partition level, that is,
to prevent the loading of some row-groups and/or files.
Copy link
Member

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

Could you add a .. versionadded:: 2.1.0 at the end?

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

Also needs a whatsnew entry in 2.1.0.rst in the IO section

@mroeschkemroeschke added this to the 2.1 milestone Aug 2, 2023
@mroeschkemroeschke merged commit 7cbf949 into pandas-dev:main Aug 2, 2023
@mroeschke

Thanks @mrastgoo

Sign up for free to join this conversation on . Already have an account? Sign in to comment
Docs IO Parquetparquet, feather
None yet

Successfully merging this pull request may close these issues.

DOC: Document the filters argument in read_parquet