ENH: explicit filters parameter in pd.read_parquet by mrastgoo · Pull Request #53212 · pandas-dev/pandas

closes DOC: Document the filters argument in read_parquet #52238
Tests added and passed if fixing a bug or adding a new feature
All code checks passed.
Added type annotations to new arguments/methods/functions.
Added an entry in the latest doc/source/whatsnew/vX.X.X.rst file if fixing a bug or adding a new feature.

Added filters as a new explicit parameter in pd.read_parquet.

Quoting #52238 (comment): "Instead, since we still support both engines, I think linking to the docs for both fastparquet and pyarrow in **kwargs is sufficient."

Given that the filters keyword is supported by both engines and works the same, I think it would certainly be useful to explicitly document this keyword, and not hide it in the description of the kwargs. We actually already do the same for the columns keyword.

@mroeschke OK with adding this?

Okay yeah that makes sense to add this then

Could you add a test for filters for both engines?

This pull request is stale because it has been open for thirty days with no activity. Please update and respond to this comment if you're still interested in working on this.

@mrastgoo would you have time to add a test?

We have an existing test for the columns keyword (test_read_columns), so we could add an equivalent test_read_filters test case next to it. Something like:

    def test_read_filters(self, engine, tmp_path):
        df = pd.DataFrame({"int": list(range(4)), "part": list("aabb")})

        expected = pd.DataFrame({"int": [0, 1]})
        check_round_trip(
            df,
            engine,
            path=tmp_path,
            expected=expected,
            write_kwargs={"partition_cols": ["part"]},
            read_kwargs={"filters": [("string", "==", "a")], "columns":["int"]},
            repeat=1,
        )

I used partition_cols so that the filter has a visible effect with fastparquet as well (to ensure it is correctly passed through). Because of that, repeat=1 is needed so that a second write does not add more files to the directory (with partition_cols, writing does not simply overwrite a single file).
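To illustrate why partitioning makes the filter observable even for an engine that only filters at the partition level, here is a rough pure-Python sketch. It does not call pandas, pyarrow, or fastparquet; the file listing and the helpers partition_values and prune are hypothetical and only assume hive-style part=<value> directory names.

```python
from pathlib import PurePosixPath

# Hypothetical listing of a dataset written with partition_cols=["part"];
# the exact file names are an assumption for illustration.
files = [
    "data/part=a/0.parquet",
    "data/part=b/0.parquet",
]


def partition_values(path):
    """Extract key=value directory segments from a hive-style path."""
    directories = PurePosixPath(path).parts[:-1]
    return dict(seg.split("=", 1) for seg in directories if "=" in seg)


def prune(paths, col, value):
    """Keep only the files whose partition directory satisfies col == value."""
    return [p for p in paths if partition_values(p).get(col) == value]


# Only the file under part=a needs to be opened for ("part", "==", "a").
selected = prune(files, "part", "a")
```

With a single unpartitioned file there would be nothing to prune at this level, so a filter that is silently dropped could go unnoticed in the test.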

@jorisvandenbossche, yes, I will do it ASAP. Sorry for the delay on this issue, and thanks for the info; I will take it into account.

I just pushed the test.
I changed the following line of the suggested test so that the filter refers to the existing part column instead of string:

read_kwargs={"filters": [("part", "==", "a")], "columns": ["int"]},

The error raised for the missing column was not very clear. That is not a pandas issue, but I thought it was worth mentioning:

pyarrow.lib.ArrowInvalid: No match for FieldRef.Name(string) in int: int64
part: dictionary<values=string, indices=int32, ordered=0>
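The failure above comes from a filter naming a column ("string") that does not exist in the data. As a sketch of how a caller could get a clearer message before reading, the following pure-Python check compares filter column names against a known set of columns. check_filter_columns is a hypothetical helper, not part of pandas or pyarrow, and the set of available columns is assumed to be known up front.

```python
def check_filter_columns(filters, available):
    """Raise a readable error if a filter names a column that is not available."""
    if not filters:
        return
    # Accept both a flat list of tuples and a list of lists of tuples.
    groups = [filters] if isinstance(filters[0], tuple) else filters
    missing = sorted(
        {col for group in groups for col, _op, _val in group if col not in available}
    )
    if missing:
        raise ValueError(
            f"filter column(s) {missing} not found; available: {sorted(available)}"
        )


available = {"int", "part"}

# A filter on an existing column passes silently.
check_filter_columns([("part", "==", "a")], available)

# A filter on a missing column produces a message that names the column.
message = None
try:
    check_filter_columns([("string", "==", "a")], available)
except ValueError as exc:
    message = str(exc)
```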

Could you merge in main once more?

mroeschke · 2023-08-01T20:09:47Z

pandas/io/parquet.py

@@ -483,6 +489,7 @@ def read_parquet(
 path: FilePath | ReadBuffer[bytes],
 engine: str = "auto",
 columns: list[str] | None = None,
+filters: list[tuple] | list[list[tuple]] | None = None,


Could you put this as the last argument? (before **kwargs)?
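For readers unfamiliar with the two shapes in the annotation list[tuple] | list[list[tuple]], the engines document them as disjunctive normal form: the tuples inside one list are combined with AND, and the outer list combines those groups with OR. The sketch below evaluates that logic on plain dicts. It is an illustration only, not how either engine is implemented; OPS and row_matches are hypothetical names, and only a handful of operators are covered.

```python
import operator

# Operators supported by this sketch; the real engines accept more.
OPS = {
    "==": operator.eq,
    "!=": operator.ne,
    "<": operator.lt,
    "<=": operator.le,
    ">": operator.gt,
    ">=": operator.ge,
    "in": lambda value, choices: value in choices,
}


def row_matches(row, filters):
    """Return True if `row` (a dict) satisfies `filters` given in DNF."""
    if not filters:
        return True
    # A flat list of tuples is one AND group; wrap it so both shapes
    # are handled as an OR over AND groups.
    groups = [filters] if isinstance(filters[0], tuple) else filters
    return any(
        all(OPS[op](row[col], val) for col, op, val in group)
        for group in groups
    )


rows = [{"int": i, "part": p} for i, p in zip(range(4), "aabb")]

# Flat list: part == "a" AND int > 0
and_filter = [("part", "==", "a"), ("int", ">", 0)]
# Nested list: (part == "a" AND int > 0) OR (int == 3)
or_filter = [[("part", "==", "a"), ("int", ">", 0)], [("int", "==", 3)]]

and_result = [r["int"] for r in rows if row_matches(r, and_filter)]
or_result = [r["int"] for r in rows if row_matches(r, or_filter)]
```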

mroeschke · 2023-08-01T20:10:18Z

pandas/io/parquet.py

+Using this argument will NOT result in row-wise filtering of the final
+partitions unless ``engine="pyarrow"`` is also specified. For
+other engines, filtering is only performed at the partition level, that is,
+to prevent the loading of some row-groups and/or files.


Could you add a .. versionadded:: 2.1.0 at the end?

Also needs a whatsnew entry in 2.1.0.rst in the IO section
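The distinction the docstring draws between partition-level and row-wise filtering can be shown with a small simulation. This is a sketch under stated assumptions: the row_groups layout with min/max statistics is a made-up stand-in for Parquet metadata, and the functions are hypothetical rather than code from pandas or either engine.

```python
# Each simulated row group carries (min, max) statistics for the column
# "int", loosely mimicking Parquet row-group metadata.
row_groups = [
    {"stats": {"int": (0, 1)}, "rows": [{"int": 0}, {"int": 1}]},
    {"stats": {"int": (2, 3)}, "rows": [{"int": 2}, {"int": 3}]},
]


def group_may_match(stats, col, threshold):
    """Could any row in the group satisfy `col > threshold`?"""
    _col_min, col_max = stats[col]
    return col_max > threshold


def read_partition_level(groups, col, threshold):
    """Skip whole groups using statistics, but keep every row of the rest."""
    values = []
    for group in groups:
        if group_may_match(group["stats"], col, threshold):
            values.extend(row[col] for row in group["rows"])
    return values


def read_row_level(groups, col, threshold):
    """Prune groups first, then also drop individual non-matching rows."""
    kept = read_partition_level(groups, col, threshold)
    return [value for value in kept if value > threshold]


# Filter: int > 2
coarse = read_partition_level(row_groups, "int", 2)  # first group skipped only
exact = read_row_level(row_groups, "int", 2)  # non-matching rows dropped too
```

Here coarse still contains the value 2, because the surviving group is loaded whole, while exact contains only rows that satisfy the predicate.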

Thanks @mrastgoo

filters parameters in pd.read_parqeut

db9a456

jorisvandenbossche changed the title ~~ENH:filters parameters in pd.read_parqeut~~ ENH: explicit filters parameter in pd.read_parqeut May 13, 2023

jorisvandenbossche added the Docs and IO Parquet (parquet, feather) labels May 13, 2023

linter

4624bee

mrastgoo changed the title ~~ENH: explicit filters parameter in pd.read_parqeut~~ ENH: explicit filters parameter in pd.read_parquet May 14, 2023

docstring validation

9b4439d

mroeschke requested changes May 15, 2023

mroeschke reviewed May 30, 2023

github-actions bot added the Stale label Jun 30, 2023

jorisvandenbossche removed the Stale label Jul 4, 2023

mrastgoo added 2 commits July 5, 2023 20:08

Merge branch 'main' into issue_52238

a5301b5

test for filter args in pd.read_parquet

4e94179

black

36cbfe2

Merge branch 'main' into issue_52238

cd35718

mroeschke reviewed Aug 1, 2023

mrastgoo added 2 commits August 2, 2023 09:53

addressing reviews

d324e1b

Merge branch 'main' into issue_52238

05d4677

mroeschke added this to the 2.1 milestone Aug 2, 2023

mroeschke approved these changes Aug 2, 2023

mroeschke merged commit 7cbf949 into pandas-dev:main Aug 2, 2023
