Conversation

tswast

In internal issue 182204971, a customer is encountering a 403 error for missing permissions to set policy tags on a table when trying to append a DataFrame to that table with load_table_from_dataframe. This happens because we fetch the schema from the table and then pass it back to the API. We only need to send the field names (and possibly type + mode) in this case.

Thank you for opening a Pull Request! Before submitting your PR, there are a few things you can do to make sure it goes smoothly:

  • Make sure to open an issue as a bug/issue before writing your code! That way we can discuss the change, evaluate designs, and agree on the general idea
  • Ensure the tests and linter pass
  • Code coverage does not decrease (if any source code was changed)
  • Appropriate docs were updated (if necessary)

Fixes internal bug 182204971 🦕

product-auto-label bot added the api: bigquery label (Issues related to the googleapis/python-bigquery API) on Mar 17, 2021
google-cla bot added the cla: yes label (This human has signed the Contributor License Agreement) on Mar 17, 2021
tswast changed the title from "WIP: fix: don't set policy tags in load job from dataframe" to "fix: avoid policy tags 403 error in load_table_from_dataframe" on Mar 17, 2021
tswast marked this pull request as ready for review on March 18, 2021 14:34
tswast requested review from a team, stephaniewang526, shollyman and plamut, and removed the request for a team, on March 18, 2021 14:34
}
if mode is not None:
self._properties["mode"] = mode.upper()
if description is not _DEFAULT_VALUE:
tswast (author):
@shollyman This is one of the key changes: we no longer set the resource value for "description" if it's not explicitly set.

We already omit policy_tags from the resource if none (though arguably it should get the same treatment so that someone can unset policy tags from Python)
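The diff above shows the sentinel pattern this comment refers to. Here is a minimal, self-contained sketch of it — a simplified stand-in for the library's SchemaField, not its actual internals — showing how a sentinel default distinguishes "argument never passed" from an explicit None:

```python
_DEFAULT_VALUE = object()  # sentinel: "caller did not pass this argument"


class SchemaField:
    """Illustrative stand-in for a BigQuery schema field resource."""

    def __init__(self, name, field_type, mode="NULLABLE", description=_DEFAULT_VALUE):
        self._properties = {"name": name, "type": field_type}
        if mode is not None:
            self._properties["mode"] = mode.upper()
        # Only serialize "description" when the caller set it explicitly,
        # so an untouched field round-trips without the key at all, while
        # an explicit description=None can still unset it server-side.
        if description is not _DEFAULT_VALUE:
            self._properties["description"] = description
```

With this shape, `SchemaField("age", "INTEGER")` produces a resource with no "description" key, while `SchemaField("age", "INTEGER", description=None)` sends an explicit null — which is exactly the distinction that would let policy_tags get the same treatment.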

shollyman (reviewer):

My default inclination would be for special handling for None values to happen at the places where it's significant, like when calling tables.update. It's also the case that schema fields can't be manipulated individually, so perhaps I'm simply just not thinking this through properly.

tswast (author):

I called that out as a possibility in #558, but that'd require updating our field mask logic to support sub-fields, which gets into some hairy string parsing (perhaps not all that hairy, as it could be as simple as split on '.', but is definitely a departure from what we've been doing).

Also, it might mean that we'd have to introduce a field mask to our load job methods. Based on the error message we're seeing, it sounds like it's possible to make updates to fields like policy tags from a load job.
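To make the "sub-fields in the field mask" idea concrete, here is a hedged sketch (names are hypothetical, not the library's actual field-mask code) of how a recursive diff could emit dotted paths like `schema.description` by joining nested keys on '.':

```python
def diff_paths(old, new, prefix=""):
    """Collect dotted field-mask paths where `new` differs from `old`.

    Illustrative only: sketches the sub-field field-mask extension
    discussed above, not the library's existing (flat) mask logic.
    """
    paths = []
    for key, value in new.items():
        if isinstance(value, dict) and isinstance(old.get(key), dict):
            # Recurse into nested resources, extending the dotted prefix.
            paths.extend(diff_paths(old[key], value, prefix + key + "."))
        elif old.get(key) != value:
            paths.append(prefix + key)
    return sorted(paths)
```

As the comment notes, the string handling is not all that hairy — the prefix bookkeeping above is the whole of it — but it is a departure from the flat top-level masks used so far.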

field for field in table.schema if field.name in columns_and_indexes
# Field description and policy tags are not needed to
# serialize a data frame.
SchemaField(
tswast (author):
This is the actual bug fix. Rather than populating every property of each schema field from the table schema, we populate only the minimum needed to convert to Parquet/CSV and upload.
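The idea can be sketched as follows, with schema fields as plain dicts standing in for SchemaField objects (an illustrative sketch, not the library's code):

```python
def minimal_load_schema(table_schema, columns):
    """Keep only the fields present in the DataFrame, stripped down to the
    properties a load job needs (name, type, mode).

    Dropping "description" and "policyTags" means the load job never touches
    the taxonomy, so no policy-tag permissions are required just to append.
    """
    return [
        {"name": f["name"], "type": f["type"], "mode": f.get("mode", "NULLABLE")}
        for f in table_schema
        if f["name"] in columns
    ]
```

A field carrying `policyTags` in the table's schema comes back as just `{"name": ..., "type": ..., "mode": ...}`, which is all the backend needs to match columns in the uploaded file.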

shollyman (reviewer):
We'll need to revisit this for parameterization constraints, but that's a problem for future Tim.

Also, check that the schema we send matches the DataFrame's column order, not the table's order.
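That ordering check can be sketched like so (illustrative helper, not the library's implementation), since the serialized file's columns follow the DataFrame, not the table:

```python
def order_like_dataframe(schema, df_columns):
    """Reorder schema fields (plain dicts here) to follow the DataFrame's
    column order, dropping fields the DataFrame doesn't have."""
    by_name = {f["name"]: f for f in schema}
    return [by_name[name] for name in df_columns if name in by_name]
```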

tswast merged commit 84e646e into googleapis:master on Mar 19, 2021
tswast deleted the b182204971-dataframe-policy-tags branch on March 19, 2021 17:54
tswast:

For posterity, here is the error you get if you include policyTags in the schema, even if they aren't actually changed:

[1] Reason: 403 POST
https://bigquery.googleapis.com/upload/bigquery/v2/projects/YOUR-PROJECT-ID/jobs?uploadType=resumable:
Access Denied: Taxonomy projects/REDACTED/locations/eu/taxonomies/REDACTED: 
User does not have permission to get taxonomy projects/REDACTED/locations/eu/taxonomies/REDACTED
