Trying to load data from BigQuery to Spark Dataframe using query option

When I execute the following code, I expect the filter pushdown to work.

reader = self.spark.read.format("bigquery") \
                .option('parentProject', self.project_id) \
                .option('viewsEnabled', True) \
                .option("materializationDataset", "tmp") \
                .option("query", f"select * from `{table_id}`") \
                .option("filter", "CAST(`_TABLE_SUFFIX` AS STRING) = '20170405'")

But instead I get this error:

Caused by: com.google.cloud.spark.bigquery.repackaged.io.grpc.StatusRuntimeException: INVALID_ARGUMENT: request failed: Row filter for bigquery-rollup.tmp._bqc_c372ae3bcf3a46019600500ebacdf15d is invalid. Filter is '(CAST(_TABLE_SUFFIX AS STRING) = '20170405')'

When I run the same read without the "query" option, it works:

reader = self.spark.read \
                .format('bigquery') \
                .option('parentProject', self.project_id) \
                .option('dataset', self.dataset_id) \
                .option('table', self.table_id) \
                .option('filter', "CAST(`_TABLE_SUFFIX` AS STRING) = '20170405'")

Are filters not supported in query-based reads? Please advise. Thanks.

Issue Analytics

  • State: closed
  • Created 2 years ago
  • Comments: 5 (2 by maintainers)

Top GitHub Comments

davidrabinowitz commented, Feb 8, 2022 (2 reactions)

My guess is that you are reading from tables with a date suffix as a way of partitioning. When loading from a query, the connector actually runs the query on BigQuery and saves the materialized result in a temporary table (bigquery-rollup.tmp._bqc_c372ae3bcf3a46019600500ebacdf15d in this case). The filter is then applied to that temporary table, which has no _TABLE_SUFFIX pseudo-column, so the filter is rejected.
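One way to picture this: SELECT * does not copy the _TABLE_SUFFIX pseudo-column into the materialized result. A minimal sketch, assuming the filter option really is applied to the temporary table as the error message suggests, is to expose the suffix as an ordinary column and filter on that column instead. The helper names, the table_suffix alias, and the example table ID are hypothetical, and this has not been verified against the connector.

```python
def build_query_with_suffix_column(table_id: str, alias: str = "table_suffix") -> str:
    """Return SQL that copies the _TABLE_SUFFIX pseudo-column into a regular column."""
    # The alias is inlined into SQL, so accept only a plain identifier.
    if not alias.isidentifier():
        raise ValueError(f"invalid column alias: {alias!r}")
    return f"SELECT *, _TABLE_SUFFIX AS {alias} FROM `{table_id}`"


def build_suffix_filter(suffix: str, alias: str = "table_suffix") -> str:
    """Return a row filter on the aliased column, which exists in the temporary table."""
    # Hypothetical validation: date-style suffixes are digits only.
    if not suffix.isdigit():
        raise ValueError(f"unexpected table suffix: {suffix!r}")
    return f"{alias} = '{suffix}'"


def read_with_suffix_column(spark, project_id: str, table_id: str, suffix: str):
    """Query-based read whose filter refers to a real column (assumption, untested)."""
    return (
        spark.read.format("bigquery")
        .option("parentProject", project_id)
        .option("viewsEnabled", "true")
        .option("materializationDataset", "tmp")
        .option("query", build_query_with_suffix_column(table_id))
        .option("filter", build_suffix_filter(suffix))
        .load()
    )


if __name__ == "__main__":
    # Hypothetical wildcard table ID, for illustration only.
    print(build_query_with_suffix_column("my-project.my_dataset.events_*"))
    print(build_suffix_filter("20170405"))
```

Note that this query still reads every table matched by the wildcard before the filter is applied, so it is likely to cost more than filtering inside the query itself.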

As an alternative, you can combine the query and the filter:

reader = self.spark.read.format("bigquery") \
                .option('parentProject', self.project_id) \
                .option('viewsEnabled', True) \
                .option("materializationDataset", "tmp") \
                .load(f"SELECT * FROM `{table_id}` WHERE CAST(`_TABLE_SUFFIX` AS STRING) = '20170405'")
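If the suffix changes between runs, the combined query can be built by a small helper. This is only a sketch of the approach above: the function names, the digit-only check, and the example table ID are assumptions, while the connector options are the ones already used in this thread.

```python
def build_suffix_query(table_id: str, suffix: str) -> str:
    """Return SQL that filters a wildcard table by _TABLE_SUFFIX inside the query."""
    # Hypothetical safeguard: the suffix is inlined into SQL, so allow digits only.
    if not suffix.isdigit():
        raise ValueError(f"unexpected table suffix: {suffix!r}")
    return (
        f"SELECT * FROM `{table_id}` "
        f"WHERE CAST(`_TABLE_SUFFIX` AS STRING) = '{suffix}'"
    )


def read_by_suffix(spark, project_id: str, table_id: str, suffix: str):
    """Run the combined query on BigQuery and load the materialized result."""
    return (
        spark.read.format("bigquery")
        .option("parentProject", project_id)
        .option("viewsEnabled", "true")
        .option("materializationDataset", "tmp")
        .load(build_suffix_query(table_id, suffix))
    )


if __name__ == "__main__":
    # Hypothetical wildcard table ID, for illustration only.
    print(build_suffix_query("my-project.my_dataset.events_*", "20170405"))
```

Because the WHERE clause runs on BigQuery before the result is materialized, no separate filter option is needed on the read.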

akashgangulyhf commented, Feb 9, 2022 (0 reactions)

Thanks for all the info. It was really helpful. Have a good one.
