Trying to load data from BigQuery to Spark DataFrame using query option
When I execute the following code, I expect the filter pushdown to work:
reader = self.spark.read.format("bigquery") \
.option('parentProject', self.project_id) \
.option('viewsEnabled', True) \
.option("materializationDataset", "tmp") \
.option("query", f"select * from `{table_id}`")\
.option("filter", "CAST(`_TABLE_SUFFIX` AS STRING) = '20170405'")
But instead I get an error:
Caused by: com.google.cloud.spark.bigquery.repackaged.io.grpc.StatusRuntimeException: INVALID_ARGUMENT: request failed: Row filter for bigquery-rollup.tmp._bqc_c372ae3bcf3a46019600500ebacdf15d is invalid. Filter is '(CAST(_TABLE_SUFFIX AS STRING) = '20170405')'
When I run the same operation without the "query" option, it works:
reader = self.spark.read \
.format('bigquery') \
.option('parentProject', self.project_id) \
.option('dataset', self.dataset_id) \
.option('table', self.table_id) \
.option('filter', "CAST(`_TABLE_SUFFIX` AS STRING) = '20170405'")
Is there support for filters in query-based reads? Please suggest. Thanks.
Issue Analytics
- Created 2 years ago
- Comments: 5 (2 by maintainers)
My guess is that you are reading from a table with a dated suffix as a way of partitioning. When loading from a query, the query actually runs on BigQuery and the materialized result is saved in a temporary table (bigquery-rollup.tmp._bqc_c372ae3bcf3a46019600500ebacdf15d in our case), on which the filter cannot be applied. As an alternative, you can combine the query and the filter:
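A minimal sketch of combining the two (not from the original thread): fold the suffix filter into the query string itself, so BigQuery applies it before the result is materialized. The table id, suffix value, and helper name here are hypothetical placeholders.

```python
# Build the BigQuery SQL with the _TABLE_SUFFIX filter inlined in the
# WHERE clause, instead of passing a separate "filter" option.
def build_suffix_query(table_id: str, suffix: str) -> str:
    """Return a query that filters a wildcard table by its date suffix."""
    return (
        f"select * from `{table_id}` "
        f"where CAST(`_TABLE_SUFFIX` AS STRING) = '{suffix}'"
    )

query = build_suffix_query("my-project.mydataset.events_*", "20170405")

# With a live Spark session and the spark-bigquery connector available,
# the read would then look like this (requires a real cluster, so it is
# left commented here):
#
# df = (
#     spark.read.format("bigquery")
#     .option("parentProject", project_id)
#     .option("viewsEnabled", True)
#     .option("materializationDataset", "tmp")
#     .option("query", query)
#     .load()
# )
```

Since the filter now lives inside the query, the connector never tries to apply a row filter to the temporary materialization table, which is what triggered the INVALID_ARGUMENT error.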
Thanks for all the info. Was really very helpful. Have a good one.