Reading BigQuery table in PySpark outside GCP
My server runs on AWS.
I followed the instructions here and this tutorial script, but I get a Py4JJavaError caused by a missing project ID:
Caused by: java.lang.IllegalArgumentException: A project ID is required for this service but could not be determined from the builder or the environment. Please set a project ID using the builder.
at com.google.cloud.spark.bigquery.repackaged.com.google.common.base.Preconditions.checkArgument(Preconditions.java:142)
at com.google.cloud.spark.bigquery.repackaged.com.google.cloud.ServiceOptions.<init>(ServiceOptions.java:266)
at com.google.cloud.spark.bigquery.repackaged.com.google.cloud.bigquery.BigQueryOptions.<init>(BigQueryOptions.java:81)
at com.google.cloud.spark.bigquery.repackaged.com.google.cloud.bigquery.BigQueryOptions.<init>(BigQueryOptions.java:30)
at com.google.cloud.spark.bigquery.repackaged.com.google.cloud.bigquery.BigQueryOptions$Builder.build(BigQueryOptions.java:76)
at com.google.cloud.spark.bigquery.repackaged.com.google.cloud.bigquery.BigQueryOptions.getDefaultInstance(BigQueryOptions.java:136)
at com.google.cloud.spark.bigquery.BigQueryRelationProvider$.$lessinit$greater$default$2(BigQueryRelationProvider.scala:30)
at com.google.cloud.spark.bigquery.BigQueryRelationProvider.<init>(BigQueryRelationProvider.scala:40)
at sun.reflect.NativeConstructorAccessorImpl.newInstance0(Native Method)
at sun.reflect.NativeConstructorAccessorImpl.newInstance(NativeConstructorAccessorImpl.java:62)
at sun.reflect.DelegatingConstructorAccessorImpl.newInstance(DelegatingConstructorAccessorImpl.java:45)
at java.lang.reflect.Constructor.newInstance(Constructor.java:423)
at java.lang.Class.newInstance(Class.java:442)
at java.util.ServiceLoader$LazyIterator.nextService(ServiceLoader.java:380)
... 24 more
My Python script looks like this:
from pyspark.sql import SparkSession
spark = (
SparkSession.builder
.appName('bq')
.master('local[4]')
.config('spark.jars', '/path/to/bigquery_spark-bigquery-latest.jar')
.getOrCreate()
)
spark.conf.set("credentialsFile", "/path/to/credentials.json")
df = (
spark.read
.format('bigquery')
.option('project', 'myProject')
.option('table', 'myTable')
.load()
)
Any idea how I could fix the missing project ID error?
Hi @FurcyPin,

I would recommend specifying the parent (billing) project option as well:

.option("parentProject", "billing-project-id")
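For example, a minimal sketch of the read above with that option added (the project ID, dataset, and file paths here are placeholders, not values from the original script):

from pyspark.sql import SparkSession

spark = (
    SparkSession.builder
    .appName('bq')
    .master('local[4]')
    .config('spark.jars', '/path/to/bigquery_spark-bigquery-latest.jar')
    .getOrCreate()
)

df = (
    spark.read
    .format('bigquery')
    # Credentials used for the BigQuery API calls
    .option('credentialsFile', '/path/to/credentials.json')
    # Project billed for the BigQuery jobs; this is what the error is asking for
    .option('parentProject', 'billing-project-id')
    # Fully qualified table reference: project.dataset.table
    .option('table', 'myProject.myDataset.myTable')
    .load()
)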
@KhanhTrinh1703 Thanks for the quick response!

I think I finally figured out what's going on. After adding the parentProject option I was authenticated to BigQuery, but not to Google Cloud Storage. I needed to pass my key into the hadoopConfiguration as well:
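Roughly, a minimal sketch assuming the standard GCS Hadoop connector properties (the key file path is a placeholder):

# Point the GCS connector at the same service-account key file
conf = spark.sparkContext._jsc.hadoopConfiguration()
conf.set('google.cloud.auth.service.account.enable', 'true')
conf.set('google.cloud.auth.service.account.json.keyfile', '/path/to/credentials.json')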
I only wonder if there is an option to write with temporaryGcsBucket without the need to pass this config (the write I mean is sketched below). Thank you for your time. Take care!
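For reference, a minimal sketch of that write, assuming the connector's indirect write through a temporary GCS bucket (the bucket and table names are placeholders):

(
    df.write
    .format('bigquery')
    # Staging bucket for the indirect write; this is the step that needs
    # the GCS credentials configured above
    .option('temporaryGcsBucket', 'my-temp-bucket')
    .save('myProject.myDataset.myTable')
)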