Stuck on an issue?

Lightrun Answers was designed to reduce the constant googling that comes with debugging third-party libraries. It collects links to all the places you might be looking while hunting down a tough bug.

And, if you’re still stuck at the end, we’re happy to hop on a call to see how we can help out.

Reading BigQuery table in PySpark outside GCP

See original GitHub issue

My server runs on AWS. I followed the instructions here and this tutorial script, but I get a Py4JJavaError caused by a missing project ID:

Caused by: java.lang.IllegalArgumentException: A project ID is required for this service but could not be determined from the builder or the environment.  Please set a project ID using the builder.
	at com.google.cloud.spark.bigquery.repackaged.com.google.common.base.Preconditions.checkArgument(Preconditions.java:142)
	at com.google.cloud.spark.bigquery.repackaged.com.google.cloud.ServiceOptions.<init>(ServiceOptions.java:266)
	at com.google.cloud.spark.bigquery.repackaged.com.google.cloud.bigquery.BigQueryOptions.<init>(BigQueryOptions.java:81)
	at com.google.cloud.spark.bigquery.repackaged.com.google.cloud.bigquery.BigQueryOptions.<init>(BigQueryOptions.java:30)
	at com.google.cloud.spark.bigquery.repackaged.com.google.cloud.bigquery.BigQueryOptions$Builder.build(BigQueryOptions.java:76)
	at com.google.cloud.spark.bigquery.repackaged.com.google.cloud.bigquery.BigQueryOptions.getDefaultInstance(BigQueryOptions.java:136)
	at com.google.cloud.spark.bigquery.BigQueryRelationProvider$.$lessinit$greater$default$2(BigQueryRelationProvider.scala:30)
	at com.google.cloud.spark.bigquery.BigQueryRelationProvider.<init>(BigQueryRelationProvider.scala:40)
	at sun.reflect.NativeConstructorAccessorImpl.newInstance0(Native Method)
	at sun.reflect.NativeConstructorAccessorImpl.newInstance(NativeConstructorAccessorImpl.java:62)
	at sun.reflect.DelegatingConstructorAccessorImpl.newInstance(DelegatingConstructorAccessorImpl.java:45)
	at java.lang.reflect.Constructor.newInstance(Constructor.java:423)
	at java.lang.Class.newInstance(Class.java:442)
	at java.util.ServiceLoader$LazyIterator.nextService(ServiceLoader.java:380)
	... 24 more

My python script looks like this:

from pyspark.sql import SparkSession
spark = (
    SparkSession.builder
    .appName('bq')
    .master('local[4]')
    .config('spark.jars', '/path/to/bigquery_spark-bigquery-latest.jar')
    .getOrCreate()
)
spark.conf.set("credentialsFile", "/path/to/credentials.json")

df = (
    spark.read
    .format('bigquery')
    .option('project', 'myProject')
    .option('table', 'myTable')
    .load()
)

Any idea how I could fix the missing project ID error?

Issue Analytics

  • State: closed
  • Created 4 years ago
  • Comments: 17 (4 by maintainers)

Top GitHub Comments

4 reactions
xjrk58 commented, Nov 27, 2019

Hi @FurcyPin,

I would recommend specifying the parent (billing) project option as well: .option("parentProject", "billing-project-id").
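
For reference, a minimal sketch of a read that sets both the credentials file and the billing project on the connector. The project IDs, table name and key path below are placeholders rather than values from the issue, and the spark-bigquery connector jar is assumed to already be on the classpath (e.g. via spark.jars as in the question):

from pyspark.sql import SparkSession

spark = (
    SparkSession.builder
    .appName('bq-read')
    .master('local[4]')
    .getOrCreate()
)

df = (
    spark.read
    .format('bigquery')
    .option('credentialsFile', '/path/to/credentials.json')  # service account key
    .option('parentProject', 'my-billing-project')           # project billed for the job
    .option('project', 'my-data-project')                    # project that owns the table
    .option('table', 'dataset.table')
    .load()
)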

3 reactions
moscich commented, Dec 24, 2019

@KhanhTrinh1703 Thanks for a quick response!

I think I finally figured out what's going on. After adding the parentProject option I was authenticated to BigQuery but not to Google Cloud Storage.

I also needed to pass my key into the Hadoop configuration:

spark.sparkContext._jsc.hadoopConfiguration().set("fs.gs.impl", "com.google.cloud.hadoop.fs.gcs.GoogleHadoopFileSystem")
spark.sparkContext._jsc.hadoopConfiguration().set("fs.AbstractFileSystem.gs.impl", "com.google.cloud.hadoop.fs.gcs.GoogleHadoopFS")
spark.sparkContext._jsc.hadoopConfiguration().set("google.cloud.auth.service.account.enable", "true")
spark.sparkContext._jsc.hadoopConfiguration().set("google.cloud.auth.service.account.json.keyfile", "/Users/marekmoscichowski/Documents/Dev/pyspark/key2.json")

I only wonder if there is an option to write with temporaryGcsBucket without the need to pass this config. Thank you for your time. Take care!
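
For reference, a hedged sketch of the write path mentioned above. temporaryGcsBucket is a documented connector option for staging data in Cloud Storage before it is loaded into BigQuery; the bucket and table names are placeholders, and whether the GCS side can reuse the BigQuery credentials without the Hadoop settings shown earlier depends on the connector version:

# df is an existing DataFrame; project, bucket and table names are placeholders
(
    df.write
    .format('bigquery')
    .option('parentProject', 'my-billing-project')
    .option('temporaryGcsBucket', 'my-temp-bucket')  # staging bucket for the load job
    .mode('append')
    .save('dataset.output_table')
)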

Read more comments on GitHub >

Top Results From Across the Web

Use the BigQuery connector with Spark - Google Cloud
The connector writes the data to BigQuery by first buffering all the data into a Cloud Storage temporary table. Then it copies all...
Read more >
Reading BigQuery table in PySpark | by Jessica Le
In this post, let's simply read the data from Google Cloud BigQuery table using BigQuery connector with Spark on my local Macbook terminal....
Read more >
How to access BigQuery using Spark which is running outside ...
For reading regular tables there's no need for bigquery.tables.create permission. However, the code sample you've provided hints that the ...
Read more >
Spark - Read from BigQuery Table - Kontext
Create a script file named pyspark-bq.py in your home folder of the Cloud Shell VM. The file content looks like the following: #!/usr/bin/python...
Read more >
Read and Write to BigQuery with Spark and IDE ... - LinkedIn
There are some information here. However, with the caveat that it describes accessing BigQuery through Dataproc servers in GCP. Well, that is ...
Read more >
