Stuck on an issue?

Lightrun Answers was designed to reduce the constant googling that comes with debugging third-party libraries. It collects links to all the places you might be looking while hunting down a tough bug.

And, if you’re still stuck at the end, we’re happy to hop on a call to see how we can help out.

Reading BigQuery table in PySpark outside GCP

See original GitHub issue

My server runs on AWS. I followed the instructions here and this tutorial script, but I get a Py4JJavaError caused by a missing project ID:

Caused by: java.lang.IllegalArgumentException: A project ID is required for this service but could not be determined from the builder or the environment.  Please set a project ID using the builder.
	at com.google.cloud.spark.bigquery.repackaged.com.google.common.base.Preconditions.checkArgument(Preconditions.java:142)
	at com.google.cloud.spark.bigquery.repackaged.com.google.cloud.ServiceOptions.<init>(ServiceOptions.java:266)
	at com.google.cloud.spark.bigquery.repackaged.com.google.cloud.bigquery.BigQueryOptions.<init>(BigQueryOptions.java:81)
	at com.google.cloud.spark.bigquery.repackaged.com.google.cloud.bigquery.BigQueryOptions.<init>(BigQueryOptions.java:30)
	at com.google.cloud.spark.bigquery.repackaged.com.google.cloud.bigquery.BigQueryOptions$Builder.build(BigQueryOptions.java:76)
	at com.google.cloud.spark.bigquery.repackaged.com.google.cloud.bigquery.BigQueryOptions.getDefaultInstance(BigQueryOptions.java:136)
	at com.google.cloud.spark.bigquery.BigQueryRelationProvider$.$lessinit$greater$default$2(BigQueryRelationProvider.scala:30)
	at com.google.cloud.spark.bigquery.BigQueryRelationProvider.<init>(BigQueryRelationProvider.scala:40)
	at sun.reflect.NativeConstructorAccessorImpl.newInstance0(Native Method)
	at sun.reflect.NativeConstructorAccessorImpl.newInstance(NativeConstructorAccessorImpl.java:62)
	at sun.reflect.DelegatingConstructorAccessorImpl.newInstance(DelegatingConstructorAccessorImpl.java:45)
	at java.lang.reflect.Constructor.newInstance(Constructor.java:423)
	at java.lang.Class.newInstance(Class.java:442)
	at java.util.ServiceLoader$LazyIterator.nextService(ServiceLoader.java:380)
	... 24 more

My python script looks like this:

from pyspark.sql import SparkSession
spark = (
    SparkSession.builder
    .appName('bq')
    .master('local[4]')
    .config('spark.jars', '/path/to/bigquery_spark-bigquery-latest.jar')
    .getOrCreate()
)
spark.conf.set("credentialsFile", "/path/to/credentials.json")

df = (
    spark.read
    .format('bigquery')
    .option('project', 'myProject')
    .option('table', 'myTable')
    .load()
)

Any idea how I could fix the missing project ID error?

Issue Analytics

  • State: closed
  • Created 4 years ago
  • Comments: 17 (4 by maintainers)

Top GitHub Comments

4 reactions
xjrk58 commented, Nov 27, 2019

Hi @FurcyPin,

I would recommend specifying the parent (billing) project option as well: .option("parentProject", "billing-project-id").
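
For reference, a minimal sketch of a read that sets both the credentials file and the billing project on the connector. The project IDs, table name and key path below are placeholders rather than values from the issue, and the spark-bigquery connector jar is assumed to already be on the classpath (e.g. via spark.jars as in the question):

from pyspark.sql import SparkSession

spark = (
    SparkSession.builder
    .appName('bq-read')
    .master('local[4]')
    .getOrCreate()
)

df = (
    spark.read
    .format('bigquery')
    .option('credentialsFile', '/path/to/credentials.json')  # service account key
    .option('parentProject', 'my-billing-project')           # project billed for the job
    .option('project', 'my-data-project')                    # project that owns the table
    .option('table', 'dataset.table')
    .load()
)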

3 reactions
moscich commented, Dec 24, 2019

@KhanhTrinh1703 Thanks for a quick response!

I think I finally figured out what's going on. After adding the parentProject option I was authenticated to BigQuery but not to Google Cloud Storage.

I also needed to pass my key into the Hadoop configuration:

spark.sparkContext._jsc.hadoopConfiguration().set("fs.gs.impl", "com.google.cloud.hadoop.fs.gcs.GoogleHadoopFileSystem")
spark.sparkContext._jsc.hadoopConfiguration().set("fs.AbstractFileSystem.gs.impl", "com.google.cloud.hadoop.fs.gcs.GoogleHadoopFS")
spark.sparkContext._jsc.hadoopConfiguration().set("google.cloud.auth.service.account.enable", "true")
spark.sparkContext._jsc.hadoopConfiguration().set("google.cloud.auth.service.account.json.keyfile", "/Users/marekmoscichowski/Documents/Dev/pyspark/key2.json")

I only wonder if there is an option to write with temporaryGcsBucket without the need to pass this config. Thank you for your time. Take care!
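
For reference, a hedged sketch of the write path mentioned above. temporaryGcsBucket is a documented connector option for staging data in Cloud Storage before it is loaded into BigQuery; the bucket and table names are placeholders, and whether the GCS side can reuse the BigQuery credentials without the Hadoop settings shown earlier depends on the connector version:

# df is an existing DataFrame; project, bucket and table names are placeholders
(
    df.write
    .format('bigquery')
    .option('parentProject', 'my-billing-project')
    .option('temporaryGcsBucket', 'my-temp-bucket')  # staging bucket for the load job
    .mode('append')
    .save('dataset.output_table')
)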

Read more comments on GitHub >

Top Results From Across the Web

Use the BigQuery connector with Spark - Google Cloud
The connector writes the data to BigQuery by first buffering all the data into a Cloud Storage temporary table. Then it copies all...
Read more >
Reading BigQuery table in PySpark | by Jessica Le
In this post, let's simply read the data from Google Cloud BigQuery table using BigQuery connector with Spark on my local Macbook terminal....
Read more >
How to access BigQuery using Spark which is running outside ...
For reading regular tables there's no need for bigquery.tables.create permission. However, the code sample you've provided hints that the ...
Read more >
Spark - Read from BigQuery Table - Kontext
Create a script file named pyspark-bq.py in your home folder of the Cloud Shell VM. The file content looks like the following: #!/usr/bin/python...
Read more >
Read and Write to BigQuery with Spark and IDE ... - LinkedIn
There are some information here. However, with the caveat that it describes accessing BigQuery through Dataproc servers in GCP. Well, that is ...
Read more >
