
"Reading data from a BigQuery query" not working: dataset not parsed or provided

See original GitHub issue

I’m using:

  • Java 8
  • Scala 2.12
  • Dataproc image: 2.0-debian10

In my project, I need to send queries to BigQuery without using Spark SQL. I tried three approaches, but none of them work.

first try


val bigQueryOptions = Map(
  "credentials"  -> credentials,
  "viewsEnabled" -> "true"
)

val dfReader = spark.read
  .options(bigQueryOptions)
  .format("bigquery")

val df = dfReader
  .load("SELECT * FROM `project.dataset.table` LIMIT 5")
  .cache()

When I run this code, I get the following error:

Exception in thread "main" com.google.cloud.spark.bigquery.repackaged.com.google.inject.ProvisionException: Unable to provision, see the following errors:

1) Error in custom provider, java.lang.IllegalArgumentException: 'dataset' not parsed or provided.
  at com.google.cloud.spark.bigquery.v2.SparkBigQueryConnectorModule.provideSparkBigQueryConfig(SparkBigQueryConnectorModule.java:68)
  while locating com.google.cloud.spark.bigquery.SparkBigQueryConfig

1 error
        at com.google.cloud.spark.bigquery.repackaged.com.google.inject.internal.InternalProvisionException.toProvisionException(InternalProvisionException.java:226)
        at com.google.cloud.spark.bigquery.repackaged.com.google.inject.internal.InjectorImpl$1.get(InjectorImpl.java:1097)
        at com.google.cloud.spark.bigquery.repackaged.com.google.inject.internal.InjectorImpl.getInstance(InjectorImpl.java:1131)
        at com.google.cloud.spark.bigquery.BigQueryRelationProvider.createRelationInternal(BigQueryRelationProvider.scala:76)
        at com.google.cloud.spark.bigquery.BigQueryRelationProvider.createRelation(BigQueryRelationProvider.scala:47)
        at org.apache.spark.sql.execution.datasources.DataSource.resolveRelation(DataSource.scala:354)
        at org.apache.spark.sql.DataFrameReader.loadV1Source(DataFrameReader.scala:326)
        at org.apache.spark.sql.DataFrameReader.$anonfun$load$3(DataFrameReader.scala:308)
        at scala.Option.getOrElse(Option.scala:189)
        at org.apache.spark.sql.DataFrameReader.load(DataFrameReader.scala:308)
        at org.apache.spark.sql.DataFrameReader.load(DataFrameReader.scala:240)

second try

val bigQueryOptions = Map(
  "credentials"  -> credentials,
  "viewsEnabled" -> "true",
  "query"        -> "SELECT * FROM `project.dataset.table` LIMIT 5"
)

val dfReader = spark.read
  .options(bigQueryOptions)
  .format("bigquery")

val df = dfReader
  .load()
  .cache()

This fails with the same error.

third try

val bigQueryOptions = Map(
  "credentials"  -> credentials,
  "viewsEnabled" -> "true",
  "query"        -> "SELECT * FROM `project.dataset.table` LIMIT 5"
)

val dfReader = spark.read
  .options(bigQueryOptions)
  .format("bigquery")

val df = dfReader
  .load("project.dataset.table")
  .cache()

This code runs, but behaves the same as

spark.read.format("bigquery").load("project.dataset.table").cache()

i.e., as if no query option had been provided.

Is this related to the Java or Scala version? I think the exception comes from the parseTableId function (https://github.com/GoogleCloudDataproc/spark-bigquery-connector/blob/master/connector/src/main/java/com/google/cloud/bigquery/connector/common/BigQueryUtil.java#L116).

I checked that “project.dataset.table” matches QUALIFIED_TABLE_REGEX (https://github.com/GoogleCloudDataproc/spark-bigquery-connector/blob/master/connector/src/main/java/com/google/cloud/bigquery/connector/common/BigQueryUtil.java#L56), so the pattern itself doesn’t seem to be the problem.
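To see why the first try fails, the table-id check can be sketched with a simplified pattern (an illustrative approximation, not the connector's exact QUALIFIED_TABLE_REGEX): a qualified table id must contain a dataset component, which is exactly what a raw SQL string passed to load() lacks.

```scala
// Illustrative sketch only -- NOT the connector's exact regex, just a
// simplified pattern in the same spirit as QUALIFIED_TABLE_REGEX.
object TableIdCheck {
  // project.dataset.table, where project may contain dashes
  private val QualifiedTable = """^([\w\-]+)\.(\w+)\.([\w\-]+)$""".r

  def parse(id: String): Option[(String, String, String)] = id match {
    case QualifiedTable(project, dataset, table) => Some((project, dataset, table))
    case _                                       => None
  }
}
```

A qualified id like "project.dataset.table" parses into (project, dataset, table), while a SQL string does not match at all, so no dataset can be extracted from it — consistent with the "'dataset' not parsed or provided" error.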

I’ve seen an issue (https://github.com/GoogleCloudDataproc/spark-bigquery-connector/issues/330) that was solved by using Dataproc image 1.4, but I don’t want to downgrade the image…

Issue Analytics

  • State: closed
  • Created 2 years ago
  • Comments: 5 (2 by maintainers)

Top GitHub Comments

1 reaction
davidrabinowitz commented, Apr 29, 2021

Please add the materializationDataset option (see here), and make sure that the user has write permission to this dataset.

We will fix the error message.
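The suggested fix can be sketched as follows (a minimal sketch: the scratch dataset name in the usage comment is hypothetical, and `credentials` and `spark` are assumed to exist in the caller's scope):

```scala
// Builds the option map for a query-based read, per the advice above:
// viewsEnabled must be "true", and materializationDataset must name a
// dataset the user has write permission on (the query result is
// materialized there as a temporary table).
def buildBigQueryOptions(credentials: String,
                         query: String,
                         materializationDataset: String): Map[String, String] =
  Map(
    "credentials"            -> credentials,
    "viewsEnabled"           -> "true",
    "materializationDataset" -> materializationDataset,
    "query"                  -> query
  )

// Usage (requires a live SparkSession and BigQuery access):
//   val df = spark.read
//     .format("bigquery")
//     .options(buildBigQueryOptions(credentials,
//       "SELECT * FROM `project.dataset.table` LIMIT 5",
//       "my_scratch_dataset"))   // hypothetical scratch dataset
//     .load()
//     .cache()
```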

0 reactions
amar-w3w commented, Jul 6, 2022

Thanks for the explanation @davidrabinowitz, this was really helpful. I have a few questions; hopefully someone can clarify them.

When using materializationDataset, I can see the Spark job produces tables like _bqc_*, created with an expiration time, which is reassuring.

However, my question / concern is about the querying cost in BigQuery.

This statement was a little concerning:

Important: This feature is implemented by running the query on BigQuery and 
saving the result into a temporary table, of which Spark will read the results from. 
This may add additional costs on your BigQuery account.

Would I be charged on the BigQuery side every time I use the df (backed by the _bqc_ table) for my multiple aggregations, or only once?

Thanks in advance 🙏
