"Reading data from a BigQuery query" not working: dataset not parsed or provided
I’m using:
- Java 8
- Scala 2.12
- Dataproc image: 2.0-debian10
In my project, I have to send a query to BigQuery without using Spark SQL. I tried three ways, but none of them works.
First try:
val bigQueryOptions = Map(
  "credentials" -> credentials,
  "viewsEnabled" -> "true"
)

val df = spark.read
  .options(bigQueryOptions)
  .format("bigquery")
  .load("SELECT * FROM `project.dataset.table` LIMIT 5")
  .cache()
When I run this code, I get the error below.
Exception in thread "main" com.google.cloud.spark.bigquery.repackaged.com.google.inject.ProvisionException: Unable to provision, see the following errors:
1) Error in custom provider, java.lang.IllegalArgumentException: 'dataset' not parsed or provided.
at com.google.cloud.spark.bigquery.v2.SparkBigQueryConnectorModule.provideSparkBigQueryConfig(SparkBigQueryConnectorModule.java:68)
while locating com.google.cloud.spark.bigquery.SparkBigQueryConfig
1 error
at com.google.cloud.spark.bigquery.repackaged.com.google.inject.internal.InternalProvisionException.toProvisionException(InternalProvisionException.java:226)
at com.google.cloud.spark.bigquery.repackaged.com.google.inject.internal.InjectorImpl$1.get(InjectorImpl.java:1097)
at com.google.cloud.spark.bigquery.repackaged.com.google.inject.internal.InjectorImpl.getInstance(InjectorImpl.java:1131)
at com.google.cloud.spark.bigquery.BigQueryRelationProvider.createRelationInternal(BigQueryRelationProvider.scala:76)
at com.google.cloud.spark.bigquery.BigQueryRelationProvider.createRelation(BigQueryRelationProvider.scala:47)
at org.apache.spark.sql.execution.datasources.DataSource.resolveRelation(DataSource.scala:354)
at org.apache.spark.sql.DataFrameReader.loadV1Source(DataFrameReader.scala:326)
at org.apache.spark.sql.DataFrameReader.$anonfun$load$3(DataFrameReader.scala:308)
at scala.Option.getOrElse(Option.scala:189)
at org.apache.spark.sql.DataFrameReader.load(DataFrameReader.scala:308)
at org.apache.spark.sql.DataFrameReader.load(DataFrameReader.scala:240)
Second try:
val bigQueryOptions = Map(
  "credentials" -> credentials,
  "viewsEnabled" -> "true",
  "query" -> "SELECT * FROM `project.dataset.table` LIMIT 5"
)

val df = spark.read
  .options(bigQueryOptions)
  .format("bigquery")
  .load()
  .cache()
This doesn’t work either; it fails with the same error.
Third try:
val bigQueryOptions = Map(
  "credentials" -> credentials,
  "viewsEnabled" -> "true",
  "query" -> "SELECT * FROM `project.dataset.table` LIMIT 5"
)

val df = spark.read
  .options(bigQueryOptions)
  .format("bigquery")
  .load("project.dataset.table")
  .cache()
This code runs, but it behaves the same as spark.read.format("bigquery").load("project.dataset.table").cache() with no query option; the query option is simply ignored.
Is this related to the Java or Scala version?
I think the exception comes from the parseTableId function (https://github.com/GoogleCloudDataproc/spark-bigquery-connector/blob/master/connector/src/main/java/com/google/cloud/bigquery/connector/common/BigQueryUtil.java#L116). I checked that "project.dataset.table" matches QUALIFIED_TABLE_REGEX (https://github.com/GoogleCloudDataproc/spark-bigquery-connector/blob/master/connector/src/main/java/com/google/cloud/bigquery/connector/common/BigQueryUtil.java#L56), so the table id itself doesn’t seem to be the problem.
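For context, here is a minimal sketch of that check; the pattern below is an illustrative stand-in for QUALIFIED_TABLE_REGEX, not the connector’s exact code. It shows why a plain table id matches while a SQL string does not, which would presumably force the connector to look for the dataset elsewhere (the materializationDataset option mentioned in the answer below):

import java.util.regex.Pattern

// Illustrative approximation of QUALIFIED_TABLE_REGEX: an optional
// "project." or "project:" prefix, then "dataset.table".
// Not copied from the connector source.
val qualifiedTable = Pattern.compile("^((\\S+)[:.])?(\\w+)\\.([\\S&&[^.:]]+)$")

println(qualifiedTable.matcher("project.dataset.table").matches())                        // true
println(qualifiedTable.matcher("SELECT * FROM `project.dataset.table` LIMIT 5").matches()) // false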
I’ve seen an issue (https://github.com/GoogleCloudDataproc/spark-bigquery-connector/issues/330) that was solved by using Dataproc image 1.4, but I don’t want to downgrade the image…
Please add the materializationDataset option (see here), and make sure that the user has write permission to this dataset. We will fix the error message.
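For reference, a minimal sketch of what the suggested fix looks like in Scala; the dataset name my_scratch_dataset is an assumption, and any dataset the user can write to should work:

val df = spark.read
  .format("bigquery")
  .option("credentials", credentials)
  .option("viewsEnabled", "true")
  // Assumption: "my_scratch_dataset" is a dataset the caller can write to;
  // the connector materializes the query result there as a temporary _bqc_* table.
  .option("materializationDataset", "my_scratch_dataset")
  .load("SELECT * FROM `project.dataset.table` LIMIT 5")
  .cache()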
Thanks for the explanation @davidrabinowitz, this was really helpful. I have some questions; I believe someone can clarify this.
Upon using materializationDataset, I can see the Spark job produces a table like _bqc_*, created with an expiration, which is reassuring. However, my question/concern is about the querying cost in BigQuery, and one statement was a little concerning. Would I be charged on the BigQuery side every time I use the df (_bqc_*) for my multiple aggregations, or only once? Thanks in advance 🙏
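A hedged note on the cost question, based on my understanding rather than a maintainer confirmation: the connector runs the query once to materialize the _bqc_* table, and subsequent reads of the DataFrame scan that table through the Storage Read API rather than re-running the query; caching the DataFrame in Spark additionally avoids re-reading the table for every aggregation. A sketch, where the dataset and column names are illustrative assumptions:

import org.apache.spark.sql.functions.sum

val df = spark.read
  .format("bigquery")
  .option("viewsEnabled", "true")
  .option("materializationDataset", "my_scratch_dataset") // assumed scratch dataset
  .load("SELECT * FROM `project.dataset.table`")
  .cache() // keep rows in Spark so the aggregations below don't re-read BigQuery

// Illustrative aggregations; "country" and "amount" are assumed column names.
val byCountry = df.groupBy("country").count()
val total = df.agg(sum("amount"))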