Stuck on an issue?

Lightrun Answers was designed to reduce the constant Googling that comes with debugging third-party libraries. It collects links to all the places you might be looking while hunting down a tough bug.

And, if you’re still stuck at the end, we’re happy to hop on a call to see how we can help out.

java.io.IOException: Error getting 'BUCKET_GLOBAL_IDENTIFIER' bucket

See original GitHub issue

I want Spark to export data to Google Cloud Storage instead of saving it on HDFS. To achieve this, I have installed the Google Cloud Storage Connector for Spark. Here's a code sample, run inside a Spark context, which I use to save a dataframe to a bucket:

import spark.implicits._  // for .toDF (already in scope in spark-shell)

val someDF = Seq(
    (8, "bat"),
    (64, "mouse"),
    (-27, null)
).toDF("number", "word")

// PROJECT_ID and LOCATION_TO_KEY_JSON are placeholders for the GCP project ID
// and the local path to the service account JSON key file
val conf = sc.hadoopConfiguration
conf.set("fs.AbstractFileSystem.gs.impl", "com.google.cloud.hadoop.fs.gcs.GoogleHadoopFS")
conf.set("fs.gs.project.id", PROJECT_ID)
conf.set("fs.gs.auth.service.account.enable", "true")
conf.set("fs.gs.auth.service.account.json.keyfile", LOCATION_TO_KEY_JSON)

someDF
    .write
    .format("parquet")
    .mode("overwrite")
    .save(s"gs://BUCKET_GLOBAL_IDENTIFIER/A_FOLDER_IN_A_BUCKET/")

I receive a rather cryptic exception after the code is executed:

java.io.IOException: Error getting 'BUCKET_GLOBAL_IDENTIFIER' bucket
  at com.google.cloud.hadoop.gcsio.GoogleCloudStorageImpl$8.onFailure(GoogleCloudStorageImpl.java:1633)
  at com.google.cloud.hadoop.gcsio.BatchHelper.execute(BatchHelper.java:183)
  at com.google.cloud.hadoop.gcsio.BatchHelper.lambda$queue$0(BatchHelper.java:163)
  at java.util.concurrent.FutureTask.run(FutureTask.java:266)
  at java.util.concurrent.ThreadPoolExecutor.runWorker(ThreadPoolExecutor.java:1142)
  at java.util.concurrent.ThreadPoolExecutor$Worker.run(ThreadPoolExecutor.java:617)
  at java.lang.Thread.run(Thread.java:745)
Caused by: com.google.api.client.googleapis.json.GoogleJsonResponseException
  at com.google.cloud.hadoop.gcsio.GoogleCloudStorageExceptions.createJsonResponseException(GoogleCloudStorageExceptions.java:82)
  at com.google.cloud.hadoop.gcsio.GoogleCloudStorageImpl$8.onFailure(GoogleCloudStorageImpl.java:1624)
  ... 6 more

Could anyone give me a clue on how to tackle this? Here's a list of issues I've already solved to get to this point:

  • The key could not be accessed by Spark. The issue was that it was not available on the physical nodes that Spark was running on (see the verification sketch after this list).
  • The GCS service account used by the Spark connector did not have permission to create a bucket. The issue was solved by saving the data to an already existing bucket.
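
For the first bullet, a quick sanity check is to run a small job that looks for the key file on every worker. This is a minimal sketch, assuming a hypothetical key location (/etc/keys/sa-key.json) and that sc is the active SparkContext; note that Spark does not guarantee tasks land on every node, so this is a best-effort probe:

import java.io.File
import java.net.InetAddress

// Hypothetical path; substitute the keyfile location passed to the connector
val keyPath = "/etc/keys/sa-key.json"

// Spread tasks across the cluster (coverage of every node is not guaranteed)
// and report hosts where the key file is missing
val missingHosts = sc.parallelize(1 to 100, 100)
  .map(_ => (InetAddress.getLocalHost.getHostName, new File(keyPath).exists))
  .filter { case (_, found) => !found }
  .map { case (host, _) => host }
  .distinct()
  .collect()

println(s"Nodes missing the key file: ${missingHosts.mkString(", ")}")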

Issue Analytics

  • State: closed
  • Created 4 years ago
  • Comments: 6 (2 by maintainers)

Top GitHub Comments

medb commented, Jan 13, 2020 (1 reaction)

This error means that the configured service account doesn't have access to the <BUCKET_GLOBAL_IDENTIFIER> bucket or doesn't have permission to perform bucket get requests.
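
One way to trigger that bucket metadata check in isolation is to ask the Hadoop FileSystem for the bucket's status. This is a minimal sketch, assuming the configuration from the question is already applied; if access is the problem, it should fail with the same permission error, outside the batched write path:

import java.net.URI
import org.apache.hadoop.fs.{FileSystem, Path}

// Resolve the GCS filesystem using the credentials configured earlier
val fs = FileSystem.get(new URI("gs://BUCKET_GLOBAL_IDENTIFIER/"), sc.hadoopConfiguration)

// Requesting the status of the bucket root performs a bucket metadata lookup,
// so a missing storage permission should surface here with a clearer message
println(fs.getFileStatus(new Path("gs://BUCKET_GLOBAL_IDENTIFIER/")))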

Could you test your configuration by specifying a non-existent bucket? (The GCS connector should create the bucket by itself in that case.)
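
As an illustration of that test, here is a sketch that reuses someDF from the question and points the same write at a bucket name that does not exist yet (the name below is hypothetical):

// Hypothetical bucket name that must not already exist in the project
someDF
  .write
  .format("parquet")
  .mode("overwrite")
  .save("gs://some-nonexistent-test-bucket-12345/probe/")

// If this succeeds, the service account can create buckets at the project level,
// which narrows the original failure down to bucket-level access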

medb commented, Mar 18, 2021 (0 reactions)

@ashishdeok15, please open a new issue and provide a detailed description, including code snippets of what you are doing and the exception with stack trace that you are facing.

Read more comments on GitHub

Top Results From Across the Web

  • java.io.IOException: Error getting ...
    This error means that configured service account doesn't have access to the <BUCKET_GLOBAL_IDENTIFIER> bucket or doesn't have permissions to ...
  • Amazon EMR and Hive: Getting a "java.io.IOException"
    The input files must be directly in the input directory or Amazon S3 bucket that you specify, not in sub-directories. According to this...
  • Error in accessing google cloud storage bucket via hadoop fs
    Hi, I am getting the below error while accessing a Google Cloud Storage bucket for the first time via Cloudera CDH 6.3.3 Hadoop...
  • Error: "Bucket is a requester pays bucket but no user project ..."
    Hi! I am trying to annotate a matrix with CADD scores. db = hl.experimental.DB(region='us', cloud='gcp') mt = db.annotate_rows_db(mt, 'CADD') Tried to ...
  • Troubleshooting | VPC Service Controls - Google Cloud
    Using the error's unique ID; Filter logs using metadata ... java.io.IOException: Error accessing: bucket: corp-resources-public-1, object: out.txt
