Stuck on an issue?

Lightrun Answers was designed to reduce the constant Googling that comes with debugging third-party libraries. It collects links to all the places you might be looking while hunting down a tough bug.

And, if you’re still stuck at the end, we’re happy to hop on a call to see how we can help out.

java.io.IOException: Error getting 'BUCKET_GLOBAL_IDENTIFIER' bucket

See original GitHub issue

I want Spark to export data to Google Cloud Storage instead of saving it on HDFS. To achieve this, I have installed the Google Cloud Storage Connector for Spark. Here's a code sample, run inside a Spark context, which I use to save a dataframe to a bucket:

import spark.implicits._  // for .toDF (already in scope in spark-shell)

val someDF = Seq(
    (8, "bat"),
    (64, "mouse"),
    (-27, null)
).toDF("number", "word")

// PROJECT_ID and LOCATION_TO_KEY_JSON are placeholders for the GCP project ID
// and the local path to the service account JSON key file
val conf = sc.hadoopConfiguration
conf.set("fs.AbstractFileSystem.gs.impl", "com.google.cloud.hadoop.fs.gcs.GoogleHadoopFS")
conf.set("fs.gs.project.id", PROJECT_ID)
conf.set("fs.gs.auth.service.account.enable", "true")
conf.set("fs.gs.auth.service.account.json.keyfile", LOCATION_TO_KEY_JSON)

someDF
    .write
    .format("parquet")
    .mode("overwrite")
    .save(s"gs://BUCKET_GLOBAL_IDENTIFIER/A_FOLDER_IN_A_BUCKET/")

I receive a rather cryptic exception after the code is executed:

java.io.IOException: Error getting 'BUCKET_GLOBAL_IDENTIFIER' bucket
  at com.google.cloud.hadoop.gcsio.GoogleCloudStorageImpl$8.onFailure(GoogleCloudStorageImpl.java:1633)
  at com.google.cloud.hadoop.gcsio.BatchHelper.execute(BatchHelper.java:183)
  at com.google.cloud.hadoop.gcsio.BatchHelper.lambda$queue$0(BatchHelper.java:163)
  at java.util.concurrent.FutureTask.run(FutureTask.java:266)
  at java.util.concurrent.ThreadPoolExecutor.runWorker(ThreadPoolExecutor.java:1142)
  at java.util.concurrent.ThreadPoolExecutor$Worker.run(ThreadPoolExecutor.java:617)
  at java.lang.Thread.run(Thread.java:745)
Caused by: com.google.api.client.googleapis.json.GoogleJsonResponseException
  at com.google.cloud.hadoop.gcsio.GoogleCloudStorageExceptions.createJsonResponseException(GoogleCloudStorageExceptions.java:82)
  at com.google.cloud.hadoop.gcsio.GoogleCloudStorageImpl$8.onFailure(GoogleCloudStorageImpl.java:1624)
  ... 6 more

Could anyone give me a clue on how to tackle this? Here's a list of issues I've already solved to get to this point:

  • The key could not be accessed by Spark. The issue was that it was not available on the physical nodes that Spark was running on (see the verification sketch after this list).
  • The GCS service account used by the Spark connector did not have permission to create a bucket. The issue was solved by saving the data to an already existing bucket.
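
For the first bullet, a quick sanity check is to run a small job that looks for the key file on every worker. This is a minimal sketch, assuming a hypothetical key location (/etc/keys/sa-key.json) and that sc is the active SparkContext; note that Spark does not guarantee tasks land on every node, so this is a best-effort probe:

import java.io.File
import java.net.InetAddress

// Hypothetical path; substitute the keyfile location passed to the connector
val keyPath = "/etc/keys/sa-key.json"

// Spread tasks across the cluster (coverage of every node is not guaranteed)
// and report hosts where the key file is missing
val missingHosts = sc.parallelize(1 to 100, 100)
  .map(_ => (InetAddress.getLocalHost.getHostName, new File(keyPath).exists))
  .filter { case (_, found) => !found }
  .map { case (host, _) => host }
  .distinct()
  .collect()

println(s"Nodes missing the key file: ${missingHosts.mkString(", ")}")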

Issue Analytics

  • State: closed
  • Created 4 years ago
  • Comments: 6 (2 by maintainers)

Top GitHub Comments

medb commented, Jan 13, 2020 (1 reaction)

This error means that the configured service account doesn't have access to the <BUCKET_GLOBAL_IDENTIFIER> bucket or doesn't have permission to perform bucket get requests.
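
One way to trigger that bucket metadata check in isolation is to ask the Hadoop FileSystem for the bucket's status. This is a minimal sketch, assuming the configuration from the question is already applied; if access is the problem, it should fail with the same permission error, outside the batched write path:

import java.net.URI
import org.apache.hadoop.fs.{FileSystem, Path}

// Resolve the GCS filesystem using the credentials configured earlier
val fs = FileSystem.get(new URI("gs://BUCKET_GLOBAL_IDENTIFIER/"), sc.hadoopConfiguration)

// Requesting the status of the bucket root performs a bucket metadata lookup,
// so a missing storage permission should surface here with a clearer message
println(fs.getFileStatus(new Path("gs://BUCKET_GLOBAL_IDENTIFIER/")))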

Could you test your configuration by specifying a non-existent bucket? (The GCS connector should create the bucket by itself in that case.)
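
As an illustration of that test, here is a sketch that reuses someDF from the question and points the same write at a bucket name that does not exist yet (the name below is hypothetical):

// Hypothetical bucket name that must not already exist in the project
someDF
  .write
  .format("parquet")
  .mode("overwrite")
  .save("gs://some-nonexistent-test-bucket-12345/probe/")

// If this succeeds, the service account can create buckets at the project level,
// which narrows the original failure down to bucket-level access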

medb commented, Mar 18, 2021 (0 reactions)

@ashishdeok15, please open a new issue and provide a detailed description, including code snippets of what you are doing and the exception with stack trace that you are facing.

Read more comments on GitHub

Top Results From Across the Web

  • java.io.IOException: Error getting ...
    This error means that configured service account doesn't have access to the <BUCKET_GLOBAL_IDENTIFIER> bucket or doesn't have permissions to ...
  • Amazon EMR and Hive: Getting a "java.io.IOException"
    The input files must be directly in the input directory or Amazon S3 bucket that you specify, not in sub-directories. According to this...
  • Error in accessing google cloud storage bucket via hadoop fs
    Hi, I am getting the below error while accessing a Google Cloud Storage bucket for the first time via Cloudera CDH 6.3.3 Hadoop...
  • Error: "Bucket is a requester pays bucket but no user project ..."
    Hi! I am trying to annotate a matrix with CADD scores. db = hl.experimental.DB(region='us', cloud='gcp') mt = db.annotate_rows_db(mt, 'CADD') Tried to ...
  • Troubleshooting | VPC Service Controls - Google Cloud
    Using the error's unique ID; Filter logs using metadata ... java.io.IOException: Error accessing: bucket: corp-resources-public-1, object: out.txt
