Accessing GCS from Spark/Hadoop outside Google Cloud


My issue is superficially similar to #48, but seems separate so I’m filing here.

I’m interested in reading some gs:// URLs from a local Spark/Hadoop app.

I ran gcloud auth application-default login and got a key file.

Then I ran spark-shell --jars my-assembly.jar, which correctly includes this library on the classpath.

Then in the spark-shell, I set the hadoop configs detailed in INSTALL.md:

val conf = sc.hadoopConfiguration
conf.set("fs.gs.impl", "com.google.cloud.hadoop.fs.gcs.GoogleHadoopFileSystem")
conf.set("fs.AbstractFileSystem.gs.impl", "com.google.cloud.hadoop.fs.gcs.GoogleHadoopFS")
conf.set("fs.gs.project.id", "<MY_PROJECT>")  // actual project filled in
conf.set("google.cloud.auth.service.account.enable", "true")
conf.set("google.cloud.auth.service.account.json.keyfile", "/path/to/keyfile")  // actual keyfile filled in

import org.apache.hadoop.fs.Path
val path = new Path("gs://BUCKET/OBJECT")  // actual path filled in
val fs = path.getFileSystem(conf)

So far so good, but then actually accessing the object fails:

scala> fs.exists(path)
…
java.io.IOException: Error accessing: bucket: BUCKET, object: OBJECT
  at com.google.cloud.hadoop.gcsio.GoogleCloudStorageImpl.wrapException(GoogleCloudStorageImpl.java:1706)
  at com.google.cloud.hadoop.gcsio.GoogleCloudStorageImpl.getObject(GoogleCloudStorageImpl.java:1732)
  at com.google.cloud.hadoop.gcsio.GoogleCloudStorageImpl.getItemInfo(GoogleCloudStorageImpl.java:1617)
  at com.google.cloud.hadoop.gcsio.ForwardingGoogleCloudStorage.getItemInfo(ForwardingGoogleCloudStorage.java:214)
  at com.google.cloud.hadoop.gcsio.GoogleCloudStorageFileSystem.getFileInfo(GoogleCloudStorageFileSystem.java:1093)
  at com.google.cloud.hadoop.fs.gcs.GoogleHadoopFileSystemBase.getFileStatus(GoogleHadoopFileSystemBase.java:1413)
  at org.apache.hadoop.fs.FileSystem.exists(FileSystem.java:1400)
  ... 48 elided
Caused by: com.google.api.client.googleapis.json.GoogleJsonResponseException: 401 Unauthorized
{
  "code" : 401,
  "errors" : [ {
    "domain" : "global",
    "location" : "Authorization",
    "locationType" : "header",
    "message" : "Anonymous users does not have storage.objects.get access to object BUCKET/OBJECT.",
    "reason" : "required"
  } ],
  "message" : "Anonymous users does not have storage.objects.get access to object BUCKET/OBJECT."
}
  at com.google.api.client.googleapis.json.GoogleJsonResponseException.from(GoogleJsonResponseException.java:145)
  at com.google.api.client.googleapis.services.json.AbstractGoogleJsonClientRequest.newExceptionOnError(AbstractGoogleJsonClientRequest.java:113)
  at com.google.api.client.googleapis.services.json.AbstractGoogleJsonClientRequest.newExceptionOnError(AbstractGoogleJsonClientRequest.java:40)
  at com.google.api.client.googleapis.services.AbstractGoogleClientRequest$1.interceptResponse(AbstractGoogleClientRequest.java:321)
  at com.google.api.client.http.HttpRequest.execute(HttpRequest.java:1056)
  at com.google.api.client.googleapis.services.AbstractGoogleClientRequest.executeUnparsed(AbstractGoogleClientRequest.java:419)
  at com.google.api.client.googleapis.services.AbstractGoogleClientRequest.executeUnparsed(AbstractGoogleClientRequest.java:352)
  at com.google.api.client.googleapis.services.AbstractGoogleClientRequest.execute(AbstractGoogleClientRequest.java:469)
  at com.google.cloud.hadoop.gcsio.GoogleCloudStorageImpl.getObject(GoogleCloudStorageImpl.java:1726)
  ... 53 more

I seemed to successfully authenticate with my gcloud-linked Google account when I ran gcloud auth application-default login, per the docs; why am I accessing GCS as an anonymous user?

Issue Analytics

  • State: closed
  • Created: 6 years ago
  • Comments: 18 (5 by maintainers)

Top GitHub Comments

5 reactions
ryan-williams commented, Jul 12, 2017

Finally came back to this and got it working, thanks @dennishuo.

Steps:

  • create and download a service-account JSON key file, per the linked instructions

  • set these Hadoop configurations:

    fs.gs.project.id: <project>
    google.cloud.auth.service.account.enable: true
    google.cloud.auth.service.account.json.keyfile: <path-to-key.json>
    

    More commonly, I set them on my SparkConf before creating a SparkContext, prepending spark.hadoop. to each key name (see the sketch just below).
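
For reference, here's a minimal sketch of that SparkConf approach (the project id, keyfile path, app name, and master are placeholders, not values from this thread):

import org.apache.spark.{SparkConf, SparkContext}

// Each spark.hadoop.-prefixed key is stripped of the prefix and copied into
// sc.hadoopConfiguration, so the GCS connector sees the same settings as above.
val sparkConf = new SparkConf()
  .setAppName("gcs-local-example")
  .setMaster("local[*]")
  .set("spark.hadoop.fs.gs.impl", "com.google.cloud.hadoop.fs.gcs.GoogleHadoopFileSystem")
  .set("spark.hadoop.fs.AbstractFileSystem.gs.impl", "com.google.cloud.hadoop.fs.gcs.GoogleHadoopFS")
  .set("spark.hadoop.fs.gs.project.id", "<MY_PROJECT>")
  .set("spark.hadoop.google.cloud.auth.service.account.enable", "true")
  .set("spark.hadoop.google.cloud.auth.service.account.json.keyfile", "/path/to/key.json")

val sc = new SparkContext(sparkConf)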

I also put google-cloud-nio-0.20.1-alpha-shaded.jar on my classpath to have NIO filesystem APIs able to hit gs:// paths.
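
Roughly, that lets plain java.nio.file calls resolve gs:// URIs; a hedged sketch, assuming the shaded jar is on the classpath and application-default credentials are available (BUCKET/OBJECT are placeholders):

import java.net.URI
import java.nio.file.{Files, Paths}

// google-cloud-nio registers a provider for the "gs" scheme, so a gs:// URI
// can be turned into a java.nio.file.Path and read like any other file.
val nioPath = Paths.get(URI.create("gs://BUCKET/OBJECT"))
val bytes = Files.readAllBytes(nioPath)
println(s"read ${bytes.length} bytes via NIO")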

Awesome to have both avenues working locally!

2 reactions
dennishuo commented, May 6, 2017

Though the docs might be worded in a confusing way when they reference service accounts, gcloud auth application-default login is fundamentally a non-service-account auth flow: it depends on a “refreshToken” associated with real user credentials and is known as the “offline” installed-app flow. The “offline” flow is characterized by having a client_id and a client_secret; if you look in your JSON file you’ll see both of those fields. In contrast, google.cloud.auth.service.account.json.keyfile is fundamentally a service-account auth flow.

The recommended way to set up GCS access is indeed to use service accounts rather than “offline installation” user credentials, because service accounts are much cleaner to manage and revoke. See https://cloud.google.com/storage/docs/authentication#service_accounts for instructions on creating a service account and downloading a JSON keyfile for it; that’s the keyfile you should ideally use in your connector configuration, keeping google.cloud.auth.service.account.enable=true, and it should just work.

If you absolutely must use offline installed user credentials, you’ll be using the oauth2 installed app flow and you’ll need to set:

google.cloud.auth.service.account.enable=false
google.cloud.auth.client.id=<your client id, or you can borrow the client id that came from gcloud's generated "gcloud auth application-default login" keyfile>
google.cloud.auth.client.secret=<your client secret, or you can borrow the client secret that came from gcloud's generated "gcloud auth application-default login" keyfile>
google.cloud.auth.client.file=<a new file path or the same one gcloud generated>

Even if gcloud already generated it, due to differences in formatting, the GCS connector will likely still walk through the browser-redirect flow and you’ll have to authorize the app again; this means you probably want to use a different path than the location of the “gcloud auth login” generated file. In any case, this flow will require standard input to be open, since it will do a browser redirect dance and require you to input a token.
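
For illustration only, a sketch of applying those installed-app settings from spark-shell (the client id, client secret, and credential file path are placeholders):

val hadoopConf = sc.hadoopConfiguration
hadoopConf.set("google.cloud.auth.service.account.enable", "false")
hadoopConf.set("google.cloud.auth.client.id", "<CLIENT_ID>")
hadoopConf.set("google.cloud.auth.client.secret", "<CLIENT_SECRET>")
// Pointing this at a fresh path lets the connector write its own credential
// file after the browser-redirect flow completes.
hadoopConf.set("google.cloud.auth.client.file", "/path/to/connector-credential.json")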
