Accessing GCS from Spark/Hadoop outside Google Cloud
My issue is superficially similar to #48, but it seems separate, so I'm filing it here.
I'm interested in reading some `gs://` URLs from a local Spark/Hadoop app.
I ran `gcloud auth application-default login` and got a key file.
Then I run `spark-shell --jars my-assembly.jar`, which includes this library correctly on the classpath.
Then in the spark-shell, I set the hadoop configs detailed in INSTALL.md:
```scala
val conf = sc.hadoopConfiguration
conf.set("fs.gs.impl", "com.google.cloud.hadoop.fs.gcs.GoogleHadoopFileSystem")
conf.set("fs.AbstractFileSystem.gs.impl", "com.google.cloud.hadoop.fs.gcs.GoogleHadoopFS")
conf.set("fs.gs.project.id", "<MY_PROJECT>") // actual project filled in
conf.set("google.cloud.auth.service.account.enable", "true")
conf.set("google.cloud.auth.service.account.json.keyfile", "/path/to/keyfile") // actual keyfile filled in

import org.apache.hadoop.fs.Path
val path = new Path("gs://BUCKET/OBJECT") // actual path filled in
val fs = path.getFileSystem(conf)
```
So far so good, but then actually accessing the object fails:
```
scala> fs.exists(path)
…
java.io.IOException: Error accessing: bucket: BUCKET, object: OBJECT
at com.google.cloud.hadoop.gcsio.GoogleCloudStorageImpl.wrapException(GoogleCloudStorageImpl.java:1706)
at com.google.cloud.hadoop.gcsio.GoogleCloudStorageImpl.getObject(GoogleCloudStorageImpl.java:1732)
at com.google.cloud.hadoop.gcsio.GoogleCloudStorageImpl.getItemInfo(GoogleCloudStorageImpl.java:1617)
at com.google.cloud.hadoop.gcsio.ForwardingGoogleCloudStorage.getItemInfo(ForwardingGoogleCloudStorage.java:214)
at com.google.cloud.hadoop.gcsio.GoogleCloudStorageFileSystem.getFileInfo(GoogleCloudStorageFileSystem.java:1093)
at com.google.cloud.hadoop.fs.gcs.GoogleHadoopFileSystemBase.getFileStatus(GoogleHadoopFileSystemBase.java:1413)
at org.apache.hadoop.fs.FileSystem.exists(FileSystem.java:1400)
... 48 elided
Caused by: com.google.api.client.googleapis.json.GoogleJsonResponseException: 401 Unauthorized
{
"code" : 401,
"errors" : [ {
"domain" : "global",
"location" : "Authorization",
"locationType" : "header",
"message" : "Anonymous users does not have storage.objects.get access to object BUCKET/OBJECT.",
"reason" : "required"
} ],
"message" : "Anonymous users does not have storage.objects.get access to object BUCKET/OBJECT."
}
at com.google.api.client.googleapis.json.GoogleJsonResponseException.from(GoogleJsonResponseException.java:145)
at com.google.api.client.googleapis.services.json.AbstractGoogleJsonClientRequest.newExceptionOnError(AbstractGoogleJsonClientRequest.java:113)
at com.google.api.client.googleapis.services.json.AbstractGoogleJsonClientRequest.newExceptionOnError(AbstractGoogleJsonClientRequest.java:40)
at com.google.api.client.googleapis.services.AbstractGoogleClientRequest$1.interceptResponse(AbstractGoogleClientRequest.java:321)
at com.google.api.client.http.HttpRequest.execute(HttpRequest.java:1056)
at com.google.api.client.googleapis.services.AbstractGoogleClientRequest.executeUnparsed(AbstractGoogleClientRequest.java:419)
at com.google.api.client.googleapis.services.AbstractGoogleClientRequest.executeUnparsed(AbstractGoogleClientRequest.java:352)
at com.google.api.client.googleapis.services.AbstractGoogleClientRequest.execute(AbstractGoogleClientRequest.java:469)
at com.google.cloud.hadoop.gcsio.GoogleCloudStorageImpl.getObject(GoogleCloudStorageImpl.java:1726)
... 53 more
```
I seemed to successfully OAuth with my gcloud-linked Google account when I ran `gcloud auth application-default login`, per the docs; why am I accessing GCS as an anonymous user?
Finally came back to this and got it working, thanks @dennishuo.
Steps:
1. Create and download a service-account JSON key file (per the service-account instructions at https://cloud.google.com/storage/docs/authentication#service_accounts).
2. Set the Hadoop configurations from INSTALL.md as above, with `google.cloud.auth.service.account.json.keyfile` pointing at the new key file.
More commonly, I set them on my `SparkConf` before creating a `SparkContext`, prepending `spark.hadoop.` to each key name (a sketch of this follows below). I also put `google-cloud-nio-0.20.1-alpha-shaded.jar` on my classpath so that the NIO filesystem APIs can hit `gs://` paths. Awesome to have both avenues working locally!
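A minimal sketch of the `SparkConf` route, assuming placeholder project/bucket/keyfile values and that the connector jar is already on the classpath:

```scala
import org.apache.spark.{SparkConf, SparkContext}

// Placeholders: the project ID, keyfile path, and bucket below are illustrative only.
val sparkConf = new SparkConf()
  .setAppName("gcs-local-example")
  .setMaster("local[*]")
  // Keys prefixed with "spark.hadoop." are copied into the Hadoop
  // Configuration that Spark hands to the GCS connector.
  .set("spark.hadoop.fs.gs.impl", "com.google.cloud.hadoop.fs.gcs.GoogleHadoopFileSystem")
  .set("spark.hadoop.fs.AbstractFileSystem.gs.impl", "com.google.cloud.hadoop.fs.gcs.GoogleHadoopFS")
  .set("spark.hadoop.fs.gs.project.id", "my-project")
  .set("spark.hadoop.google.cloud.auth.service.account.enable", "true")
  .set("spark.hadoop.google.cloud.auth.service.account.json.keyfile", "/path/to/service-account.json")

val sc = new SparkContext(sparkConf)
// Should now authenticate as the service account rather than anonymously.
sc.textFile("gs://my-bucket/some/object").count()
```

And a tiny sketch of the NIO avenue, assuming the shaded google-cloud-nio jar is on the classpath and default credentials are available to it:

```scala
import java.nio.file.{Files, Paths}

// The NIO provider registered for the "gs" scheme resolves gs:// URIs.
val bytes = Files.readAllBytes(Paths.get(new java.net.URI("gs://my-bucket/some/object")))
```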
Though the docs might be worded in a confusing way when referencing service accounts, `gcloud auth application-default login` is fundamentally a non-service-account auth flow: it depends on a "refreshToken" associated with real user credentials, and is known as the "offline" installation flow. The "offline" installation flow is characterized by having a `client_id` and a `client_secret`; if you look in your JSON file you'll see both of those fields. In contrast, `google.cloud.auth.service.account.json.keyfile` is fundamentally a service-account auth flow.

The recommended way to set up GCS access is indeed to use service accounts rather than "offline installation" user credentials, because service accounts are much cleaner to manage and revoke. See https://cloud.google.com/storage/docs/authentication#service_accounts for instructions on creating a service account and downloading a JSON keyfile for it; that is the keyfile you should use in your connector configuration, keeping `google.cloud.auth.service.account.enable=true`, and it should just work.

If you absolutely must use offline installed user credentials, you'll be using the OAuth2 installed-app flow, and you'll need to set the installed-app client credentials in the connector configuration instead of the service-account keyfile.
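The exact property names vary by connector version; the following is a minimal sketch, assuming the `google.cloud.auth.client.*` property names and using placeholder values, so verify the names against the connector release you're on:

```scala
// Hedged sketch: assumed property names for the installed-app (offline user
// credential) flow; verify against your GCS connector version.
conf.set("google.cloud.auth.service.account.enable", "false")
// client_id / client_secret of the installed-app OAuth client -- the same
// fields visible in the gcloud-generated JSON file.
conf.set("google.cloud.auth.client.id", "<CLIENT_ID>")
conf.set("google.cloud.auth.client.secret", "<CLIENT_SECRET>")
// Hypothetical path where the connector can store the credential it obtains;
// pick a path that gcloud does not already own (see the note below).
conf.set("google.cloud.auth.client.file", "/path/to/connector-oauth-credential.json")
```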
Even if gcloud has already generated a credential file, due to differences in formatting the GCS connector will likely still walk through the browser-redirect flow and you'll have to authorize the app again; this means you probably want to use a different path than the location of the "gcloud auth login" generated file. In any case, this flow requires standard input to be open, since it does a browser-redirect dance and requires you to paste in a token.