Accessing Google storage by Spark from outside Google cloud

See original GitHub issue

Hi guys,

We’re trying to access Google Cloud Storage from within a Spark on YARN job (writing to gs://…) on a cluster that resides outside Google Cloud.

We have set up the correct service account and credentials but are still facing some issues:

The spark.hadoop.google.cloud.auth.service.account.keyfile property points to the credentials file on the Spark driver, but the Spark executors (workers running on different servers) still try to access the same file path, which doesn’t exist there. We got it to work by placing the credentials file at the exact same location on both the driver and the workers, but this is not practical and was only a temporary workaround.

Is there any delegation token mechanism by which the driver authenticates with Google Cloud and sends the token to the workers, so they don’t need to have the same credential key at the exact same path?

We also tried uploading the credential file (p12 or JSON) to the workers and setting spark.executorEnv.GOOGLE_APPLICATION_CREDENTIALS or spark.executor.extraJavaOptions to that file path (different from the driver file path), but we’re getting:

java.io.IOException: Error getting access token from metadata server at: http://metadata/computeMetadata/v1/instance/service-accounts/default/token
	at com.google.cloud.hadoop.util.CredentialFactory.getCredentialFromMetadataServiceAccount(CredentialFactory.java:87)
	at com.google.cloud.hadoop.util.CredentialConfiguration.getCredential(CredentialConfiguration.java:68)
	at com.google.cloud.hadoop.fs.gcs.GoogleHadoopFileSystemBase.configure(GoogleHadoopFileSystemBase.java:1319)
	at com.google.cloud.hadoop.fs.gcs.GoogleHadoopFileSystemBase.initialize(GoogleHadoopFileSystemBase.java:549)
	at com.google.cloud.hadoop.fs.gcs.GoogleHadoopFileSystemBase.initialize(GoogleHadoopFileSystemBase.java:512)
	at org.apache.hadoop.fs.FileSystem.createFileSystem(FileSystem.java:2696)
	at org.apache.hadoop.fs.FileSystem.access$200(FileSystem.java:94)
	at org.apache.hadoop.fs.FileSystem$Cache.getInternal(FileSystem.java:2733)
	at org.apache.hadoop.fs.FileSystem$Cache.get(FileSystem.java:2715)
	at org.apache.hadoop.fs.FileSystem.get(FileSystem.java:382)
Caused by: java.net.UnknownHostException: metadata
	at java.net.AbstractPlainSocketImpl.connect(AbstractPlainSocketImpl.java:184)
	at java.net.SocksSocketImpl.connect(SocksSocketImpl.java:392)
	at java.net.Socket.connect(Socket.java:589)
	at sun.net.NetworkClient.doConnect(NetworkClient.java:175)
	at sun.net.www.http.HttpClient.openServer(HttpClient.java:432)
	at sun.net.www.http.HttpClient.openServer(HttpClient.java:527)
	at sun.net.www.http.HttpClient.<init>(HttpClient.java:211)
	at sun.net.www.http.HttpClient.New(HttpClient.java:308)
	at sun.net.www.http.HttpClient.New(HttpClient.java:326)
	at sun.net.www.protocol.http.HttpURLConnection.getNewHttpClient(HttpURLConnection.java:1169)
	at sun.net.www.protocol.http.HttpURLConnection.plainConnect0(HttpURLConnection.java:1105)
	at sun.net.www.protocol.http.HttpURLConnection.plainConnect(HttpURLConnection.java:999)
	at sun.net.www.protocol.http.HttpURLConnection.connect(HttpURLConnection.java:933)
	at com.google.api.client.http.javanet.NetHttpRequest.execute(NetHttpRequest.java:93)
	at com.google.api.client.http.HttpRequest.execute(HttpRequest.java:965)
	at com.google.api.client.googleapis.compute.ComputeCredential.executeRefreshToken(ComputeCredential.java:87)
	at com.google.api.client.auth.oauth2.Credential.refreshToken(Credential.java:489)
	at com.google.cloud.hadoop.util.CredentialFactory.getCredentialFromMetadataServiceAccount(CredentialFactory.java:85)
	... 14 more
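
The trace above shows the connector falling back to the Compute Engine metadata server (CredentialFactory.getCredentialFromMetadataServiceAccount) and failing because the hostname "metadata" does not resolve. That hostname is normally only resolvable on Google Cloud machines, so on an outside cluster this fallback cannot succeed. As a minimal diagnostic sketch (the function name is hypothetical and not part of any library), the following Python snippet checks whether a worker can resolve that host at all:

```python
import socket


def metadata_host_resolves(host="metadata"):
    """Return True if `host` resolves to an address from this machine.

    `metadata` is the hostname taken from the stack trace above.
    """
    try:
        socket.gethostbyname(host)
        return True
    except OSError:  # socket.gaierror is a subclass of OSError
        return False


if __name__ == "__main__":
    print("metadata host resolvable:", metadata_host_resolves())
```

If this prints False on an executor host, the UnknownHostException is expected there, and the likely question becomes why the keyfile setting was not picked up by that JVM rather than why the metadata call failed.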

Is there any documentation for this use case that we missed?

Thanks,
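
One possible way around the same-path requirement described above, offered as an untested sketch and not as something confirmed in this thread, is to ship the key file with the job using spark-submit's --files option and reference it by bare file name. This assumes YARN cluster deploy mode (so the driver also runs in a container that receives the file) and assumes the connector resolves a relative keyfile path against the container's working directory. The helper name build_submit_args and the example paths are hypothetical; the property name is the one quoted in the question.

```python
import ntpath

# Property name as quoted in the question above.
KEYFILE_PROP = "spark.hadoop.google.cloud.auth.service.account.keyfile"


def build_submit_args(local_keyfile, app):
    """Build a spark-submit argument list that distributes the key file.

    Assumption: files passed via --files are placed in each YARN
    container's working directory, so the bare file name is a valid
    relative path there.
    """
    # ntpath.basename handles both "/" and "\" separators, so the same
    # code works when submitting from a Windows or a Linux client.
    key_name = ntpath.basename(local_keyfile)
    return [
        "spark-submit",
        "--master", "yarn",
        "--deploy-mode", "cluster",
        "--files", local_keyfile,
        "--conf", f"{KEYFILE_PROP}={key_name}",
        app,
    ]


if __name__ == "__main__":
    print(" ".join(build_submit_args("/etc/keys/sa.p12", "job.jar")))
```

Because the property then holds only a file name, the absolute location of the key on the submitting machine no longer has to exist on the workers. Whether the connector accepts a relative path in this way would need to be verified against the connector version in use.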

Issue Analytics

  • State: closed
  • Created 6 years ago
  • Reactions: 7
  • Comments:10 (4 by maintainers)

Top GitHub Comments

2 reactions
amarouni commented, May 9, 2017

@dennishuo It was really impractical when submitting Spark jobs from a Windows client to a Linux cluster: the Spark driver was running on Windows while the Spark cluster was hosted on Linux machines, so it was impossible to use the same credentials path on both Windows and Linux.

0 reactions
Sudip-Pandit commented, Apr 22, 2022

Hi Medb, do you have any ideas on connecting a GCP bucket to on-prem PySpark, so that I can get the bucket data on-prem?
