Accessing Google storage by Spark from outside Google cloud
Hi guys,
We’re trying to access Google Cloud Storage (writing to gs://…) from within a Spark on YARN job on a cluster that resides outside Google Cloud.
We have set up the correct service account and credentials but are still facing some issues:
The spark.hadoop.google.cloud.auth.service.account.keyfile property points to the credentials file on the Spark driver, but the executors (running on different servers) still try to read the same file path, which doesn’t exist there. We got it to work by placing the credentials file at exactly the same location on both the driver and the workers, but this is not practical and was only a temporary workaround.
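One commonly suggested alternative (not from this issue, just a sketch) is to let Spark ship the key file itself: spark.files copies a local file into every executor’s working directory on YARN, so the executor-side path can be just the file name. The file and bucket paths below are hypothetical, and the property names follow the google.cloud.auth.* generation of the GCS connector used here; newer connector releases use different key names.

# Sketch: distribute the service-account key with the job instead of
# pre-installing it on every worker. Paths and bucket are hypothetical.
from pyspark.sql import SparkSession

spark = (
    SparkSession.builder
    .appName("gcs-from-onprem")
    # copy the key into each executor's container working directory
    .config("spark.files", "/secure/path/on/driver/svc-account.json")
    # use service-account auth instead of the GCE metadata server
    .config("spark.hadoop.google.cloud.auth.service.account.enable", "true")
    # the relative name resolves in the executor working directory; the driver
    # still needs the file in its own working directory (the spark-submit
    # launch directory), since spark.hadoop.* applies to both sides
    .config("spark.hadoop.google.cloud.auth.service.account.json.keyfile", "svc-account.json")
    .getOrCreate()
)

spark.range(10).write.parquet("gs://example-bucket/tmp/test")  # hypothetical bucket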
Is there any delegation-token-like mechanism by which the driver authenticates with Google Cloud and sends the resulting credential to the workers, so they don’t need the same key file at the exact same path?
We also tried uploading the credentials file (P12 or JSON) to the workers and setting:
spark.executorEnv.GOOGLE_APPLICATION_CREDENTIALS
or
spark.executor.extraJavaOptions
to the file path (different from the driver’s file path), but we’re getting:
java.io.IOException: Error getting access token from metadata server at: http://metadata/computeMetadata/v1/instance/service-accounts/default/token
at com.google.cloud.hadoop.util.CredentialFactory.getCredentialFromMetadataServiceAccount(CredentialFactory.java:87)
at com.google.cloud.hadoop.util.CredentialConfiguration.getCredential(CredentialConfiguration.java:68)
at com.google.cloud.hadoop.fs.gcs.GoogleHadoopFileSystemBase.configure(GoogleHadoopFileSystemBase.java:1319)
at com.google.cloud.hadoop.fs.gcs.GoogleHadoopFileSystemBase.initialize(GoogleHadoopFileSystemBase.java:549)
at com.google.cloud.hadoop.fs.gcs.GoogleHadoopFileSystemBase.initialize(GoogleHadoopFileSystemBase.java:512)
at org.apache.hadoop.fs.FileSystem.createFileSystem(FileSystem.java:2696)
at org.apache.hadoop.fs.FileSystem.access$200(FileSystem.java:94)
at org.apache.hadoop.fs.FileSystem$Cache.getInternal(FileSystem.java:2733)
at org.apache.hadoop.fs.FileSystem$Cache.get(FileSystem.java:2715)
at org.apache.hadoop.fs.FileSystem.get(FileSystem.java:382)
Caused by: java.net.UnknownHostException: metadata
at java.net.AbstractPlainSocketImpl.connect(AbstractPlainSocketImpl.java:184)
at java.net.SocksSocketImpl.connect(SocksSocketImpl.java:392)
at java.net.Socket.connect(Socket.java:589)
at sun.net.NetworkClient.doConnect(NetworkClient.java:175)
at sun.net.www.http.HttpClient.openServer(HttpClient.java:432)
at sun.net.www.http.HttpClient.openServer(HttpClient.java:527)
at sun.net.www.http.HttpClient.<init>(HttpClient.java:211)
at sun.net.www.http.HttpClient.New(HttpClient.java:308)
at sun.net.www.http.HttpClient.New(HttpClient.java:326)
at sun.net.www.protocol.http.HttpURLConnection.getNewHttpClient(HttpURLConnection.java:1169)
at sun.net.www.protocol.http.HttpURLConnection.plainConnect0(HttpURLConnection.java:1105)
at sun.net.www.protocol.http.HttpURLConnection.plainConnect(HttpURLConnection.java:999)
at sun.net.www.protocol.http.HttpURLConnection.connect(HttpURLConnection.java:933)
at com.google.api.client.http.javanet.NetHttpRequest.execute(NetHttpRequest.java:93)
at com.google.api.client.http.HttpRequest.execute(HttpRequest.java:965)
at com.google.api.client.googleapis.compute.ComputeCredential.executeRefreshToken(ComputeCredential.java:87)
at com.google.api.client.auth.oauth2.Credential.refreshToken(Credential.java:489)
at com.google.cloud.hadoop.util.CredentialFactory.getCredentialFromMetadataServiceAccount(CredentialFactory.java:85)
... 14 more
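For what it’s worth, the UnknownHostException: metadata here means the connector found no usable service-account configuration and fell back to the GCE metadata server, which is only reachable inside Google Cloud. The connector version in this trace appears to take its key file from the Hadoop configuration rather than from the GOOGLE_APPLICATION_CREDENTIALS environment variable or JVM options, so that route likely never reaches it. A quick way to rule out a simple path problem on the workers is a distributed existence check; this is only a diagnostic sketch, and the property name assumes the JSON-keyfile variant of the setting quoted above.

# Sketch: verify the configured key file actually exists on every executor.
import os
import socket
from pyspark.sql import SparkSession

spark = SparkSession.builder.getOrCreate()
sc = spark.sparkContext

# read the keyfile path the job was configured with (empty string if unset)
keyfile = sc.getConf().get(
    "spark.hadoop.google.cloud.auth.service.account.json.keyfile", "")

# run a few tasks and report (hostname, file-found) pairs
probe = (
    sc.parallelize(range(max(sc.defaultParallelism, 2)))
      .map(lambda _: (socket.gethostname(), os.path.exists(keyfile)))
      .distinct()
      .collect()
)
print(probe)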
Is there any documentation for this use case that we missed?
Thanks,
@dennishuo It was really impractical when submitting Spark jobs from a Windows client to a Linux cluster: the Spark driver was running on Windows while the Spark cluster was hosted on Linux machines, so it was impossible to use the same credentials path on both Windows and Linux.
Hi Medb, do you have any ideas on connecting PySpark running on-prem to a GCP bucket, so that I can read the bucket data on-prem?
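Not a definitive answer, but a minimal sketch of reading a GCS bucket from an on-prem PySpark job, assuming the gcs-connector shaded jar is available locally and a service-account JSON key exists at the same path on the driver and on every worker (or is shipped as sketched earlier in this thread). Jar path, key path, and bucket are hypothetical, and the google.cloud.auth.* property names match the connector generation discussed above; newer releases use different key names.

# Sketch: on-prem PySpark reading from gs:// with a service-account key.
from pyspark.sql import SparkSession

spark = (
    SparkSession.builder
    .appName("read-gcs-on-prem")
    # make the GCS connector jar available to the driver and executors
    .config("spark.jars", "/opt/jars/gcs-connector-hadoop2-latest.jar")
    # register the gs:// filesystem implementation
    .config("spark.hadoop.fs.gs.impl",
            "com.google.cloud.hadoop.fs.gcs.GoogleHadoopFileSystem")
    # authenticate with the service-account key, not the metadata server
    .config("spark.hadoop.google.cloud.auth.service.account.enable", "true")
    .config("spark.hadoop.google.cloud.auth.service.account.json.keyfile",
            "/etc/secrets/svc-account.json")
    .getOrCreate()
)

df = spark.read.parquet("gs://example-bucket/path/to/data")  # hypothetical bucket
df.show(5)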