Accessing Google storage by Spark from outside Google cloud
Hi guys,
We’re trying to access Google Cloud Storage (writing to gs://…) from within a Spark on YARN job on a cluster that resides outside Google Cloud.
We have set up the correct service account and credentials but are still facing some issues:
The spark.hadoop.google.cloud.auth.service.account.keyfile property points to the credentials file on the Spark driver, but the executors (running on different servers) still try to read the same file path, which doesn’t exist there. We got it to work by placing the credentials file at exactly the same location on both the driver and the workers, but this is not practical and was only a temporary workaround.
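One commonly suggested alternative (not from this issue, just a sketch) is to let Spark ship the key file itself: spark.files copies a local file into every executor’s working directory on YARN, so the executor-side path can be just the file name. The file and bucket paths below are hypothetical, and the property names follow the google.cloud.auth.* generation of the GCS connector used here; newer connector releases use different key names.

# Sketch: distribute the service-account key with the job instead of
# pre-installing it on every worker. Paths and bucket are hypothetical.
from pyspark.sql import SparkSession

spark = (
    SparkSession.builder
    .appName("gcs-from-onprem")
    # copy the key into each executor's container working directory
    .config("spark.files", "/secure/path/on/driver/svc-account.json")
    # use service-account auth instead of the GCE metadata server
    .config("spark.hadoop.google.cloud.auth.service.account.enable", "true")
    # the relative name resolves in the executor working directory; the driver
    # still needs the file in its own working directory (the spark-submit
    # launch directory), since spark.hadoop.* applies to both sides
    .config("spark.hadoop.google.cloud.auth.service.account.json.keyfile", "svc-account.json")
    .getOrCreate()
)

spark.range(10).write.parquet("gs://example-bucket/tmp/test")  # hypothetical bucket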
Is there any delegation-token-like mechanism by which the driver authenticates with Google Cloud and sends the resulting credential to the workers, so they don’t need the same key file at the exact same path?
We also tried uploading the credentials file (P12 or JSON) to the workers and setting:
spark.executorEnv.GOOGLE_APPLICATION_CREDENTIALS
or
spark.executor.extraJavaOptions
to the file path (different from the driver’s file path), but we’re getting:
java.io.IOException: Error getting access token from metadata server at: http://metadata/computeMetadata/v1/instance/service-accounts/default/token
at com.google.cloud.hadoop.util.CredentialFactory.getCredentialFromMetadataServiceAccount(CredentialFactory.java:87)
at com.google.cloud.hadoop.util.CredentialConfiguration.getCredential(CredentialConfiguration.java:68)
at com.google.cloud.hadoop.fs.gcs.GoogleHadoopFileSystemBase.configure(GoogleHadoopFileSystemBase.java:1319)
at com.google.cloud.hadoop.fs.gcs.GoogleHadoopFileSystemBase.initialize(GoogleHadoopFileSystemBase.java:549)
at com.google.cloud.hadoop.fs.gcs.GoogleHadoopFileSystemBase.initialize(GoogleHadoopFileSystemBase.java:512)
at org.apache.hadoop.fs.FileSystem.createFileSystem(FileSystem.java:2696)
at org.apache.hadoop.fs.FileSystem.access$200(FileSystem.java:94)
at org.apache.hadoop.fs.FileSystem$Cache.getInternal(FileSystem.java:2733)
at org.apache.hadoop.fs.FileSystem$Cache.get(FileSystem.java:2715)
at org.apache.hadoop.fs.FileSystem.get(FileSystem.java:382)
Caused by: java.net.UnknownHostException: metadata
at java.net.AbstractPlainSocketImpl.connect(AbstractPlainSocketImpl.java:184)
at java.net.SocksSocketImpl.connect(SocksSocketImpl.java:392)
at java.net.Socket.connect(Socket.java:589)
at sun.net.NetworkClient.doConnect(NetworkClient.java:175)
at sun.net.www.http.HttpClient.openServer(HttpClient.java:432)
at sun.net.www.http.HttpClient.openServer(HttpClient.java:527)
at sun.net.www.http.HttpClient.<init>(HttpClient.java:211)
at sun.net.www.http.HttpClient.New(HttpClient.java:308)
at sun.net.www.http.HttpClient.New(HttpClient.java:326)
at sun.net.www.protocol.http.HttpURLConnection.getNewHttpClient(HttpURLConnection.java:1169)
at sun.net.www.protocol.http.HttpURLConnection.plainConnect0(HttpURLConnection.java:1105)
at sun.net.www.protocol.http.HttpURLConnection.plainConnect(HttpURLConnection.java:999)
at sun.net.www.protocol.http.HttpURLConnection.connect(HttpURLConnection.java:933)
at com.google.api.client.http.javanet.NetHttpRequest.execute(NetHttpRequest.java:93)
at com.google.api.client.http.HttpRequest.execute(HttpRequest.java:965)
at com.google.api.client.googleapis.compute.ComputeCredential.executeRefreshToken(ComputeCredential.java:87)
at com.google.api.client.auth.oauth2.Credential.refreshToken(Credential.java:489)
at com.google.cloud.hadoop.util.CredentialFactory.getCredentialFromMetadataServiceAccount(CredentialFactory.java:85)
... 14 more
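For what it’s worth, the UnknownHostException: metadata here means the connector found no usable service-account configuration and fell back to the GCE metadata server, which is only reachable inside Google Cloud. The connector version in this trace appears to take its key file from the Hadoop configuration rather than from the GOOGLE_APPLICATION_CREDENTIALS environment variable or JVM options, so that route likely never reaches it. A quick way to rule out a simple path problem on the workers is a distributed existence check; this is only a diagnostic sketch, and the property name assumes the JSON-keyfile variant of the setting quoted above.

# Sketch: verify the configured key file actually exists on every executor.
import os
import socket
from pyspark.sql import SparkSession

spark = SparkSession.builder.getOrCreate()
sc = spark.sparkContext

# read the keyfile path the job was configured with (empty string if unset)
keyfile = sc.getConf().get(
    "spark.hadoop.google.cloud.auth.service.account.json.keyfile", "")

# run a few tasks and report (hostname, file-found) pairs
probe = (
    sc.parallelize(range(max(sc.defaultParallelism, 2)))
      .map(lambda _: (socket.gethostname(), os.path.exists(keyfile)))
      .distinct()
      .collect()
)
print(probe)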
Is there any documentation for this use case that we missed?
Thanks,
@dennishuo It was really impractical when submitting Spark jobs from a Windows client to a Linux cluster: the Spark driver was running on Windows while the Spark cluster was hosted on Linux machines, so it was impossible to use the same credentials path on both Windows and Linux.
Hi Medb, do you have any ideas on connecting PySpark running on-prem to a GCP bucket, so that I can read the bucket data on-prem?
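Not a definitive answer, but a minimal sketch of reading a GCS bucket from an on-prem PySpark job, assuming the gcs-connector shaded jar is available locally and a service-account JSON key exists at the same path on the driver and on every worker (or is shipped as sketched earlier in this thread). Jar path, key path, and bucket are hypothetical, and the google.cloud.auth.* property names match the connector generation discussed above; newer releases use different key names.

# Sketch: on-prem PySpark reading from gs:// with a service-account key.
from pyspark.sql import SparkSession

spark = (
    SparkSession.builder
    .appName("read-gcs-on-prem")
    # make the GCS connector jar available to the driver and executors
    .config("spark.jars", "/opt/jars/gcs-connector-hadoop2-latest.jar")
    # register the gs:// filesystem implementation
    .config("spark.hadoop.fs.gs.impl",
            "com.google.cloud.hadoop.fs.gcs.GoogleHadoopFileSystem")
    # authenticate with the service-account key, not the metadata server
    .config("spark.hadoop.google.cloud.auth.service.account.enable", "true")
    .config("spark.hadoop.google.cloud.auth.service.account.json.keyfile",
            "/etc/secrets/svc-account.json")
    .getOrCreate()
)

df = spark.read.parquet("gs://example-bucket/path/to/data")  # hypothetical bucket
df.show(5)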