Cannot write Spark DataFrame as Parquet to GCS (the DataFrame was read from a Hive table)
Hi all, we would like to write a DataFrame to GCS with a Spark job running on a YARN cluster outside GCP. As long as we read the DataFrame as a Parquet file from an HDFS path, the write works. However, when we read the same data using spark.sql and select from the respective Hive table, we encounter the following exception:
py4j.protocol.Py4JJavaError: An error occurred while calling o97.parquet.
: java.lang.ClassCastException: org.apache.hadoop.fs.FsUrlConnection cannot be cast to java.net.HttpURLConnection
at com.google.cloud.hadoop.repackaged.gcs.com.google.api.client.http.javanet.DefaultConnectionFactory.openConnection(DefaultConnectionFactory.java:31)
at com.google.cloud.hadoop.repackaged.gcs.com.google.api.client.http.javanet.NetHttpTransport.buildRequest(NetHttpTransport.java:150)
at com.google.cloud.hadoop.repackaged.gcs.com.google.api.client.http.javanet.NetHttpTransport.buildRequest(NetHttpTransport.java:55)
at com.google.cloud.hadoop.repackaged.gcs.com.google.api.client.http.HttpRequest.execute(HttpRequest.java:871)
at com.google.cloud.hadoop.repackaged.gcs.com.google.api.client.auth.oauth2.TokenRequest.executeUnparsed(TokenRequest.java:322)
at com.google.cloud.hadoop.repackaged.gcs.com.google.api.client.auth.oauth2.TokenRequest.execute(TokenRequest.java:346)
at com.google.cloud.hadoop.repackaged.gcs.com.google.cloud.hadoop.util.CredentialFactory$GoogleCredentialWithRetry.executeRefreshToken(CredentialFactory.java:165)
at com.google.cloud.hadoop.repackaged.gcs.com.google.api.client.auth.oauth2.Credential.refreshToken(Credential.java:494)
at com.google.cloud.hadoop.repackaged.gcs.com.google.api.client.auth.oauth2.Credential.intercept(Credential.java:217)
at com.google.cloud.hadoop.repackaged.gcs.com.google.api.client.http.HttpRequest.execute(HttpRequest.java:862)
at com.google.cloud.hadoop.repackaged.gcs.com.google.api.client.googleapis.services.AbstractGoogleClientRequest.executeUnparsed(AbstractGoogleClientRequest.java:549)
at com.google.cloud.hadoop.repackaged.gcs.com.google.api.client.googleapis.services.AbstractGoogleClientRequest.executeUnparsed(AbstractGoogleClientRequest.java:482)
at com.google.cloud.hadoop.repackaged.gcs.com.google.api.client.googleapis.services.AbstractGoogleClientRequest.execute(AbstractGoogleClientRequest.java:599)
at com.google.cloud.hadoop.repackaged.gcs.com.google.cloud.hadoop.gcsio.GoogleCloudStorageImpl.getObject(GoogleCloudStorageImpl.java:1905)
at com.google.cloud.hadoop.repackaged.gcs.com.google.cloud.hadoop.gcsio.GoogleCloudStorageImpl.getItemInfo(GoogleCloudStorageImpl.java:1813)
at com.google.cloud.hadoop.repackaged.gcs.com.google.cloud.hadoop.gcsio.GoogleCloudStorageFileSystem.getFileInfoInternal(GoogleCloudStorageFileSystem.java:1127)
at com.google.cloud.hadoop.repackaged.gcs.com.google.cloud.hadoop.gcsio.GoogleCloudStorageFileSystem.getFileInfo(GoogleCloudStorageFileSystem.java:1095)
at com.google.cloud.hadoop.fs.gcs.GoogleHadoopFileSystemBase.getFileStatus(GoogleHadoopFileSystemBase.java:1038)
at org.apache.hadoop.fs.FileSystem.exists(FileSystem.java:1418)
at org.apache.spark.sql.execution.datasources.InsertIntoHadoopFsRelationCommand.run(InsertIntoHadoopFsRelationCommand.scala:93)
at org.apache.spark.sql.execution.command.DataWritingCommandExec.sideEffectResult$lzycompute(commands.scala:104)
at org.apache.spark.sql.execution.command.DataWritingCommandExec.sideEffectResult(commands.scala:102)
at org.apache.spark.sql.execution.command.DataWritingCommandExec.doExecute(commands.scala:122)
at org.apache.spark.sql.execution.SparkPlan$$anonfun$execute$1.apply(SparkPlan.scala:131)
at org.apache.spark.sql.execution.SparkPlan$$anonfun$execute$1.apply(SparkPlan.scala:127)
at org.apache.spark.sql.execution.SparkPlan$$anonfun$executeQuery$1.apply(SparkPlan.scala:155)
at org.apache.spark.rdd.RDDOperationScope$.withScope(RDDOperationScope.scala:151)
at org.apache.spark.sql.execution.SparkPlan.executeQuery(SparkPlan.scala:152)
at org.apache.spark.sql.execution.SparkPlan.execute(SparkPlan.scala:127)
at org.apache.spark.sql.execution.QueryExecution.toRdd$lzycompute(QueryExecution.scala:80)
at org.apache.spark.sql.execution.QueryExecution.toRdd(QueryExecution.scala:80)
at org.apache.spark.sql.DataFrameWriter$$anonfun$runCommand$1.apply(DataFrameWriter.scala:676)
at org.apache.spark.sql.DataFrameWriter$$anonfun$runCommand$1.apply(DataFrameWriter.scala:676)
at org.apache.spark.sql.execution.SQLExecution$$anonfun$withNewExecutionId$1.apply(SQLExecution.scala:78)
at org.apache.spark.sql.execution.SQLExecution$.withSQLConfPropagated(SQLExecution.scala:125)
at org.apache.spark.sql.execution.SQLExecution$.withNewExecutionId(SQLExecution.scala:73)
at org.apache.spark.sql.DataFrameWriter.runCommand(DataFrameWriter.scala:676)
at org.apache.spark.sql.DataFrameWriter.saveToV1Source(DataFrameWriter.scala:285)
at org.apache.spark.sql.DataFrameWriter.save(DataFrameWriter.scala:271)
at org.apache.spark.sql.DataFrameWriter.save(DataFrameWriter.scala:229)
at org.apache.spark.sql.DataFrameWriter.parquet(DataFrameWriter.scala:566)
at sun.reflect.NativeMethodAccessorImpl.invoke0(Native Method)
at sun.reflect.NativeMethodAccessorImpl.invoke(NativeMethodAccessorImpl.java:62)
at sun.reflect.DelegatingMethodAccessorImpl.invoke(DelegatingMethodAccessorImpl.java:43)
at java.lang.reflect.Method.invoke(Method.java:498)
at py4j.reflection.MethodInvoker.invoke(MethodInvoker.java:244)
at py4j.reflection.ReflectionEngine.invoke(ReflectionEngine.java:357)
at py4j.Gateway.invoke(Gateway.java:282)
at py4j.commands.AbstractCommand.invokeMethod(AbstractCommand.java:132)
at py4j.commands.CallCommand.execute(CallCommand.java:79)
at py4j.GatewayConnection.run(GatewayConnection.java:238)
at java.lang.Thread.run(Thread.java:748)
We assume that our Hadoop configuration is set correctly, since the first write path (reading from HDFS) works:
sc._jsc.hadoopConfiguration().set("fs.gs.impl", "com.google.cloud.hadoop.fs.gcs.GoogleHadoopFileSystem")
sc._jsc.hadoopConfiguration().set("fs.AbstractFileSystem.gs.impl", "com.google.cloud.hadoop.fs.gcs.GoogleHadoopFS")
sc._jsc.hadoopConfiguration().set("fs.gs.project.id", "")
sc._jsc.hadoopConfiguration().set("google.cloud.auth.service.account.enable", "true")
sc._jsc.hadoopConfiguration().set("google.cloud.auth.service.account.json.keyfile", keyfile)
Could anyone shed some light on this for us, please?
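
For reference, a minimal PySpark sketch of the two code paths described above; the HDFS path, table name, and bucket are placeholders rather than values from the original job:

from pyspark.sql import SparkSession

# Hive support is needed so that spark.sql() can resolve the table.
spark = SparkSession.builder.enableHiveSupport().getOrCreate()
sc = spark.sparkContext

# Path 1: read the Parquet files directly from HDFS; this write succeeds.
df_hdfs = spark.read.parquet("hdfs:///warehouse/schema.db/table")
df_hdfs.write.parquet("gs://bucket/output/from_hdfs")

# Path 2: read the same data through the Hive metastore; the write then
# fails with the ClassCastException shown above.
df_hive = spark.sql("select * from schema.table")
df_hive.write.parquet("gs://bucket/output/from_hive")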
Could you try whether it is reproducible with the Apache transport? For this you need to set the fs.gs.http.transport.type=APACHE Hadoop property.
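
For illustration, a minimal sketch of setting that property in the same style as the configuration in the original post (this snippet is not from the thread itself):

# Use the Apache HTTP transport instead of the default java.net transport.
sc._jsc.hadoopConfiguration().set("fs.gs.http.transport.type", "APACHE")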
Hi @medb, really sorry for the late reply. It was just a normal spark.sql("select * from schema.table").write.parquet("gs://path"). I don't know if it helps.
From my side, the issue can be closed. Thanks for your quick support!