
Cannot write Spark DataFrame as parquet to gcs (the DataFrame was read from a Hive Table)

See original GitHub issue

Hi all, we would like to write a DataFrame to GCS with a Spark job running on a YARN cluster outside GCP. As long as we read the DataFrame as a Parquet file from an HDFS path, the write works. However, when we read the same data via spark.sql, selecting from the corresponding Hive table, we encounter the following exception:

py4j.protocol.Py4JJavaError: An error occurred while calling o97.parquet.
: java.lang.ClassCastException: org.apache.hadoop.fs.FsUrlConnection cannot be cast to java.net.HttpURLConnection
	at com.google.cloud.hadoop.repackaged.gcs.com.google.api.client.http.javanet.DefaultConnectionFactory.openConnection(DefaultConnectionFactory.java:31)
	at com.google.cloud.hadoop.repackaged.gcs.com.google.api.client.http.javanet.NetHttpTransport.buildRequest(NetHttpTransport.java:150)
	at com.google.cloud.hadoop.repackaged.gcs.com.google.api.client.http.javanet.NetHttpTransport.buildRequest(NetHttpTransport.java:55)
	at com.google.cloud.hadoop.repackaged.gcs.com.google.api.client.http.HttpRequest.execute(HttpRequest.java:871)
	at com.google.cloud.hadoop.repackaged.gcs.com.google.api.client.auth.oauth2.TokenRequest.executeUnparsed(TokenRequest.java:322)
	at com.google.cloud.hadoop.repackaged.gcs.com.google.api.client.auth.oauth2.TokenRequest.execute(TokenRequest.java:346)
	at com.google.cloud.hadoop.repackaged.gcs.com.google.cloud.hadoop.util.CredentialFactory$GoogleCredentialWithRetry.executeRefreshToken(CredentialFactory.java:165)
	at com.google.cloud.hadoop.repackaged.gcs.com.google.api.client.auth.oauth2.Credential.refreshToken(Credential.java:494)
	at com.google.cloud.hadoop.repackaged.gcs.com.google.api.client.auth.oauth2.Credential.intercept(Credential.java:217)
	at com.google.cloud.hadoop.repackaged.gcs.com.google.api.client.http.HttpRequest.execute(HttpRequest.java:862)
	at com.google.cloud.hadoop.repackaged.gcs.com.google.api.client.googleapis.services.AbstractGoogleClientRequest.executeUnparsed(AbstractGoogleClientRequest.java:549)
	at com.google.cloud.hadoop.repackaged.gcs.com.google.api.client.googleapis.services.AbstractGoogleClientRequest.executeUnparsed(AbstractGoogleClientRequest.java:482)
	at com.google.cloud.hadoop.repackaged.gcs.com.google.api.client.googleapis.services.AbstractGoogleClientRequest.execute(AbstractGoogleClientRequest.java:599)
	at com.google.cloud.hadoop.repackaged.gcs.com.google.cloud.hadoop.gcsio.GoogleCloudStorageImpl.getObject(GoogleCloudStorageImpl.java:1905)
	at com.google.cloud.hadoop.repackaged.gcs.com.google.cloud.hadoop.gcsio.GoogleCloudStorageImpl.getItemInfo(GoogleCloudStorageImpl.java:1813)
	at com.google.cloud.hadoop.repackaged.gcs.com.google.cloud.hadoop.gcsio.GoogleCloudStorageFileSystem.getFileInfoInternal(GoogleCloudStorageFileSystem.java:1127)
	at com.google.cloud.hadoop.repackaged.gcs.com.google.cloud.hadoop.gcsio.GoogleCloudStorageFileSystem.getFileInfo(GoogleCloudStorageFileSystem.java:1095)
	at com.google.cloud.hadoop.fs.gcs.GoogleHadoopFileSystemBase.getFileStatus(GoogleHadoopFileSystemBase.java:1038)
	at org.apache.hadoop.fs.FileSystem.exists(FileSystem.java:1418)
	at org.apache.spark.sql.execution.datasources.InsertIntoHadoopFsRelationCommand.run(InsertIntoHadoopFsRelationCommand.scala:93)
	at org.apache.spark.sql.execution.command.DataWritingCommandExec.sideEffectResult$lzycompute(commands.scala:104)
	at org.apache.spark.sql.execution.command.DataWritingCommandExec.sideEffectResult(commands.scala:102)
	at org.apache.spark.sql.execution.command.DataWritingCommandExec.doExecute(commands.scala:122)
	at org.apache.spark.sql.execution.SparkPlan$$anonfun$execute$1.apply(SparkPlan.scala:131)
	at org.apache.spark.sql.execution.SparkPlan$$anonfun$execute$1.apply(SparkPlan.scala:127)
	at org.apache.spark.sql.execution.SparkPlan$$anonfun$executeQuery$1.apply(SparkPlan.scala:155)
	at org.apache.spark.rdd.RDDOperationScope$.withScope(RDDOperationScope.scala:151)
	at org.apache.spark.sql.execution.SparkPlan.executeQuery(SparkPlan.scala:152)
	at org.apache.spark.sql.execution.SparkPlan.execute(SparkPlan.scala:127)
	at org.apache.spark.sql.execution.QueryExecution.toRdd$lzycompute(QueryExecution.scala:80)
	at org.apache.spark.sql.execution.QueryExecution.toRdd(QueryExecution.scala:80)
	at org.apache.spark.sql.DataFrameWriter$$anonfun$runCommand$1.apply(DataFrameWriter.scala:676)
	at org.apache.spark.sql.DataFrameWriter$$anonfun$runCommand$1.apply(DataFrameWriter.scala:676)
	at org.apache.spark.sql.execution.SQLExecution$$anonfun$withNewExecutionId$1.apply(SQLExecution.scala:78)
	at org.apache.spark.sql.execution.SQLExecution$.withSQLConfPropagated(SQLExecution.scala:125)
	at org.apache.spark.sql.execution.SQLExecution$.withNewExecutionId(SQLExecution.scala:73)
	at org.apache.spark.sql.DataFrameWriter.runCommand(DataFrameWriter.scala:676)
	at org.apache.spark.sql.DataFrameWriter.saveToV1Source(DataFrameWriter.scala:285)
	at org.apache.spark.sql.DataFrameWriter.save(DataFrameWriter.scala:271)
	at org.apache.spark.sql.DataFrameWriter.save(DataFrameWriter.scala:229)
	at org.apache.spark.sql.DataFrameWriter.parquet(DataFrameWriter.scala:566)
	at sun.reflect.NativeMethodAccessorImpl.invoke0(Native Method)
	at sun.reflect.NativeMethodAccessorImpl.invoke(NativeMethodAccessorImpl.java:62)
	at sun.reflect.DelegatingMethodAccessorImpl.invoke(DelegatingMethodAccessorImpl.java:43)
	at java.lang.reflect.Method.invoke(Method.java:498)
	at py4j.reflection.MethodInvoker.invoke(MethodInvoker.java:244)
	at py4j.reflection.ReflectionEngine.invoke(ReflectionEngine.java:357)
	at py4j.Gateway.invoke(Gateway.java:282)
	at py4j.commands.AbstractCommand.invokeMethod(AbstractCommand.java:132)
	at py4j.commands.CallCommand.execute(CallCommand.java:79)
	at py4j.GatewayConnection.run(GatewayConnection.java:238)
	at java.lang.Thread.run(Thread.java:748)

We assume that the Hadoop configuration is set correctly, since the first write method (reading Parquet from HDFS) works:

    sc._jsc.hadoopConfiguration().set("fs.gs.impl", "com.google.cloud.hadoop.fs.gcs.GoogleHadoopFileSystem")
    sc._jsc.hadoopConfiguration().set("fs.AbstractFileSystem.gs.impl", "com.google.cloud.hadoop.fs.gcs.GoogleHadoopFS")
    sc._jsc.hadoopConfiguration().set("fs.gs.project.id", "")
    sc._jsc.hadoopConfiguration().set("google.cloud.auth.service.account.enable", "true")
    sc._jsc.hadoopConfiguration().set("google.cloud.auth.service.account.json.keyfile", keyfile)

Could anyone shed some light on this for us, please?

Issue Analytics

  • State: closed
  • Created 3 years ago
  • Comments: 6 (3 by maintainers)

Top GitHub Comments

1 reaction
medb commented, Jul 1, 2020

Could you check whether it's reproducible with the Apache transport? For this you need to set the fs.gs.http.transport.type=APACHE Hadoop property.
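
In PySpark, that property can presumably be set the same way as the other connector options above (a sketch of the suggestion, not a verified fix):

    # Switch the GCS connector from the default java.net transport to the Apache HTTP transport
    sc._jsc.hadoopConfiguration().set("fs.gs.http.transport.type", "APACHE")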

0 reactions
3cham commented, Jul 16, 2020

Hi @medb, really sorry for the late reply. It was just a normal spark.sql("select * from schema.table").write.parquet("gs://path"). I don't know if that helps.
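
Spelled out as a small script, that reproduction would look roughly like this (a sketch; the table and bucket names are placeholders):

    from pyspark.sql import SparkSession

    # Hive support is needed so spark.sql can resolve the table from the metastore
    spark = SparkSession.builder.enableHiveSupport().getOrCreate()

    # The write to gs:// is what raises the ClassCastException
    spark.sql("select * from schema.table").write.parquet("gs://bucket/path")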

From my side the issue could be closed. Thanks for your quick support!

Read more comments on GitHub >

Top Results From Across the Web

Solved: Spark 2 Can't write dataframe to parquet table
I'm trying to write a dataframe to a parquet Hive table and keep getting an error saying that the table is HiveFileFormat and...

PySpark Read and Write Parquet File - Spark by {Examples}
First, create a PySpark DataFrame from a list of data using the spark.createDataFrame() method. In the above example, it creates a DataFrame with columns ...

Unable to write dataframe into Hive partitioned parquet table ...
I am trying to write my data frame into a partitioned Hive table. The Hive table format is Parquet, but I am unable to...

How to save a dataframe as a Parquet file using PySpark
Read the CSV file into a dataframe using the function spark.read.load(). Step 4: Call the method dataframe.write.parquet(), and pass ...

Use the BigQuery connector with Spark - Google Cloud
The spark-bigquery-connector is used with Apache Spark to read and write data ... This example reads data from BigQuery into a Spark DataFrame...
