Failure to write to BigQuery - org.apache.spark.SparkException: Task failed while writing rows
Hi everyone,
I'm currently trying to upload a Spark DataFrame as a table in BigQuery, following the installation instructions. I can load a table into a DataFrame without error, but I get an org.apache.spark.SparkException: Task failed while writing rows error when I try to upload the DataFrame to BigQuery as a table.
Packages and versions:
javacOptions ++= Seq("-source", "11", "-target", "11")
ThisBuild / scalaVersion := "2.11.10"
val sparkBigqueryVersion = "0.27.1"
val sparkVersion = "2.4.8"
...
libraryDependencies += "org.scala-lang" % "scala-library" % scalaVersion.value % "provided",
libraryDependencies += "org.apache.spark" %% "spark-core" % sparkVersion,
libraryDependencies += "org.apache.spark" %% "spark-sql" % sparkVersion,
libraryDependencies += "com.google.cloud.spark" %% "spark-bigquery-with-dependencies" % sparkBigqueryVersion,
libraryDependencies += "com.google.cloud.bigdataoss" % "gcs-connector" % "hadoop3-2.2.8",
dependencyOverrides += "com.google.guava" % "guava" % "30.1-jre",
libraryDependencies += "org.rogach" %% "scallop" % "4.1.0"`
Code to upload:
val spark = SparkSession
  .builder()
  .appName("AppName")
  .config("spark.master", "local")
  .config("spark.sql.broadcastTimeout", "36000")
  .config("temporaryGcsBucket", tempGsBucket)
  .config("credentialsFile", gsKeyFilePath)
  .config("parentProject", gsProject)
  .getOrCreate()
spark.sparkContext.hadoopConfiguration.set("fs.AbstractFileSystem.gs.impl", "com.google.cloud.hadoop.fs.gcs.GoogleHadoopFS")
spark.sparkContext.hadoopConfiguration.setBoolean("google.cloud.auth.service.account.enable", true)
spark.sparkContext.hadoopConfiguration.set("google.cloud.auth.service.account.json.keyfile", gsKeyFilePath)
spark.sparkContext.hadoopConfiguration.set("fs.gs.project.id", gsProject)
// Load the BigQuery table into a Spark DataFrame
val dataFrame = spark.read.format("bigquery")
  .option("table", s"$bqDataset.$bqTable")
  .load()
  .cache()

// Save the Spark DataFrame back to BigQuery as a table
dataFrame.write.format("bigquery")
  .option("table", s"$bqDataset.$bqTable")
  .mode(SaveMode.Overwrite)
  .save()
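As I understand it, with the connector's default (indirect) write method the DataFrame is first staged as Parquet files in the temporaryGcsBucket and then loaded into BigQuery from there, which matches the gs:// Parquet path in the logs below. For reference, a sketch of the same write with the staging options set on the writer itself rather than in the session config (option names as documented for the connector; untested on my side):

// Indirect write with per-write staging options (illustration only).
// tempGsBucket is the same variable used in the session config above.
dataFrame.write.format("bigquery")
  .option("table", s"$bqDataset.$bqTable")
  .option("temporaryGcsBucket", tempGsBucket) // GCS bucket used to stage the data
  .option("intermediateFormat", "parquet")    // staging format; parquet is the default
  .mode(SaveMode.Overwrite)
  .save()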
Logs:
[error] com.google.cloud.bigquery.connector.common.BigQueryConnectorException: Failed to write to BigQuery
[error] at com.google.cloud.spark.bigquery.write.BigQueryWriteHelper.writeDataFrameToBigQuery(BigQueryWriteHelper.java:110)
[error] at com.google.cloud.spark.bigquery.write.BigQueryDeprecatedIndirectInsertableRelation.insert(BigQueryDeprecatedIndirectInsertableRelation.java:43)
[error] at com.google.cloud.spark.bigquery.write.CreatableRelationProviderHelper.createRelation(CreatableRelationProviderHelper.java:54)
[error] at com.google.cloud.spark.bigquery.BigQueryRelationProvider.createRelation(BigQueryRelationProvider.scala:106)
[error] at org.apache.spark.sql.execution.datasources.SaveIntoDataSourceCommand.run(SaveIntoDataSourceCommand.scala:45)
[error] at org.apache.spark.sql.execution.command.ExecutedCommandExec.sideEffectResult$lzycompute(commands.scala:70)
[error] at org.apache.spark.sql.execution.command.ExecutedCommandExec.sideEffectResult(commands.scala:68)
[error] at org.apache.spark.sql.execution.command.ExecutedCommandExec.doExecute(commands.scala:86)
[error] at org.apache.spark.sql.execution.SparkPlan$$anonfun$execute$1.apply(SparkPlan.scala:136)
[error] at org.apache.spark.sql.execution.SparkPlan$$anonfun$execute$1.apply(SparkPlan.scala:132)
[error] at org.apache.spark.sql.execution.SparkPlan$$anonfun$executeQuery$1.apply(SparkPlan.scala:160)
[error] at org.apache.spark.rdd.RDDOperationScope$.withScope(RDDOperationScope.scala:151)
[error] at org.apache.spark.sql.execution.SparkPlan.executeQuery(SparkPlan.scala:157)
[error] at org.apache.spark.sql.execution.SparkPlan.execute(SparkPlan.scala:132)
[error] at org.apache.spark.sql.execution.QueryExecution.toRdd$lzycompute(QueryExecution.scala:83)
[error] at org.apache.spark.sql.execution.QueryExecution.toRdd(QueryExecution.scala:81)
[error] at org.apache.spark.sql.DataFrameWriter$$anonfun$runCommand$1.apply(DataFrameWriter.scala:696)
[error] at org.apache.spark.sql.DataFrameWriter$$anonfun$runCommand$1.apply(DataFrameWriter.scala:696)
[error] at org.apache.spark.sql.execution.SQLExecution$$anonfun$withNewExecutionId$1.apply(SQLExecution.scala:80)
[error] at org.apache.spark.sql.execution.SQLExecution$.withSQLConfPropagated(SQLExecution.scala:127)
[error] at org.apache.spark.sql.execution.SQLExecution$.withNewExecutionId(SQLExecution.scala:75)
[error] at org.apache.spark.sql.DataFrameWriter.runCommand(DataFrameWriter.scala:696)
[error] at org.apache.spark.sql.DataFrameWriter.saveToV1Source(DataFrameWriter.scala:305)
[error] at org.apache.spark.sql.DataFrameWriter.save(DataFrameWriter.scala:291)
[error] at helpers.Helpers$class.saveDataFrameToTable(helpers.scala:65)
[error] at TransformColumn$.saveDataFrameToTable(transformColumn.scala:5)
[error] at TransformColumn$.combined(transformColumn.scala:117)
[error] at TransformColumn$.main(transformColumn.scala:69)
[error] at TransformColumn.main(transformColumn.scala)
[error] at java.base/jdk.internal.reflect.NativeMethodAccessorImpl.invoke0(Native Method)
[error] at java.base/jdk.internal.reflect.NativeMethodAccessorImpl.invoke(NativeMethodAccessorImpl.java:62)
[error] at java.base/jdk.internal.reflect.DelegatingMethodAccessorImpl.invoke(DelegatingMethodAccessorImpl.java:43)
[error] at java.base/java.lang.reflect.Method.invoke(Method.java:566)
[error] Caused by: org.apache.spark.SparkException: Job aborted.
[error] at org.apache.spark.sql.execution.datasources.FileFormatWriter$.write(FileFormatWriter.scala:202)
[error] at org.apache.spark.sql.execution.datasources.InsertIntoHadoopFsRelationCommand.run(InsertIntoHadoopFsRelationCommand.scala:159)
[error] at org.apache.spark.sql.execution.command.DataWritingCommandExec.sideEffectResult$lzycompute(commands.scala:104)
[error] at org.apache.spark.sql.execution.command.DataWritingCommandExec.sideEffectResult(commands.scala:102)
[error] at org.apache.spark.sql.execution.command.DataWritingCommandExec.doExecute(commands.scala:122)
[error] at org.apache.spark.sql.execution.SparkPlan$$anonfun$execute$1.apply(SparkPlan.scala:136)
[error] at org.apache.spark.sql.execution.SparkPlan$$anonfun$execute$1.apply(SparkPlan.scala:132)
[error] at org.apache.spark.sql.execution.SparkPlan$$anonfun$executeQuery$1.apply(SparkPlan.scala:160)
[error] at org.apache.spark.rdd.RDDOperationScope$.withScope(RDDOperationScope.scala:151)
[error] at org.apache.spark.sql.execution.SparkPlan.executeQuery(SparkPlan.scala:157)
[error] at org.apache.spark.sql.execution.SparkPlan.execute(SparkPlan.scala:132)
[error] at org.apache.spark.sql.execution.QueryExecution.toRdd$lzycompute(QueryExecution.scala:83)
[error] at org.apache.spark.sql.execution.QueryExecution.toRdd(QueryExecution.scala:81)
[error] at org.apache.spark.sql.DataFrameWriter$$anonfun$runCommand$1.apply(DataFrameWriter.scala:696)
[error] at org.apache.spark.sql.DataFrameWriter$$anonfun$runCommand$1.apply(DataFrameWriter.scala:696)
[error] at org.apache.spark.sql.execution.SQLExecution$$anonfun$withNewExecutionId$1.apply(SQLExecution.scala:80)
[error] at org.apache.spark.sql.execution.SQLExecution$.withSQLConfPropagated(SQLExecution.scala:127)
[error] at org.apache.spark.sql.execution.SQLExecution$.withNewExecutionId(SQLExecution.scala:75)
[error] at org.apache.spark.sql.DataFrameWriter.runCommand(DataFrameWriter.scala:696)
[error] at org.apache.spark.sql.DataFrameWriter.saveToV1Source(DataFrameWriter.scala:305)
[error] at org.apache.spark.sql.DataFrameWriter.save(DataFrameWriter.scala:291)
[error] at org.apache.spark.sql.DataFrameWriter.save(DataFrameWriter.scala:249)
[error] at com.google.cloud.spark.bigquery.write.BigQueryWriteHelper.writeDataFrameToBigQuery(BigQueryWriteHelper.java:105)
[error] at com.google.cloud.spark.bigquery.write.BigQueryDeprecatedIndirectInsertableRelation.insert(BigQueryDeprecatedIndirectInsertableRelation.java:43)
[error] at com.google.cloud.spark.bigquery.write.CreatableRelationProviderHelper.createRelation(CreatableRelationProviderHelper.java:54)
[error] at com.google.cloud.spark.bigquery.BigQueryRelationProvider.createRelation(BigQueryRelationProvider.scala:106)
[error] at org.apache.spark.sql.execution.datasources.SaveIntoDataSourceCommand.run(SaveIntoDataSourceCommand.scala:45)
[error] at org.apache.spark.sql.execution.command.ExecutedCommandExec.sideEffectResult$lzycompute(commands.scala:70)
[error] at org.apache.spark.sql.execution.command.ExecutedCommandExec.sideEffectResult(commands.scala:68)
[error] at org.apache.spark.sql.execution.command.ExecutedCommandExec.doExecute(commands.scala:86)
[error] at org.apache.spark.sql.execution.SparkPlan$$anonfun$execute$1.apply(SparkPlan.scala:136)
[error] at org.apache.spark.sql.execution.SparkPlan$$anonfun$execute$1.apply(SparkPlan.scala:132)
[error] at org.apache.spark.sql.execution.SparkPlan$$anonfun$executeQuery$1.apply(SparkPlan.scala:160)
[error] at org.apache.spark.rdd.RDDOperationScope$.withScope(RDDOperationScope.scala:151)
[error] at org.apache.spark.sql.execution.SparkPlan.executeQuery(SparkPlan.scala:157)
[error] at org.apache.spark.sql.execution.SparkPlan.execute(SparkPlan.scala:132)
[error] at org.apache.spark.sql.execution.QueryExecution.toRdd$lzycompute(QueryExecution.scala:83)
[error] at org.apache.spark.sql.execution.QueryExecution.toRdd(QueryExecution.scala:81)
[error] at org.apache.spark.sql.DataFrameWriter$$anonfun$runCommand$1.apply(DataFrameWriter.scala:696)
[error] at org.apache.spark.sql.DataFrameWriter$$anonfun$runCommand$1.apply(DataFrameWriter.scala:696)
[error] at org.apache.spark.sql.execution.SQLExecution$$anonfun$withNewExecutionId$1.apply(SQLExecution.scala:80)
[error] at org.apache.spark.sql.execution.SQLExecution$.withSQLConfPropagated(SQLExecution.scala:127)
[error] at org.apache.spark.sql.execution.SQLExecution$.withNewExecutionId(SQLExecution.scala:75)
[error] at org.apache.spark.sql.DataFrameWriter.runCommand(DataFrameWriter.scala:696)
[error] at org.apache.spark.sql.DataFrameWriter.saveToV1Source(DataFrameWriter.scala:305)
[error] at org.apache.spark.sql.DataFrameWriter.save(DataFrameWriter.scala:291)
[error] at helpers.Helpers$class.saveDataFrameToTable(helpers.scala:65)
[error] at TransformColumn$.saveDataFrameToTable(transformColumn.scala:5)
[error] at TransformColumn$.combined(transformColumn.scala:117)
[error] at TransformColumn$.main(transformColumn.scala:69)
[error] at TransformColumn.main(transformColumn.scala)
[error] at java.base/jdk.internal.reflect.NativeMethodAccessorImpl.invoke0(Native Method)
[error] at java.base/jdk.internal.reflect.NativeMethodAccessorImpl.invoke(NativeMethodAccessorImpl.java:62)
[error] at java.base/jdk.internal.reflect.DelegatingMethodAccessorImpl.invoke(DelegatingMethodAccessorImpl.java:43)
[error] at java.base/java.lang.reflect.Method.invoke(Method.java:566)
[error] Caused by: org.apache.spark.SparkException: Job aborted due to stage failure: Task 3 in stage 0.0 failed 1 times, most recent failure: Lost task 3.0 in stage 0.0 (TID 3, localhost, executor driver): org.apache.spark.SparkException: Task failed while writing rows.
[error] at org.apache.spark.sql.execution.datasources.FileFormatWriter$.org$apache$spark$sql$execution$datasources$FileFormatWriter$$executeTask(FileFormatWriter.scala:261)
[error] at org.apache.spark.sql.execution.datasources.FileFormatWriter$$anonfun$write$1.apply(FileFormatWriter.scala:174)
[error] at org.apache.spark.sql.execution.datasources.FileFormatWriter$$anonfun$write$1.apply(FileFormatWriter.scala:173)
[error] at org.apache.spark.scheduler.ResultTask.runTask(ResultTask.scala:90)
[error] at org.apache.spark.scheduler.Task.run(Task.scala:123)
[error] at org.apache.spark.executor.Executor$TaskRunner$$anonfun$10.apply(Executor.scala:411)
[error] at org.apache.spark.util.Utils$.tryWithSafeFinally(Utils.scala:1360)
[error] at org.apache.spark.executor.Executor$TaskRunner.run(Executor.scala:417)
[error] at java.base/java.util.concurrent.ThreadPoolExecutor.runWorker(ThreadPoolExecutor.java:1128)
[error] at java.base/java.util.concurrent.ThreadPoolExecutor$Worker.run(ThreadPoolExecutor.java:628)
[error] at java.base/java.lang.Thread.run(Thread.java:829)
[error] Caused by: java.io.IOException: Failed to write 2707910 bytes in 'gs://gs-bucket/.spark-bigquery-local-1667736484732-4ffcecbc-47e4-4fce-b585-3ba03489deac/_temporary/0/_temporary/attempt_20221106120815_0000_m_000003_3/part-00003-015be3ad-69fa-4a5e-9ce8-070043126923-c000.snappy.parquet'
[error] at com.google.cloud.hadoop.util.BaseAbstractGoogleAsyncWriteChannel.write(BaseAbstractGoogleAsyncWriteChannel.java:136)
[error] at java.base/java.nio.channels.Channels.writeFullyImpl(Channels.java:74)
[error] at java.base/java.nio.channels.Channels.writeFully(Channels.java:97)
[error] at java.base/java.nio.channels.Channels$1.write(Channels.java:172)
[error] at java.base/java.io.BufferedOutputStream.flushBuffer(BufferedOutputStream.java:81)
[error] at java.base/java.io.BufferedOutputStream.write(BufferedOutputStream.java:127)
[error] at com.google.cloud.hadoop.fs.gcs.GoogleHadoopOutputStream.write(GoogleHadoopOutputStream.java:108)
[error] at org.apache.hadoop.fs.FSDataOutputStream$PositionCache.write(FSDataOutputStream.java:58)
[error] at java.base/java.io.DataOutputStream.write(DataOutputStream.java:107)
[error] at java.base/java.io.FilterOutputStream.write(FilterOutputStream.java:108)
[error] at org.apache.parquet.hadoop.util.HadoopPositionOutputStream.write(HadoopPositionOutputStream.java:45)
[error] at org.apache.parquet.bytes.ConcatenatingByteArrayCollector.writeAllTo(ConcatenatingByteArrayCollector.java:46)
[error] at org.apache.parquet.hadoop.ParquetFileWriter.writeDataPages(ParquetFileWriter.java:460)
[error] at org.apache.parquet.hadoop.ColumnChunkPageWriteStore$ColumnChunkPageWriter.writeToFileWriter(ColumnChunkPageWriteStore.java:201)
[error] at org.apache.parquet.hadoop.ColumnChunkPageWriteStore.flushToFileWriter(ColumnChunkPageWriteStore.java:261)
[error] at org.apache.parquet.hadoop.InternalParquetRecordWriter.flushRowGroupToStore(InternalParquetRecordWriter.java:173)
[error] at org.apache.parquet.hadoop.InternalParquetRecordWriter.checkBlockSizeReached(InternalParquetRecordWriter.java:148)
[error] at org.apache.parquet.hadoop.InternalParquetRecordWriter.write(InternalParquetRecordWriter.java:130)
[error] at org.apache.parquet.hadoop.ParquetRecordWriter.write(ParquetRecordWriter.java:182)
[error] at org.apache.parquet.hadoop.ParquetRecordWriter.write(ParquetRecordWriter.java:44)
[error] at org.apache.spark.sql.execution.datasources.parquet.ParquetOutputWriter.write(ParquetOutputWriter.scala:40)
[error] at org.apache.spark.sql.execution.datasources.SingleDirectoryDataWriter.write(FileFormatDataWriter.scala:137)
[error] at org.apache.spark.sql.execution.datasources.FileFormatWriter$$anonfun$org$apache$spark$sql$execution$datasources$FileFormatWriter$$executeTask$3.apply(FileFormatWriter.scala:249)
[error] at org.apache.spark.sql.execution.datasources.FileFormatWriter$$anonfun$org$apache$spark$sql$execution$datasources$FileFormatWriter$$executeTask$3.apply(FileFormatWriter.scala:246)
[error] at org.apache.spark.util.Utils$.tryWithSafeFinallyAndFailureCallbacks(Utils.scala:1394)
[error] at org.apache.spark.sql.execution.datasources.FileFormatWriter$.org$apache$spark$sql$execution$datasources$FileFormatWriter$$executeTask(FileFormatWriter.scala:252)
[error] ... 10 more
[error] Suppressed: java.io.IOException: The file being written is in an invalid state. Probably caused by an error thrown previously. Current state: COLUMN
[error] at org.apache.parquet.hadoop.ParquetFileWriter$STATE.error(ParquetFileWriter.java:168)
[error] at org.apache.parquet.hadoop.ParquetFileWriter$STATE.startBlock(ParquetFileWriter.java:160)
[error] at org.apache.parquet.hadoop.ParquetFileWriter.startBlock(ParquetFileWriter.java:291)
[error] at org.apache.parquet.hadoop.InternalParquetRecordWriter.flushRowGroupToStore(InternalParquetRecordWriter.java:171)
[error] at org.apache.parquet.hadoop.InternalParquetRecordWriter.close(InternalParquetRecordWriter.java:114)
[error] at org.apache.parquet.hadoop.ParquetRecordWriter.close(ParquetRecordWriter.java:165)
[error] at org.apache.spark.sql.execution.datasources.parquet.ParquetOutputWriter.close(ParquetOutputWriter.scala:42)
[error] at org.apache.spark.sql.execution.datasources.FileFormatDataWriter.releaseResources(FileFormatDataWriter.scala:57)
[error] at org.apache.spark.sql.execution.datasources.FileFormatDataWriter.abort(FileFormatDataWriter.scala:83)
[error] at org.apache.spark.sql.execution.datasources.FileFormatWriter$$anonfun$org$apache$spark$sql$execution$datasources$FileFormatWriter$$executeTask$1.apply$mcV$sp(FileFormatWriter.scala:254)
[error] at org.apache.spark.util.Utils$.tryWithSafeFinallyAndFailureCallbacks(Utils.scala:1403)
[error] ... 11 more
[error] Caused by: java.io.IOException: Pipe closed
[error] at java.base/java.io.PipedInputStream.checkStateForReceive(PipedInputStream.java:260)
[error] at java.base/java.io.PipedInputStream.receive(PipedInputStream.java:226)
[error] at java.base/java.io.PipedOutputStream.write(PipedOutputStream.java:149)
[error] at java.base/java.nio.channels.Channels$WritableByteChannelImpl.write(Channels.java:464)
[error] at com.google.cloud.hadoop.util.BaseAbstractGoogleAsyncWriteChannel.write(BaseAbstractGoogleAsyncWriteChannel.java:133)
[error] ... 35 more
I'm trying to understand the following error and what is causing it:
[error] Caused by: org.apache.spark.SparkException: Job aborted due to stage failure: Task 3 in stage 0.0 failed 1 times, most recent failure: Lost task 3.0 in stage 0.0 (TID 3, localhost, executor driver): org.apache.spark.SparkException: Task failed while writing rows.
Supporting information:
- Testing locally
- The DataFrame in question has 1000 partitions
- Checked that the service account has the correct permissions to upload to BigQuery and Cloud Storage
I'm trying to understand whether this is caused by an error in my installation and/or my implementation.
Would appreciate any help!
Fixed by doing the following:
.option("writeMethod", "direct")
when writing to BQ.
@davidrabinowitz - I'm assuming that to force the direct method I can add the following code snippet:
.option("writeMethod", "direct")
when writing to BQ.
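For completeness, a minimal sketch of the fix as I read it (untested; it assumes the direct write method, which streams rows through the BigQuery Storage Write API and skips the failing GCS staging step, is available in connector 0.27.1):

// Direct write: no temporary GCS bucket or intermediate Parquet files involved.
// Assumes SaveMode.Overwrite is supported for the direct method in this
// connector version; if not, use SaveMode.Append.
dataFrame.write.format("bigquery")
  .option("table", s"$bqDataset.$bqTable")
  .option("writeMethod", "direct")
  .mode(SaveMode.Overwrite)
  .save()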