
Failure to write to bigquery - org.apache.spark.SparkException: Task failed while writing rows


Hi Everyone,

I’m currently trying to upload a Spark DataFrame as a table in BigQuery, and I’ve followed the installation instructions. I can load a table into a DataFrame without error, but I get an org.apache.spark.SparkException: Task failed while writing rows error when I try to upload the DataFrame to BigQuery as a table.

Packages and versions:

    javacOptions ++= Seq("-source", "11", "-target", "11")
    ThisBuild / scalaVersion := "2.11.10"
    val sparkBigqueryVersion = "0.27.1"
    val sparkVersion = "2.4.8"
    ...
    libraryDependencies += "org.scala-lang" % "scala-library" % scalaVersion.value % "provided",
    libraryDependencies += "org.apache.spark" %% "spark-core" % sparkVersion,
    libraryDependencies += "org.apache.spark" %% "spark-sql" % sparkVersion,
    libraryDependencies += "com.google.cloud.spark" %% "spark-bigquery-with-dependencies" % sparkBigqueryVersion,
    libraryDependencies += "com.google.cloud.bigdataoss" % "gcs-connector" % "hadoop3-2.2.8",
    dependencyOverrides += "com.google.guava" % "guava" % "30.1-jre",
    libraryDependencies += "org.rogach" %% "scallop" % "4.1.0"`

Code to upload:

    import org.apache.spark.sql.{SaveMode, SparkSession}

    val spark = SparkSession
        .builder()
        .appName("AppName")
        .config("spark.master", "local")
        .config("spark.sql.broadcastTimeout", "36000")
        .config("temporaryGcsBucket", tempGsBucket)
        .config("credentialsFile", gsKeyFilePath)
        .config("parentProject", gsProject)
        .getOrCreate()

    spark.sparkContext.hadoopConfiguration.set("fs.AbstractFileSystem.gs.impl", "com.google.cloud.hadoop.fs.gcs.GoogleHadoopFS")
    spark.sparkContext.hadoopConfiguration.setBoolean("google.cloud.auth.service.account.enable", true)
    spark.sparkContext.hadoopConfiguration.set("google.cloud.auth.service.account.json.keyfile", gsKeyFilePath)
    spark.sparkContext.hadoopConfiguration.set("fs.gs.project.id", gsProject)

    // Load the BigQuery table into a Spark DataFrame
    val dataFrame = spark.read.format("bigquery")
      .option("table", s"$bqDataset.$bqTable")
      .load()
      .cache()

    // Save the Spark DataFrame back to BigQuery as a table
    dataFrame.write.format("bigquery")
      .option("table", s"$bqDataset.$bqTable")
      .mode(SaveMode.Overwrite)
      .save()
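
For context, this write goes through the connector’s indirect method (note the BigQueryDeprecatedIndirectInsertableRelation frames in the trace below): the DataFrame is first staged as Parquet in the temporaryGcsBucket and then loaded into BigQuery. Roughly, as an illustration rather than the connector’s actual code:

    // Illustration only: the connector performs the equivalent of this
    // staging step, then runs a BigQuery load job over the staged files.
    // The staging prefix shown here is made up for the example.
    dataFrame.write.parquet(s"gs://$tempGsBucket/spark-bigquery-staging/")

The failing gs://…/_temporary/… path in the logs is exactly this staging write.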

Logs:

[error] com.google.cloud.bigquery.connector.common.BigQueryConnectorException: Failed to write to BigQuery
[error]  at com.google.cloud.spark.bigquery.write.BigQueryWriteHelper.writeDataFrameToBigQuery(BigQueryWriteHelper.java:110)
[error]  at com.google.cloud.spark.bigquery.write.BigQueryDeprecatedIndirectInsertableRelation.insert(BigQueryDeprecatedIndirectInsertableRelation.java:43)
[error]  at com.google.cloud.spark.bigquery.write.CreatableRelationProviderHelper.createRelation(CreatableRelationProviderHelper.java:54)
[error]  at com.google.cloud.spark.bigquery.BigQueryRelationProvider.createRelation(BigQueryRelationProvider.scala:106)
[error]  at org.apache.spark.sql.execution.datasources.SaveIntoDataSourceCommand.run(SaveIntoDataSourceCommand.scala:45)
[error]  at org.apache.spark.sql.execution.command.ExecutedCommandExec.sideEffectResult$lzycompute(commands.scala:70)
[error]  at org.apache.spark.sql.execution.command.ExecutedCommandExec.sideEffectResult(commands.scala:68)
[error]  at org.apache.spark.sql.execution.command.ExecutedCommandExec.doExecute(commands.scala:86)
[error]  at org.apache.spark.sql.execution.SparkPlan$$anonfun$execute$1.apply(SparkPlan.scala:136)
[error]  at org.apache.spark.sql.execution.SparkPlan$$anonfun$execute$1.apply(SparkPlan.scala:132)
[error]  at org.apache.spark.sql.execution.SparkPlan$$anonfun$executeQuery$1.apply(SparkPlan.scala:160)
[error]  at org.apache.spark.rdd.RDDOperationScope$.withScope(RDDOperationScope.scala:151)
[error]  at org.apache.spark.sql.execution.SparkPlan.executeQuery(SparkPlan.scala:157)
[error]  at org.apache.spark.sql.execution.SparkPlan.execute(SparkPlan.scala:132)
[error]  at org.apache.spark.sql.execution.QueryExecution.toRdd$lzycompute(QueryExecution.scala:83)
[error]  at org.apache.spark.sql.execution.QueryExecution.toRdd(QueryExecution.scala:81)
[error]  at org.apache.spark.sql.DataFrameWriter$$anonfun$runCommand$1.apply(DataFrameWriter.scala:696)
[error]  at org.apache.spark.sql.DataFrameWriter$$anonfun$runCommand$1.apply(DataFrameWriter.scala:696)
[error]  at org.apache.spark.sql.execution.SQLExecution$$anonfun$withNewExecutionId$1.apply(SQLExecution.scala:80)
[error]  at org.apache.spark.sql.execution.SQLExecution$.withSQLConfPropagated(SQLExecution.scala:127)
[error]  at org.apache.spark.sql.execution.SQLExecution$.withNewExecutionId(SQLExecution.scala:75)
[error]  at org.apache.spark.sql.DataFrameWriter.runCommand(DataFrameWriter.scala:696)
[error]  at org.apache.spark.sql.DataFrameWriter.saveToV1Source(DataFrameWriter.scala:305)
[error]  at org.apache.spark.sql.DataFrameWriter.save(DataFrameWriter.scala:291)
[error]  at helpers.Helpers$class.saveDataFrameToTable(helpers.scala:65)
[error]  at TransformColumn$.saveDataFrameToTable(transformColumn.scala:5)
[error]  at TransformColumn$.combined(transformColumn.scala:117)
[error]  at TransformColumn$.main(transformColumn.scala:69)
[error]  at TransformColumn.main(transformColumn.scala)
[error]  at java.base/jdk.internal.reflect.NativeMethodAccessorImpl.invoke0(Native Method)
[error]  at java.base/jdk.internal.reflect.NativeMethodAccessorImpl.invoke(NativeMethodAccessorImpl.java:62)
[error]  at java.base/jdk.internal.reflect.DelegatingMethodAccessorImpl.invoke(DelegatingMethodAccessorImpl.java:43)
[error]  at java.base/java.lang.reflect.Method.invoke(Method.java:566)
[error] Caused by: org.apache.spark.SparkException: Job aborted.
[error]  at org.apache.spark.sql.execution.datasources.FileFormatWriter$.write(FileFormatWriter.scala:202)
[error]  at org.apache.spark.sql.execution.datasources.InsertIntoHadoopFsRelationCommand.run(InsertIntoHadoopFsRelationCommand.scala:159)
[error]  at org.apache.spark.sql.execution.command.DataWritingCommandExec.sideEffectResult$lzycompute(commands.scala:104)
[error]  at org.apache.spark.sql.execution.command.DataWritingCommandExec.sideEffectResult(commands.scala:102)
[error]  at org.apache.spark.sql.execution.command.DataWritingCommandExec.doExecute(commands.scala:122)
[error]  at org.apache.spark.sql.execution.SparkPlan$$anonfun$execute$1.apply(SparkPlan.scala:136)
[error]  at org.apache.spark.sql.execution.SparkPlan$$anonfun$execute$1.apply(SparkPlan.scala:132)
[error]  at org.apache.spark.sql.execution.SparkPlan$$anonfun$executeQuery$1.apply(SparkPlan.scala:160)
[error]  at org.apache.spark.rdd.RDDOperationScope$.withScope(RDDOperationScope.scala:151)
[error]  at org.apache.spark.sql.execution.SparkPlan.executeQuery(SparkPlan.scala:157)
[error]  at org.apache.spark.sql.execution.SparkPlan.execute(SparkPlan.scala:132)
[error]  at org.apache.spark.sql.execution.QueryExecution.toRdd$lzycompute(QueryExecution.scala:83)
[error]  at org.apache.spark.sql.execution.QueryExecution.toRdd(QueryExecution.scala:81)
[error]  at org.apache.spark.sql.DataFrameWriter$$anonfun$runCommand$1.apply(DataFrameWriter.scala:696)
[error]  at org.apache.spark.sql.DataFrameWriter$$anonfun$runCommand$1.apply(DataFrameWriter.scala:696)
[error]  at org.apache.spark.sql.execution.SQLExecution$$anonfun$withNewExecutionId$1.apply(SQLExecution.scala:80)
[error]  at org.apache.spark.sql.execution.SQLExecution$.withSQLConfPropagated(SQLExecution.scala:127)
[error]  at org.apache.spark.sql.execution.SQLExecution$.withNewExecutionId(SQLExecution.scala:75)
[error]  at org.apache.spark.sql.DataFrameWriter.runCommand(DataFrameWriter.scala:696)
[error]  at org.apache.spark.sql.DataFrameWriter.saveToV1Source(DataFrameWriter.scala:305)
[error]  at org.apache.spark.sql.DataFrameWriter.save(DataFrameWriter.scala:291)
[error]  at org.apache.spark.sql.DataFrameWriter.save(DataFrameWriter.scala:249)
[error]  at com.google.cloud.spark.bigquery.write.BigQueryWriteHelper.writeDataFrameToBigQuery(BigQueryWriteHelper.java:105)
[error]  at com.google.cloud.spark.bigquery.write.BigQueryDeprecatedIndirectInsertableRelation.insert(BigQueryDeprecatedIndirectInsertableRelation.java:43)
[error]  at com.google.cloud.spark.bigquery.write.CreatableRelationProviderHelper.createRelation(CreatableRelationProviderHelper.java:54)
[error]  at com.google.cloud.spark.bigquery.BigQueryRelationProvider.createRelation(BigQueryRelationProvider.scala:106)
[error]  at org.apache.spark.sql.execution.datasources.SaveIntoDataSourceCommand.run(SaveIntoDataSourceCommand.scala:45)
[error]  at org.apache.spark.sql.execution.command.ExecutedCommandExec.sideEffectResult$lzycompute(commands.scala:70)
[error]  at org.apache.spark.sql.execution.command.ExecutedCommandExec.sideEffectResult(commands.scala:68)
[error]  at org.apache.spark.sql.execution.command.ExecutedCommandExec.doExecute(commands.scala:86)
[error]  at org.apache.spark.sql.execution.SparkPlan$$anonfun$execute$1.apply(SparkPlan.scala:136)
[error]  at org.apache.spark.sql.execution.SparkPlan$$anonfun$execute$1.apply(SparkPlan.scala:132)
[error]  at org.apache.spark.sql.execution.SparkPlan$$anonfun$executeQuery$1.apply(SparkPlan.scala:160)
[error]  at org.apache.spark.rdd.RDDOperationScope$.withScope(RDDOperationScope.scala:151)
[error]  at org.apache.spark.sql.execution.SparkPlan.executeQuery(SparkPlan.scala:157)
[error]  at org.apache.spark.sql.execution.SparkPlan.execute(SparkPlan.scala:132)
[error]  at org.apache.spark.sql.execution.QueryExecution.toRdd$lzycompute(QueryExecution.scala:83)
[error]  at org.apache.spark.sql.execution.QueryExecution.toRdd(QueryExecution.scala:81)
[error]  at org.apache.spark.sql.DataFrameWriter$$anonfun$runCommand$1.apply(DataFrameWriter.scala:696)
[error]  at org.apache.spark.sql.DataFrameWriter$$anonfun$runCommand$1.apply(DataFrameWriter.scala:696)
[error]  at org.apache.spark.sql.execution.SQLExecution$$anonfun$withNewExecutionId$1.apply(SQLExecution.scala:80)
[error]  at org.apache.spark.sql.execution.SQLExecution$.withSQLConfPropagated(SQLExecution.scala:127)
[error]  at org.apache.spark.sql.execution.SQLExecution$.withNewExecutionId(SQLExecution.scala:75)
[error]  at org.apache.spark.sql.DataFrameWriter.runCommand(DataFrameWriter.scala:696)
[error]  at org.apache.spark.sql.DataFrameWriter.saveToV1Source(DataFrameWriter.scala:305)
[error]  at org.apache.spark.sql.DataFrameWriter.save(DataFrameWriter.scala:291)
[error]  at helpers.Helpers$class.saveDataFrameToTable(helpers.scala:65)
[error]  at TransformColumn$.saveDataFrameToTable(transformColumn.scala:5)
[error]  at TransformColumn$.combined(transformColumn.scala:117)
[error]  at TransformColumn$.main(transformColumn.scala:69)
[error]  at TransformColumn.main(transformColumn.scala)
[error]  at java.base/jdk.internal.reflect.NativeMethodAccessorImpl.invoke0(Native Method)
[error]  at java.base/jdk.internal.reflect.NativeMethodAccessorImpl.invoke(NativeMethodAccessorImpl.java:62)
[error]  at java.base/jdk.internal.reflect.DelegatingMethodAccessorImpl.invoke(DelegatingMethodAccessorImpl.java:43)
[error]  at java.base/java.lang.reflect.Method.invoke(Method.java:566)
[error] Caused by: org.apache.spark.SparkException: Job aborted due to stage failure: Task 3 in stage 0.0 failed 1 times, most recent failure: Lost task 3.0 in stage 0.0 (TID 3, localhost, executor driver): org.apache.spark.SparkException: Task failed while writing rows.
[error]  at org.apache.spark.sql.execution.datasources.FileFormatWriter$.org$apache$spark$sql$execution$datasources$FileFormatWriter$$executeTask(FileFormatWriter.scala:261)
[error]  at org.apache.spark.sql.execution.datasources.FileFormatWriter$$anonfun$write$1.apply(FileFormatWriter.scala:174)
[error]  at org.apache.spark.sql.execution.datasources.FileFormatWriter$$anonfun$write$1.apply(FileFormatWriter.scala:173)
[error]  at org.apache.spark.scheduler.ResultTask.runTask(ResultTask.scala:90)
[error]  at org.apache.spark.scheduler.Task.run(Task.scala:123)
[error]  at org.apache.spark.executor.Executor$TaskRunner$$anonfun$10.apply(Executor.scala:411)
[error]  at org.apache.spark.util.Utils$.tryWithSafeFinally(Utils.scala:1360)
[error]  at org.apache.spark.executor.Executor$TaskRunner.run(Executor.scala:417)
[error]  at java.base/java.util.concurrent.ThreadPoolExecutor.runWorker(ThreadPoolExecutor.java:1128)
[error]  at java.base/java.util.concurrent.ThreadPoolExecutor$Worker.run(ThreadPoolExecutor.java:628)
[error]  at java.base/java.lang.Thread.run(Thread.java:829)
[error] Caused by: java.io.IOException: Failed to write 2707910 bytes in 'gs://gs-bucket/.spark-bigquery-local-1667736484732-4ffcecbc-47e4-4fce-b585-3ba03489deac/_temporary/0/_temporary/attempt_20221106120815_0000_m_000003_3/part-00003-015be3ad-69fa-4a5e-9ce8-070043126923-c000.snappy.parquet'
[error]  at com.google.cloud.hadoop.util.BaseAbstractGoogleAsyncWriteChannel.write(BaseAbstractGoogleAsyncWriteChannel.java:136)
[error]  at java.base/java.nio.channels.Channels.writeFullyImpl(Channels.java:74)
[error]  at java.base/java.nio.channels.Channels.writeFully(Channels.java:97)
[error]  at java.base/java.nio.channels.Channels$1.write(Channels.java:172)
[error]  at java.base/java.io.BufferedOutputStream.flushBuffer(BufferedOutputStream.java:81)
[error]  at java.base/java.io.BufferedOutputStream.write(BufferedOutputStream.java:127)
[error]  at com.google.cloud.hadoop.fs.gcs.GoogleHadoopOutputStream.write(GoogleHadoopOutputStream.java:108)
[error]  at org.apache.hadoop.fs.FSDataOutputStream$PositionCache.write(FSDataOutputStream.java:58)
[error]  at java.base/java.io.DataOutputStream.write(DataOutputStream.java:107)
[error]  at java.base/java.io.FilterOutputStream.write(FilterOutputStream.java:108)
[error]  at org.apache.parquet.hadoop.util.HadoopPositionOutputStream.write(HadoopPositionOutputStream.java:45)
[error]  at org.apache.parquet.bytes.ConcatenatingByteArrayCollector.writeAllTo(ConcatenatingByteArrayCollector.java:46)
[error]  at org.apache.parquet.hadoop.ParquetFileWriter.writeDataPages(ParquetFileWriter.java:460)
[error]  at org.apache.parquet.hadoop.ColumnChunkPageWriteStore$ColumnChunkPageWriter.writeToFileWriter(ColumnChunkPageWriteStore.java:201)
[error]  at org.apache.parquet.hadoop.ColumnChunkPageWriteStore.flushToFileWriter(ColumnChunkPageWriteStore.java:261)
[error]  at org.apache.parquet.hadoop.InternalParquetRecordWriter.flushRowGroupToStore(InternalParquetRecordWriter.java:173)
[error]  at org.apache.parquet.hadoop.InternalParquetRecordWriter.checkBlockSizeReached(InternalParquetRecordWriter.java:148)
[error]  at org.apache.parquet.hadoop.InternalParquetRecordWriter.write(InternalParquetRecordWriter.java:130)
[error]  at org.apache.parquet.hadoop.ParquetRecordWriter.write(ParquetRecordWriter.java:182)
[error]  at org.apache.parquet.hadoop.ParquetRecordWriter.write(ParquetRecordWriter.java:44)
[error]  at org.apache.spark.sql.execution.datasources.parquet.ParquetOutputWriter.write(ParquetOutputWriter.scala:40)
[error]  at org.apache.spark.sql.execution.datasources.SingleDirectoryDataWriter.write(FileFormatDataWriter.scala:137)
[error]  at org.apache.spark.sql.execution.datasources.FileFormatWriter$$anonfun$org$apache$spark$sql$execution$datasources$FileFormatWriter$$executeTask$3.apply(FileFormatWriter.scala:249)
[error]  at org.apache.spark.sql.execution.datasources.FileFormatWriter$$anonfun$org$apache$spark$sql$execution$datasources$FileFormatWriter$$executeTask$3.apply(FileFormatWriter.scala:246)
[error]  at org.apache.spark.util.Utils$.tryWithSafeFinallyAndFailureCallbacks(Utils.scala:1394)
[error]  at org.apache.spark.sql.execution.datasources.FileFormatWriter$.org$apache$spark$sql$execution$datasources$FileFormatWriter$$executeTask(FileFormatWriter.scala:252)
[error]  ... 10 more
[error]  Suppressed: java.io.IOException: The file being written is in an invalid state. Probably caused by an error thrown previously. Current state: COLUMN
[error]          at org.apache.parquet.hadoop.ParquetFileWriter$STATE.error(ParquetFileWriter.java:168)
[error]          at org.apache.parquet.hadoop.ParquetFileWriter$STATE.startBlock(ParquetFileWriter.java:160)
[error]          at org.apache.parquet.hadoop.ParquetFileWriter.startBlock(ParquetFileWriter.java:291)
[error]          at org.apache.parquet.hadoop.InternalParquetRecordWriter.flushRowGroupToStore(InternalParquetRecordWriter.java:171)
[error]          at org.apache.parquet.hadoop.InternalParquetRecordWriter.close(InternalParquetRecordWriter.java:114)
[error]          at org.apache.parquet.hadoop.ParquetRecordWriter.close(ParquetRecordWriter.java:165)
[error]          at org.apache.spark.sql.execution.datasources.parquet.ParquetOutputWriter.close(ParquetOutputWriter.scala:42)
[error]          at org.apache.spark.sql.execution.datasources.FileFormatDataWriter.releaseResources(FileFormatDataWriter.scala:57)
[error]          at org.apache.spark.sql.execution.datasources.FileFormatDataWriter.abort(FileFormatDataWriter.scala:83)
[error]          at org.apache.spark.sql.execution.datasources.FileFormatWriter$$anonfun$org$apache$spark$sql$execution$datasources$FileFormatWriter$$executeTask$1.apply$mcV$sp(FileFormatWriter.scala:254)
[error]          at org.apache.spark.util.Utils$.tryWithSafeFinallyAndFailureCallbacks(Utils.scala:1403)
[error]          ... 11 more
[error] Caused by: java.io.IOException: Pipe closed
[error]  at java.base/java.io.PipedInputStream.checkStateForReceive(PipedInputStream.java:260)
[error]  at java.base/java.io.PipedInputStream.receive(PipedInputStream.java:226)
[error]  at java.base/java.io.PipedOutputStream.write(PipedOutputStream.java:149)
[error]  at java.base/java.nio.channels.Channels$WritableByteChannelImpl.write(Channels.java:464)
[error]  at com.google.cloud.hadoop.util.BaseAbstractGoogleAsyncWriteChannel.write(BaseAbstractGoogleAsyncWriteChannel.java:133)
[error]  ... 35 more

I’m trying to understand the following error and what is causing it:

[error] Caused by: org.apache.spark.SparkException: Job aborted due to stage failure: Task 3 in stage 0.0 failed 1 times, most recent failure: Lost task 3.0 in stage 0.0 (TID 3, localhost, executor driver): org.apache.spark.SparkException: Task failed while writing rows.

Supporting information:

  • Testing locally
  • The DataFrame in question has 1000 partitions
  • Checked that the service account has the correct permissions to upload to BigQuery and Cloud Storage

I’m trying to understand whether this is caused by an error in my installation, my implementation, or both.

Would appreciate any help!

Issue Analytics

  • State: closed
  • Created: 10 months ago
  • Comments: 7 (1 by maintainers)

Top GitHub Comments

1 reaction
calum-mcg commented, Nov 15, 2022

Fixed by doing the following:

  • Adding .option("writeMethod", "direct") when writing to BQ (see the sketch below)
  • Downgrading the Java version (was getting Unsupported class file major version 55)
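
For reference, here is a minimal sketch of the corrected write call (not the author’s verbatim fix; dataFrame, bqDataset, and bqTable are reused from the question above):

    // Sketch only: "direct" writes rows through the BigQuery Storage Write API,
    // skipping the GCS staging step that was failing in the logs above.
    dataFrame.write.format("bigquery")
      .option("writeMethod", "direct")
      .option("table", s"$bqDataset.$bqTable")
      .mode(SaveMode.Overwrite)
      .save()

As for the second bullet: Unsupported class file major version 55 indicates Java 11 bytecode, and Spark 2.4.x only runs on Java 8, hence the downgrade.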
0 reactions
calum-mcg commented, Nov 15, 2022

@davidrabinowitz - I’m assuming that to force the direct method I can add .option("writeMethod", "direct") when writing to BQ.


Top Results From Across the Web

Spark Exception : Task failed while writing rows - Stack Overflow
It seems that speculative and non-speculative tasks are conflicting when writing parquet rows. sparkConf.set("spark.speculation","false"). (A sketch of this setting follows the results list.)
Task failed while writing rows. #600 - databricks/spark-xml
org.apache.spark.SparkException: Task failed while writing rows. at org.apache.spark.sql.execution.datasources.FileFormatWriter$.
Solved: org.apache.spark.SparkException: Task failed while...
Hi, I am using HDP 2.3.2 with Spark 1.4.1 and trying to insert data in a Hive table using Hive context. Below is the sample...
Error messages | BigQuery - Google Cloud
Error message | HTTP code | Description
stopped       | 200       | This status code returns when a job is canceled.
timeout       | 400       | The job timed out.
INTERNAL: Received unexpected EOS on DATA frame from ...
c.infaperf-141908.internal, executor 12): org.apache.spark.SparkException: Task failed while writing rows. ... Caused by: com.google.cloud.spark.bigquery.
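
As promised above, here is a minimal sketch of the speculative-execution setting suggested by the Stack Overflow result, applied to the session builder from the question. This is an assumption about a possible cause, not the fix confirmed in this thread (that was writeMethod = direct):

    // Sketch only: disable speculative task retries, per the Stack Overflow answer.
    val spark = SparkSession
      .builder()
      .appName("AppName")
      .config("spark.speculation", "false")
      .getOrCreate()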
