Stuck on an issue?

Lightrun Answers was designed to reduce the constant googling that comes with debugging 3rd party libraries. It collects links to all the places you might be looking at while hunting down a tough bug.

And, if you’re still stuck at the end, we’re happy to hop on a call to see how we can help out.

Error when writing to Redshift

See original GitHub issue

Hi, I am getting the following error when trying to write to Redshift from EMR/Spark. I am able to read from Redshift successfully. I am using Spark 2.2.0 on EMR and the databricks-redshift driver.

Appreciate any help to get this resolved quickly.

Caused by: java.lang.AbstractMethodError: org.apache.spark.sql.execution.datasources.OutputWriterFactory.getFileExtension(Lorg/apache/hadoop/mapreduce/TaskAttemptContext;)Ljava/lang/String

See details below.

Thanks!

**** DETAILS **** 17/12/01 01:25:43 WARN Utils$: The S3 bucket XXXXX does not have an object lifecycle configuration to ensure cleanup of temporary files. Consider configuring tempdir to point to a bucket with an object lifecycle policy that automatically deletes files after an expiration period. For more information, see https://docs.aws.amazon.com/AmazonS3/latest/dev/object-lifecycle-mgmt.html

[Stage 22:> (0 + 2) / 2]17/12/01 01:27:28 WARN TaskSetManager: Lost task 1.0 in stage 22.0 (TID 1234, ip-nnn-nnn-nnn.us-east-2.compute.internal, executor 11): org.apache.spark.SparkException: Task failed while writing rows at org.apache.spark.sql.execution.datasources.FileFormatWriter$.org$apache$spark$sql$execution$datasources$FileFormatWriter$$executeTask(FileFormatWriter.scala:272) at org.apache.spark.sql.execution.datasources.FileFormatWriter$$anonfun$write$1$$anonfun$apply$mcV$sp$1.apply(FileFormatWriter.scala:191) at org.apache.spark.sql.execution.datasources.FileFormatWriter$$anonfun$write$1$$anonfun$apply$mcV$sp$1.apply(FileFormatWriter.scala:190) at org.apache.spark.scheduler.ResultTask.runTask(ResultTask.scala:87) at org.apache.spark.scheduler.Task.run(Task.scala:108) at org.apache.spark.executor.Executor$TaskRunner.run(Executor.scala:335) at java.util.concurrent.ThreadPoolExecutor.runWorker(ThreadPoolExecutor.java:1149) at java.util.concurrent.ThreadPoolExecutor$Worker.run(ThreadPoolExecutor.java:624) at java.lang.Thread.run(Thread.java:748) Caused by: java.lang.AbstractMethodError: org.apache.spark.sql.execution.datasources.OutputWriterFactory.getFileExtension(Lorg/apache/hadoop/mapreduce/TaskAttemptContext;)Ljava/lang/String; at org.apache.spark.sql.execution.datasources.FileFormatWriter$SingleDirectoryWriteTask.newOutputWriter(FileFormatWriter.scala:299) at org.apache.spark.sql.execution.datasources.FileFormatWriter$SingleDirectoryWriteTask.execute(FileFormatWriter.scala:314) at org.apache.spark.sql.execution.datasources.FileFormatWriter$$anonfun$org$apache$spark$sql$execution$datasources$FileFormatWriter$$executeTask$3.apply(FileFormatWriter.scala:258) at org.apache.spark.sql.execution.datasources.FileFormatWriter$$anonfun$org$apache$spark$sql$execution$datasources$FileFormatWriter$$executeTask$3.apply(FileFormatWriter.scala:256) at org.apache.spark.util.Utils$.tryWithSafeFinallyAndFailureCallbacks(Utils.scala:1375) at org.apache.spark.sql.execution.datasources.FileFormatWriter$.org$apache$spark$sql$execution$datasources$FileFormatWriter$$executeTask(FileFormatWriter.scala:261) … 8 more

17/12/01 01:59:12 ERROR TaskSetManager: Task 1 in stage 3.0 failed 4 times; aborting job 17/12/01 01:59:12 ERROR FileFormatWriter: Aborting job null.

This is while trying to do the demo at AWS Big Data blog: Powering Amazon Redshift Analytics with Apache Spark and Amazon Machine Learning (https://aws.amazon.com/blogs/big-data/powering-amazon-redshift-analytics-with-apache-spark-and-amazon-machine-learning/#more-1340). My EMR cluster has Spark 2.2.0, and I invoked Spark as below:

======== spark-shell --jars spark-redshift_2.10-2.0.0.jar,/usr/share/aws/redshift/jdbc/RedshiftJDBC41.jar,minimal-json-0.9.4.jar,spark-avro_2.11-3.0.0.jar

Here is the code snippet used for writing…

val s3TempDir2 = “s3://<mybucket>/predict_flight_delay_with_Spark/output2/”

flightsDF.write .format(“com.databricks.spark.redshift”) .option(“temporary_aws_access_key_id”, awsAccessKey) .option(“temporary_aws_secret_access_key”, awsSecretKey) .option(“temporary_aws_session_token”, token) .option(“url”, jdbcURL) .option(“dbtable”, “ord_flights_new”) .option(“aws_iam_role”, “arn:aws:iam::xxxxxxxxx:role/RedShiftFullAccess”) .option(“tempdir”, s3TempDir2) .mode(SaveMode.Overwrite) .save()

============ print (firstFlightsDF) [2702961,10,2,2017-01-10,6,AA,1700,13,0,89,1,599]

flightsDF: org.apache.spark.sql.DataFrame = [id: bigint, day_of_month: int … 10 more fields] firstFlightsDF: org.apache.spark.sql.Row = [2702961,10,2,2017-01-10,6,AA,1700,13,0,89,1,599]

=============

Issue Analytics

State:
Created 6 years ago
Comments:14

Top GitHub Comments

1reaction

GabeChurchcommented, Jun 13, 2018

I’ll go ahead and post the solution that I came up with. Basically what I’ve discovered is that there are some compatibility issues using certain hadoop-aws jar to write to s3 with the s3a format. By switching to the s3n format you can combat this, the hadoop-aws 2.7.4 jar seems to be the consensus safest choice.

The aformentioned error also seems to be directly associated with an avro formatting problem.

I was able to write to redshift with all of the aforementioned combinations when using the following code to write.

df.write
    .format("com.databricks.spark.redshift")
    .option("url", "jdbc:redshift://yoururl$user=yourUser&password=your_password")
    .option("dbtable", "optionaldbname.tablename")
    .option("forward_spark_s3_credentials",true)
    .option("tempFormat", "CSV GZIP")  //default is avro, can also use CSV and CSV GZIP formats
    .option("tempdir", "s3n://myS3/path/to/bucket")
    .mode(SaveMode.Overwrite)
    .save()

Still interested in getting avro to work

0reactions

zbinkleytestcommented, Jun 18, 2019

Has anyone been able to solve this in Spark 2.4.3? I am using the following jars and packages when running a local instance, and have no trouble reading from Redshift. I am getting the same error as above when writing however (even when switching to csv format).

spark = SparkSession.builder.master(“local”).appName(“Test”)
.config(“spark.jars”, ‘RedshiftJDBC4-1.2.1.1001.jar,jets3t-0.9.0.jar,spark-avro_2.11-4.0.0.jar,hadoop-aws-2.7.4.jar’)
.config(“spark.jars.packages”, ‘com.databricks:spark-redshift_2.10:0.5.0,com.amazonaws:aws-java-sdk:1.10.34,org.apache.hadoop:hadoop-aws:2.7.4’)
.getOrCreate()

Top Results From Across the Web

How to fix errors when writing to a Redshift DB? - Openprise

To fix the problem, have your Redshift administrator apply the commands below: · GRANT ALL PRIVILEGES ON TABLE <table_name> TO <user>; · or...

Troubleshoot Amazon Redshift connection errors

Invalid operation connection error · 1. Open the Amazon Redshift console. · 2. Choose the Config tab. · 3. Modify the parameter group...

Spark - Read only error while trying to write to Redshift

sql.SQLException: [Amazon](500310) Invalid operation: The session is read-only; at com.amazon.redshift.client.

Error while initializing the writer : [Unsupported COPY ...

Starting the August 2016 (August 13th, 2016) release of the Amazon Redshift connector, validation check for the supported format has been ...

Error writing Redshift-[Amazon](500310) Invalid operation

Looks like the load/insert command generated by pipeline has a syntax error. How can I see the commands this pipeline is firing on...

Troubleshoot Live Code

Lightrun enables developers to add logs, metrics and snapshots to live code - no restarts or redeploys required.

Start Free

Top Related Reddit Thread

No results found

Top Related Tweet

No results found

Top Related Dev.to Post

No results found

Error when writing to Redshift

17/12/01 01:59:12 ERROR TaskSetManager: Task 1 in stage 3.0 failed 4 times; aborting job 17/12/01 01:59:12 ERROR FileFormatWriter: Aborting job null.

======== spark-shell --jars spark-redshift_2.10-2.0.0.jar,/usr/share/aws/redshift/jdbc/RedshiftJDBC41.jar,minimal-json-0.9.4.jar,spark-avro_2.11-3.0.0.jar

Issue Analytics

Top GitHub Comments

Top Results From Across the Web

Top Related Medium Post

Top Related StackOverflow Question

Troubleshoot Live Code

Top Related Reddit Thread

Top Related Hackernoon Post

Top Related Tweet

Top Related Dev.to Post

Top Related Hashnode Post

Spark Version 2.3.0 it not work

WARN Utils: An error occurred while trying to read the S3 bucket lifecycle configuration java.lang.NullPointerException