Error when writing to Redshift
Hi, I am getting the following error when trying to write to Redshift from EMR/Spark. I can read from Redshift successfully. I am using Spark 2.2.0 on EMR with the Databricks spark-redshift driver.
Appreciate any help to get this resolved quickly.
Caused by: java.lang.AbstractMethodError: org.apache.spark.sql.execution.datasources.OutputWriterFactory.getFileExtension(Lorg/apache/hadoop/mapreduce/TaskAttemptContext;)Ljava/lang/String;
See details below.
Thanks!
**** DETAILS ****
17/12/01 01:25:43 WARN Utils$: The S3 bucket XXXXX does not have an object lifecycle configuration to ensure cleanup of temporary files. Consider configuring tempdir
to point to a bucket with an object lifecycle policy that automatically deletes files after an expiration period. For more information, see https://docs.aws.amazon.com/AmazonS3/latest/dev/object-lifecycle-mgmt.html
17/12/01 01:27:28 WARN TaskSetManager: Lost task 1.0 in stage 22.0 (TID 1234, ip-nnn-nnn-nnn.us-east-2.compute.internal, executor 11): org.apache.spark.SparkException: Task failed while writing rows
	at org.apache.spark.sql.execution.datasources.FileFormatWriter$.org$apache$spark$sql$execution$datasources$FileFormatWriter$$executeTask(FileFormatWriter.scala:272)
	at org.apache.spark.sql.execution.datasources.FileFormatWriter$$anonfun$write$1$$anonfun$apply$mcV$sp$1.apply(FileFormatWriter.scala:191)
	at org.apache.spark.sql.execution.datasources.FileFormatWriter$$anonfun$write$1$$anonfun$apply$mcV$sp$1.apply(FileFormatWriter.scala:190)
	at org.apache.spark.scheduler.ResultTask.runTask(ResultTask.scala:87)
	at org.apache.spark.scheduler.Task.run(Task.scala:108)
	at org.apache.spark.executor.Executor$TaskRunner.run(Executor.scala:335)
	at java.util.concurrent.ThreadPoolExecutor.runWorker(ThreadPoolExecutor.java:1149)
	at java.util.concurrent.ThreadPoolExecutor$Worker.run(ThreadPoolExecutor.java:624)
	at java.lang.Thread.run(Thread.java:748)
Caused by: java.lang.AbstractMethodError: org.apache.spark.sql.execution.datasources.OutputWriterFactory.getFileExtension(Lorg/apache/hadoop/mapreduce/TaskAttemptContext;)Ljava/lang/String;
	at org.apache.spark.sql.execution.datasources.FileFormatWriter$SingleDirectoryWriteTask.newOutputWriter(FileFormatWriter.scala:299)
	at org.apache.spark.sql.execution.datasources.FileFormatWriter$SingleDirectoryWriteTask.execute(FileFormatWriter.scala:314)
	at org.apache.spark.sql.execution.datasources.FileFormatWriter$$anonfun$org$apache$spark$sql$execution$datasources$FileFormatWriter$$executeTask$3.apply(FileFormatWriter.scala:258)
	at org.apache.spark.sql.execution.datasources.FileFormatWriter$$anonfun$org$apache$spark$sql$execution$datasources$FileFormatWriter$$executeTask$3.apply(FileFormatWriter.scala:256)
	at org.apache.spark.util.Utils$.tryWithSafeFinallyAndFailureCallbacks(Utils.scala:1375)
	at org.apache.spark.sql.execution.datasources.FileFormatWriter$.org$apache$spark$sql$execution$datasources$FileFormatWriter$$executeTask(FileFormatWriter.scala:261)
	... 8 more
17/12/01 01:59:12 ERROR TaskSetManager: Task 1 in stage 3.0 failed 4 times; aborting job
17/12/01 01:59:12 ERROR FileFormatWriter: Aborting job null.
This happened while following the demo in the AWS Big Data blog post Powering Amazon Redshift Analytics with Apache Spark and Amazon Machine Learning (https://aws.amazon.com/blogs/big-data/powering-amazon-redshift-analytics-with-apache-spark-and-amazon-machine-learning/#more-1340). My EMR cluster has Spark 2.2.0, and I invoked Spark as below:
======== spark-shell --jars spark-redshift_2.10-2.0.0.jar,/usr/share/aws/redshift/jdbc/RedshiftJDBC41.jar,minimal-json-0.9.4.jar,spark-avro_2.11-3.0.0.jar
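Worth noting (my observation, not stated in the original report): java.lang.AbstractMethodError is a classic symptom of binary incompatibility, and the --jars list above mixes Scala binary versions (spark-redshift_2.10 alongside spark-avro_2.11). A quick sanity check over the artifact names can flag this; the helper below is hypothetical, not part of the thread:

```python
import re

def scala_versions(jar_names):
    """Extract the Scala binary-version suffix (e.g. '2.10', '2.11')
    from a list of artifact names. More than one distinct suffix in a
    classpath hints at binary incompatibility such as AbstractMethodError."""
    found = set()
    for name in jar_names:
        m = re.search(r"_(2\.\d+)[-.]", name)
        if m:
            found.add(m.group(1))
    return found

jars = [
    "spark-redshift_2.10-2.0.0.jar",
    "RedshiftJDBC41.jar",          # plain Java jar, no Scala suffix
    "minimal-json-0.9.4.jar",
    "spark-avro_2.11-3.0.0.jar",
]
print(scala_versions(jars))  # more than one entry means mixed Scala builds
```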
Here is the code snippet used for writing…
val s3TempDir2 = "s3://<mybucket>/predict_flight_delay_with_Spark/output2/"

flightsDF.write
  .format("com.databricks.spark.redshift")
  .option("temporary_aws_access_key_id", awsAccessKey)
  .option("temporary_aws_secret_access_key", awsSecretKey)
  .option("temporary_aws_session_token", token)
  .option("url", jdbcURL)
  .option("dbtable", "ord_flights_new")
  .option("aws_iam_role", "arn:aws:iam::xxxxxxxxx:role/RedShiftFullAccess")
  .option("tempdir", s3TempDir2)
  .mode(SaveMode.Overwrite)
  .save()
============
print(firstFlightsDF)
[2702961,10,2,2017-01-10,6,AA,1700,13,0,89,1,599]
flightsDF.printSchema()
root
 |-- id: long (nullable = true)
 |-- day_of_month: integer (nullable = true)
 |-- day_of_week: integer (nullable = true)
 |-- fl_date: date (nullable = true)
 |-- days_to_holiday: integer (nullable = true)
 |-- unique_carrier: string (nullable = true)
 |-- fl_num: string (nullable = true)
 |-- dep_hour: string (nullable = true)
 |-- dep_del15: integer (nullable = true)
 |-- air_time: integer (nullable = true)
 |-- flights: integer (nullable = true)
 |-- distance: integer (nullable = true)

flightsDF: org.apache.spark.sql.DataFrame = [id: bigint, day_of_month: int ... 10 more fields]
firstFlightsDF: org.apache.spark.sql.Row = [2702961,10,2,2017-01-10,6,AA,1700,13,0,89,1,599]
=============
Issue Analytics
- Created: 6 years ago
- Comments: 14
Top GitHub Comments
I'll go ahead and post the solution I came up with. What I discovered is that certain hadoop-aws jars have compatibility issues when writing to S3 via the s3a:// scheme. Switching to the s3n:// scheme works around this; hadoop-aws 2.7.4 seems to be the consensus safest choice.
The aforementioned error also seems to be directly associated with an Avro formatting problem.
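The s3a-to-s3n switch described above amounts to rewriting the scheme on the tempdir URI. A minimal sketch of that rewrite (the helper name is mine, not from the thread):

```python
def to_s3n(tempdir):
    """Rewrite an s3a:// (or bare s3://) tempdir URI to the s3n://
    scheme, the workaround described above for older hadoop-aws jars."""
    for prefix in ("s3a://", "s3://"):
        if tempdir.startswith(prefix):
            return "s3n://" + tempdir[len(prefix):]
    return tempdir

print(to_s3n("s3a://mybucket/temp/"))  # s3n://mybucket/temp/
```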
I was able to write to Redshift with all of the aforementioned jar combinations when using the following code to write.
Still interested in getting Avro to work, though.
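The working write code itself was not captured in this excerpt. Given the Avro diagnosis above, one plausible shape for it uses the connector's tempformat option (the Databricks spark-redshift connector supports staging as CSV instead of the default Avro). The sketch below only builds the option map, so it can be shown without a cluster; the function name and argument names are mine:

```python
def redshift_write_options(jdbc_url, table, tempdir, iam_role):
    """Assemble the option map for a spark-redshift write that stages
    data in S3 as CSV instead of the default Avro ('tempformat' is a
    documented option of the Databricks spark-redshift connector)."""
    return {
        "url": jdbc_url,
        "dbtable": table,
        "tempdir": tempdir,
        "aws_iam_role": iam_role,
        "tempformat": "CSV",  # sidestep the Avro write path implicated above
    }

# With a DataFrame `df`, the options would be applied like:
# (df.write.format("com.databricks.spark.redshift")
#    .options(**redshift_write_options(jdbcURL, "ord_flights_new",
#                                      s3TempDir2, roleArn))
#    .mode("overwrite")
#    .save())
```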
Has anyone been able to solve this in Spark 2.4.3? I am using the following jars and packages when running a local instance and have no trouble reading from Redshift, but I get the same error as above when writing (even when switching to CSV format).
spark = SparkSession.builder.master("local").appName("Test") \
    .config("spark.jars", "RedshiftJDBC4-1.2.1.1001.jar,jets3t-0.9.0.jar,spark-avro_2.11-4.0.0.jar,hadoop-aws-2.7.4.jar") \
    .config("spark.jars.packages", "com.databricks:spark-redshift_2.10:0.5.0,com.amazonaws:aws-java-sdk:1.10.34,org.apache.hadoop:hadoop-aws:2.7.4") \
    .getOrCreate()