
Error when writing to Redshift

See original GitHub issue

Hi, I am getting the following error when trying to write to Redshift from Spark on EMR. I am able to read from Redshift successfully. I am using Spark 2.2.0 on EMR with the Databricks spark-redshift connector.

Appreciate any help to get this resolved quickly.

Caused by: java.lang.AbstractMethodError: org.apache.spark.sql.execution.datasources.OutputWriterFactory.getFileExtension(Lorg/apache/hadoop/mapreduce/TaskAttemptContext;)Ljava/lang/String;

See details below.

Thanks!

**** DETAILS ****

17/12/01 01:25:43 WARN Utils$: The S3 bucket XXXXX does not have an object lifecycle configuration to ensure cleanup of temporary files. Consider configuring tempdir to point to a bucket with an object lifecycle policy that automatically deletes files after an expiration period. For more information, see https://docs.aws.amazon.com/AmazonS3/latest/dev/object-lifecycle-mgmt.html

[Stage 22:> (0 + 2) / 2]
17/12/01 01:27:28 WARN TaskSetManager: Lost task 1.0 in stage 22.0 (TID 1234, ip-nnn-nnn-nnn.us-east-2.compute.internal, executor 11): org.apache.spark.SparkException: Task failed while writing rows
    at org.apache.spark.sql.execution.datasources.FileFormatWriter$.org$apache$spark$sql$execution$datasources$FileFormatWriter$$executeTask(FileFormatWriter.scala:272)
    at org.apache.spark.sql.execution.datasources.FileFormatWriter$$anonfun$write$1$$anonfun$apply$mcV$sp$1.apply(FileFormatWriter.scala:191)
    at org.apache.spark.sql.execution.datasources.FileFormatWriter$$anonfun$write$1$$anonfun$apply$mcV$sp$1.apply(FileFormatWriter.scala:190)
    at org.apache.spark.scheduler.ResultTask.runTask(ResultTask.scala:87)
    at org.apache.spark.scheduler.Task.run(Task.scala:108)
    at org.apache.spark.executor.Executor$TaskRunner.run(Executor.scala:335)
    at java.util.concurrent.ThreadPoolExecutor.runWorker(ThreadPoolExecutor.java:1149)
    at java.util.concurrent.ThreadPoolExecutor$Worker.run(ThreadPoolExecutor.java:624)
    at java.lang.Thread.run(Thread.java:748)
Caused by: java.lang.AbstractMethodError: org.apache.spark.sql.execution.datasources.OutputWriterFactory.getFileExtension(Lorg/apache/hadoop/mapreduce/TaskAttemptContext;)Ljava/lang/String;
    at org.apache.spark.sql.execution.datasources.FileFormatWriter$SingleDirectoryWriteTask.newOutputWriter(FileFormatWriter.scala:299)
    at org.apache.spark.sql.execution.datasources.FileFormatWriter$SingleDirectoryWriteTask.execute(FileFormatWriter.scala:314)
    at org.apache.spark.sql.execution.datasources.FileFormatWriter$$anonfun$org$apache$spark$sql$execution$datasources$FileFormatWriter$$executeTask$3.apply(FileFormatWriter.scala:258)
    at org.apache.spark.sql.execution.datasources.FileFormatWriter$$anonfun$org$apache$spark$sql$execution$datasources$FileFormatWriter$$executeTask$3.apply(FileFormatWriter.scala:256)
    at org.apache.spark.util.Utils$.tryWithSafeFinallyAndFailureCallbacks(Utils.scala:1375)
    at org.apache.spark.sql.execution.datasources.FileFormatWriter$.org$apache$spark$sql$execution$datasources$FileFormatWriter$$executeTask(FileFormatWriter.scala:261)
    ... 8 more

17/12/01 01:59:12 ERROR TaskSetManager: Task 1 in stage 3.0 failed 4 times; aborting job
17/12/01 01:59:12 ERROR FileFormatWriter: Aborting job null.
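As an aside, the lifecycle warning at the top of the log is housekeeping advice rather than the cause of the failure. One way to follow it is to attach an expiration rule to the temp bucket; below is a minimal sketch with the AWS SDK for Java, where the bucket name and prefix are placeholders from this thread.

import com.amazonaws.services.s3.AmazonS3ClientBuilder
import com.amazonaws.services.s3.model.BucketLifecycleConfiguration

// Expire objects under the spark-redshift tempdir prefix one day after
// creation; "<mybucket>" and the prefix below are placeholders.
val s3 = AmazonS3ClientBuilder.defaultClient()
val expireTemp = new BucketLifecycleConfiguration.Rule()
    .withId("expire-spark-redshift-tempdir")
    .withPrefix("predict_flight_delay_with_Spark/")
    .withExpirationInDays(1)
    .withStatus(BucketLifecycleConfiguration.ENABLED)
s3.setBucketLifecycleConfiguration("<mybucket>",
    new BucketLifecycleConfiguration().withRules(expireTemp))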

This is while following the demo from the AWS Big Data blog post Powering Amazon Redshift Analytics with Apache Spark and Amazon Machine Learning (https://aws.amazon.com/blogs/big-data/powering-amazon-redshift-analytics-with-apache-spark-and-amazon-machine-learning/#more-1340). My EMR cluster has Spark 2.2.0, and I invoked Spark as below:

spark-shell --jars spark-redshift_2.10-2.0.0.jar,/usr/share/aws/redshift/jdbc/RedshiftJDBC41.jar,minimal-json-0.9.4.jar,spark-avro_2.11-3.0.0.jar
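One thing worth noting: this invocation mixes the Scala 2.10 build of spark-redshift with the Scala 2.11 build of spark-avro, and spark-avro 3.0.0 predates the OutputWriterFactory.getFileExtension method that Spark 2.2 calls during writes, which is exactly the method named in the AbstractMethodError above. A sketch of an invocation aligned on Scala 2.11 with a Spark 2.2-compatible spark-avro (the artifact versions here are assumptions to verify against Maven Central):

spark-shell --jars /usr/share/aws/redshift/jdbc/RedshiftJDBC41.jar --packages com.databricks:spark-redshift_2.11:2.0.1,com.databricks:spark-avro_2.11:4.0.0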

Here is the code snippet used for writing…

val s3TempDir2 = "s3://<mybucket>/predict_flight_delay_with_Spark/output2/"

flightsDF.write
    .format("com.databricks.spark.redshift")
    .option("temporary_aws_access_key_id", awsAccessKey)
    .option("temporary_aws_secret_access_key", awsSecretKey)
    .option("temporary_aws_session_token", token)
    .option("url", jdbcURL)
    .option("dbtable", "ord_flights_new")
    .option("aws_iam_role", "arn:aws:iam::xxxxxxxxx:role/RedShiftFullAccess")
    .option("tempdir", s3TempDir2)
    .mode(SaveMode.Overwrite)
    .save()

print(firstFlightsDF)
[2702961,10,2,2017-01-10,6,AA,1700,13,0,89,1,599]

flightsDF.printSchema()
root
 |-- id: long (nullable = true)
 |-- day_of_month: integer (nullable = true)
 |-- day_of_week: integer (nullable = true)
 |-- fl_date: date (nullable = true)
 |-- days_to_holiday: integer (nullable = true)
 |-- unique_carrier: string (nullable = true)
 |-- fl_num: string (nullable = true)
 |-- dep_hour: string (nullable = true)
 |-- dep_del15: integer (nullable = true)
 |-- air_time: integer (nullable = true)
 |-- flights: integer (nullable = true)
 |-- distance: integer (nullable = true)

flightsDF: org.apache.spark.sql.DataFrame = [id: bigint, day_of_month: int ... 10 more fields]
firstFlightsDF: org.apache.spark.sql.Row = [2702961,10,2,2017-01-10,6,AA,1700,13,0,89,1,599]


Issue Analytics

  • State: open
  • Created: 6 years ago
  • Comments: 14

Top GitHub Comments

1 reaction
GabeChurch commented, Jun 13, 2018

I’ll go ahead and post the solution that I came up with. What I’ve discovered is that certain hadoop-aws jars have compatibility issues when writing to S3 with the s3a scheme. Switching to the s3n scheme works around this, and the hadoop-aws 2.7.4 jar seems to be the consensus safest choice.

The aforementioned error also seems to be directly associated with an Avro formatting problem.

I was able to write to Redshift with all of the aforementioned combinations using the following code:

df.write
    .format("com.databricks.spark.redshift")
    .option("url", "jdbc:redshift://yoururl?user=yourUser&password=your_password")  // '?' separates the URL from its parameters
    .option("dbtable", "optionaldbname.tablename")
    .option("forward_spark_s3_credentials", true)
    .option("tempFormat", "CSV GZIP")  // default is AVRO; CSV and CSV GZIP also work
    .option("tempdir", "s3n://myS3/path/to/bucket")
    .mode(SaveMode.Overwrite)
    .save()

Still interested in getting Avro to work.
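For what it’s worth, a quick sanity check is to read the table back through the same connector after the write; a minimal sketch reusing the placeholder connection options from above:

val check = spark.read
    .format("com.databricks.spark.redshift")
    .option("url", "jdbc:redshift://yoururl?user=yourUser&password=your_password")
    .option("dbtable", "optionaldbname.tablename")
    .option("forward_spark_s3_credentials", true)
    .option("tempdir", "s3n://myS3/path/to/bucket")
    .load()
println(check.count())  // row count should match the DataFrame that was written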

0 reactions
zbinkleytest commented, Jun 18, 2019

Has anyone been able to solve this in Spark 2.4.3? I am using the following jars and packages when running a local instance, and have no trouble reading from Redshift. However, I am getting the same error as above when writing (even after switching to CSV format).

spark = SparkSession.builder.master("local").appName("Test") \
    .config("spark.jars", "RedshiftJDBC4-1.2.1.1001.jar,jets3t-0.9.0.jar,spark-avro_2.11-4.0.0.jar,hadoop-aws-2.7.4.jar") \
    .config("spark.jars.packages", "com.databricks:spark-redshift_2.10:0.5.0,com.amazonaws:aws-java-sdk:1.10.34,org.apache.hadoop:hadoop-aws:2.7.4") \
    .getOrCreate()
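One thing that jumps out in that setup: com.databricks:spark-redshift_2.10:0.5.0 is a Scala 2.10 build, while Spark 2.4.3 runs on Scala 2.11/2.12, so the same kind of binary mismatch described above would be expected. A sketch of an aligned setup (shown in Scala; the community-fork coordinates and version are assumptions to double-check on Maven Central):

import org.apache.spark.sql.SparkSession

// Keep every artifact on the Scala 2.11 line that Spark 2.4.3 ships with.
// The io.github.spark-redshift-community fork of spark-redshift targets
// Spark 2.4, unlike the older com.databricks artifacts.
val spark = SparkSession.builder()
    .master("local[*]")
    .appName("Test")
    .config("spark.jars", "RedshiftJDBC4-1.2.1.1001.jar")
    .config("spark.jars.packages",
      "io.github.spark-redshift-community:spark-redshift_2.11:4.0.1," +
      "org.apache.hadoop:hadoop-aws:2.7.4")
    .getOrCreate()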

Read more comments on GitHub >

Top Results From Across the Web

How to fix errors when writing to a Redshift DB? - Openprise
To fix the problem, have your Redshift administrator apply the commands below: · GRANT ALL PRIVILEGES ON TABLE <table_name> TO <user>; · or...
Read more >
Troubleshoot Amazon Redshift connection errors
Invalid operation connection error · 1. Open the Amazon Redshift console. · 2. Choose the Config tab. · 3. Modify the parameter group...
Read more >
Spark - Read only error while trying to write to Redshift
sql.SQLException: [Amazon](500310) Invalid operation: The session is read-only; at com.amazon.redshift.client.
Read more >
Error while initializing the writer : [Unsupported COPY ...
Starting the August 2016 (August 13th, 2016) release of the Amazon Redshift connector, validation check for the supported format has been ...
Read more >
Error writing Redshift-[Amazon](500310) Invalid operation
Looks like the load/insert command generated by pipeline has a syntax error. How can I see the commands this pipeline is firing on...
Read more >
