
com.databricks.spark.redshift.RedshiftFileFormat.supportDataType(Lorg/apache/spark/sql/types/DataType;Z)Z

See original GitHub issue

I’m attempting to run a simple query against a Redshift table and am getting the following error.

Exception in thread "main" java.lang.AbstractMethodError: com.databricks.spark.redshift.RedshiftFileFormat.supportDataType(Lorg/apache/spark/sql/types/DataType;Z)Z
	at org.apache.spark.sql.execution.datasources.DataSourceUtils$$anonfun$verifySchema$1.apply(DataSourceUtils.scala:48)
	at org.apache.spark.sql.execution.datasources.DataSourceUtils$$anonfun$verifySchema$1.apply(DataSourceUtils.scala:47)
	at scala.collection.Iterator$class.foreach(Iterator.scala:893)
	at scala.collection.AbstractIterator.foreach(Iterator.scala:1336)
	at scala.collection.IterableLike$class.foreach(IterableLike.scala:72)
	at org.apache.spark.sql.types.StructType.foreach(StructType.scala:99)
	at org.apache.spark.sql.execution.datasources.DataSourceUtils$.verifySchema(DataSourceUtils.scala:47)
	at org.apache.spark.sql.execution.datasources.DataSourceUtils$.verifyReadSchema(DataSourceUtils.scala:39)
	at org.apache.spark.sql.execution.datasources.DataSource.resolveRelation(DataSource.scala:400)
	at org.apache.spark.sql.DataFrameReader.loadV1Source(DataFrameReader.scala:223)
	at org.apache.spark.sql.DataFrameReader.load(DataFrameReader.scala:211)
	at com.databricks.spark.redshift.RedshiftRelation.buildScan(RedshiftRelation.scala:168)
	at org.apache.spark.sql.execution.datasources.DataSourceStrategy$$anonfun$10.apply(DataSourceStrategy.scala:293)
	at org.apache.spark.sql.execution.datasources.DataSourceStrategy$$anonfun$10.apply(DataSourceStrategy.scala:293)
	at org.apache.spark.sql.execution.datasources.DataSourceStrategy$$anonfun$pruneFilterProject$1.apply(DataSourceStrategy.scala:326)
	at org.apache.spark.sql.execution.datasources.DataSourceStrategy$$anonfun$pruneFilterProject$1.apply(DataSourceStrategy.scala:325)
	at org.apache.spark.sql.execution.datasources.DataSourceStrategy.pruneFilterProjectRaw(DataSourceStrategy.scala:403)
	at org.apache.spark.sql.execution.datasources.DataSourceStrategy.pruneFilterProject(DataSourceStrategy.scala:321)
	at org.apache.spark.sql.execution.datasources.DataSourceStrategy.apply(DataSourceStrategy.scala:289)
	at org.apache.spark.sql.catalyst.planning.QueryPlanner$$anonfun$1.apply(QueryPlanner.scala:63)
	at org.apache.spark.sql.catalyst.planning.QueryPlanner$$anonfun$1.apply(QueryPlanner.scala:63)
	at scala.collection.Iterator$$anon$12.nextCur(Iterator.scala:434)
	at scala.collection.Iterator$$anon$12.hasNext(Iterator.scala:440)
	at scala.collection.Iterator$$anon$12.hasNext(Iterator.scala:439)
	at org.apache.spark.sql.catalyst.planning.QueryPlanner.plan(QueryPlanner.scala:93)
	at org.apache.spark.sql.catalyst.planning.QueryPlanner$$anonfun$2$$anonfun$apply$2.apply(QueryPlanner.scala:78)
	at org.apache.spark.sql.catalyst.planning.QueryPlanner$$anonfun$2$$anonfun$apply$2.apply(QueryPlanner.scala:75)
	at scala.collection.TraversableOnce$$anonfun$foldLeft$1.apply(TraversableOnce.scala:157)
	at scala.collection.TraversableOnce$$anonfun$foldLeft$1.apply(TraversableOnce.scala:157)
	at scala.collection.Iterator$class.foreach(Iterator.scala:893)
	at scala.collection.AbstractIterator.foreach(Iterator.scala:1336)
	at scala.collection.TraversableOnce$class.foldLeft(TraversableOnce.scala:157)
	at scala.collection.AbstractIterator.foldLeft(Iterator.scala:1336)
	at org.apache.spark.sql.catalyst.planning.QueryPlanner$$anonfun$2.apply(QueryPlanner.scala:75)
	at org.apache.spark.sql.catalyst.planning.QueryPlanner$$anonfun$2.apply(QueryPlanner.scala:67)
	at scala.collection.Iterator$$anon$12.nextCur(Iterator.scala:434)
	at scala.collection.Iterator$$anon$12.hasNext(Iterator.scala:440)
	at org.apache.spark.sql.catalyst.planning.QueryPlanner.plan(QueryPlanner.scala:93)
	at org.apache.spark.sql.execution.QueryExecution.sparkPlan$lzycompute(QueryExecution.scala:72)
	at org.apache.spark.sql.execution.QueryExecution.sparkPlan(QueryExecution.scala:68)
	at org.apache.spark.sql.execution.QueryExecution.executedPlan$lzycompute(QueryExecution.scala:77)
	at org.apache.spark.sql.execution.QueryExecution.executedPlan(QueryExecution.scala:77)
	at org.apache.spark.sql.Dataset.withAction(Dataset.scala:3360)
	at org.apache.spark.sql.Dataset.head(Dataset.scala:2545)
	at org.apache.spark.sql.Dataset.take(Dataset.scala:2759)
	at org.apache.spark.sql.Dataset.getRows(Dataset.scala:255)
	at org.apache.spark.sql.Dataset.showString(Dataset.scala:292)
	at org.apache.spark.sql.Dataset.show(Dataset.scala:746)
	at org.apache.spark.sql.Dataset.show(Dataset.scala:705)
	at org.apache.spark.sql.Dataset.show(Dataset.scala:714)
	at LogChecks$.main(LogChecks.scala:90)
	at LogChecks.main(LogChecks.scala)
18/12/12 10:14:24 INFO SparkContext: Invoking stop() from shutdown hook
18/12/12 10:14:24 INFO SparkUI: Stopped Spark web UI at http://10.75.8.183:4040
18/12/12 10:14:24 INFO MapOutputTrackerMasterEndpoint: MapOutputTrackerMasterEndpoint stopped!
18/12/12 10:14:24 INFO MemoryStore: MemoryStore cleared
18/12/12 10:14:24 INFO BlockManager: BlockManager stopped
18/12/12 10:14:24 INFO BlockManagerMaster: BlockManagerMaster stopped
18/12/12 10:14:24 INFO OutputCommitCoordinator$OutputCommitCoordinatorEndpoint: OutputCommitCoordinator stopped!
18/12/12 10:14:24 INFO SparkContext: Successfully stopped SparkContext
18/12/12 10:14:24 INFO ShutdownHookManager: Shutdown hook called
18/12/12 10:14:24 INFO ShutdownHookManager: Deleting directory /private/var/folders/01/333yn1tn7rb93sspgwp_z_p804ccgt/T/spark-7c91c661-11e5-4b22-8d9f-2afe851dcdd6

CheckRedshift.scala

import org.apache.spark.sql.SparkSession
import org.apache.spark.{SparkConf, SparkContext}

object CheckRedshift {
    def main(args: Array[String]): Unit = {

        val awsAccessKeyId="asdfasdf"
        val awsSecretAccessKey="asdlfjakfalksdf"
        val redshiftDBName = "dbName"
        val redshiftUserId = "*****"
        val redshiftPassword = "*****"
        val redshifturl = "blah-blah.us-west-1.redshift.amazonaws.com:5555"
        val jdbcURL = s"jdbc:redshift://$redshifturl/$redshiftDBName?user=$redshiftUserId&password=$redshiftPassword"
        println(jdbcURL)

        val sc = new SparkContext(new SparkConf().setAppName("SparkSQL").setMaster("local"))

        // Configure SparkContext to communicate with AWS
        val tempS3Dir = "s3a://mys3bucket/temp"
        sc.hadoopConfiguration.set("fs.s3a.access.key", awsAccessKeyId)
        sc.hadoopConfiguration.set("fs.s3a.secret.key", awsSecretAccessKey)

        // Create the SQL Context
        val sqlContext = SparkSession.builder.config(sc.getConf).getOrCreate()


        import sqlContext.implicits._

        val myQuery =
            """SELECT * FROM public.redshift_table
                        WHERE date_time >= '2018-01-01'
                        AND date_time < '2018-01-02'"""
        val df = sqlContext.read
          .format("com.databricks.spark.redshift")
          .option("url", jdbcURL)
          .option("tempdir", tempS3Dir)
          .option("forward_spark_s3_credentials", "true")
          .option("query", myQuery)
          .load()
        df.show()


    }
}

build.sbt

name := "log_check"

version := "0.1"

scalaVersion := "2.11.8"

// https://mvnrepository.com/artifact/org.apache.spark/spark-core
libraryDependencies += "org.apache.spark" %% "spark-core" % "2.4.0"

// https://mvnrepository.com/artifact/org.apache.spark/spark-sql
libraryDependencies += "org.apache.spark" %% "spark-sql" % "2.4.0"

libraryDependencies += "org.json4s" %% "json4s-native" % "3.2.9"

//https://github.com/databricks/spark-redshift
resolvers += "jitpack" at "https://jitpack.io"
libraryDependencies += "com.github.databricks" %% "spark-redshift" % "master-SNAPSHOT"

libraryDependencies += "org.apache.hadoop" % "hadoop-aws" % "2.6.0"

libraryDependencies += "com.amazonaws" % "aws-java-sdk" % "1.11.407"

libraryDependencies += "org.apache.hadoop" % "hadoop-client" % "2.6.0"
libraryDependencies += "net.java.dev.jets3t" % "jets3t" % "0.9.4"

// https://mvnrepository.com/artifact/com.amazon/redshift-jdbc41
resolvers += "redshift" at "http://redshift-maven-repository.s3-website-us-east-1.amazonaws.com/release"
libraryDependencies += "com.amazon.redshift" % "redshift-jdbc4" % "1.2.10.1009"

// Temporary fix for: https://github.com/databricks/spark-redshift/issues/315
dependencyOverrides += "com.databricks" % "spark-avro_2.11" % "3.2.0"

Issue Analytics

  • State: open
  • Created 5 years ago
  • Reactions: 12
  • Comments: 9

Top GitHub Comments

6 reactions
lucagiovagnoli commented, Jul 2, 2019

So, we’ve started a community edition of spark-redshift which works with Spark 2.4. Feel free to try it out! If you do, it’d be very helpful to receive your feedback. Any pull requests are very welcome too.

https://github.com/spark-redshift-community/spark-redshift

pinging @andybrnr, @Bennyelg, @julienbachmann who were having issues with 2.4
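For anyone who wants to try the community fork mentioned above, the switch amounts to swapping the dependency and the data source name. The Maven coordinates, version, and format string below are assumptions on my part — verify them against the fork’s README before relying on them.

// build.sbt: replace the jitpack master-SNAPSHOT dependency with the community artifact.
// Coordinates and version are assumptions -- check the fork's README for the current ones.
libraryDependencies += "io.github.spark-redshift-community" %% "spark-redshift" % "4.0.1"

// CheckRedshift.scala: same read as in the issue, but pointing at the fork's data source name (assumed).
val df = sqlContext.read
  .format("io.github.spark_redshift_community.spark.redshift")
  .option("url", jdbcURL)
  .option("tempdir", tempS3Dir)
  .option("forward_spark_s3_credentials", "true")
  .option("query", myQuery)
  .load()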

2 reactions
julienbachmann commented, Feb 12, 2019

This issue is related to Spark 2.4. A workaround is to use Spark 2.3.1. Unfortunately, it seems nobody is working on adding support for 2.4; see #418.

It looks like this fork https://github.com/Yelp/spark-redshift added support for Spark 2.4, but I haven’t tested it. To be checked.
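Some context on why the downgrade helps: an AbstractMethodError like the one in this issue means the JVM was asked to invoke a method for which the loaded class has no implementation. In the stack trace, Spark 2.4’s DataSourceUtils.verifySchema calls supportDataType(dataType, isReadPath) on the FileFormat, but the spark-redshift jar was compiled against Spark 2.3, whose FileFormat did not yet have that hook, so RedshiftFileFormat never picked up an implementation. Pinning Spark back to 2.3.x avoids that call entirely. A minimal sketch of the build.sbt change, assuming staying on 2.3.x is acceptable:

// build.sbt: pin Spark to 2.3.x so the 2.4-only schema check that invokes
// supportDataType is never reached. The rest of the build file can stay as-is.
libraryDependencies += "org.apache.spark" %% "spark-core" % "2.3.1"
libraryDependencies += "org.apache.spark" %% "spark-sql"  % "2.3.1"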

Read more comments on GitHub >

Top Results From Across the Web

Reading data from Amazon redshift in Spark 2.4
I checked online forums and got to know Spark 2.4 has added in built Avro source and this is the reason using databricks,...
Read more >
Query Amazon Redshift with Databricks
The Databricks Redshift data source uses Amazon S3 to efficiently transfer ... Read data from a table df = (spark.read .format("redshift") ...
Read more >
Redshift Data Source for Apache Spark
A library to load data into Spark SQL DataFrames from Amazon Redshift, and write them back to Redshift tables. Amazon S3 is used...
Read more >
spark-redshift
Redshift Data Source for Apache Spark ... A library to load data into Spark SQL DataFrames from Amazon Redshift, and write them back...
Read more >
New – Amazon Redshift Integration with Apache Spark
To get started, you can go to AWS analytics and ML services, use data frame or Spark SQL code in a Spark job...
Read more >
