
Failed to find data source: com.databricks.spark.redshift

See original GitHub issue

I’m attempting to use spark-redshift on my Mac, and I’m getting the error at the end of the post. I found #230, which is related but unresolved, so I’ve opened a new issue. Any idea what I’m doing wrong?

Here’s the call:

df = sqlcontext.read \
               .format('com.databricks.spark.redshift') \
               .option(URL_STR, REDSHIFT_JDBC_URL) \
               .option(QUERY_STR, query) \
               .option(TEMPDIR_STR, AWS_S3_BUCKET_TEMP_DIR) \
               .load()
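
URL_STR, QUERY_STR, and TEMPDIR_STR are constants defined elsewhere in the script. Assuming they hold the standard spark-redshift option keys ('url', 'query', 'tempdir'), the same read with literal keys would look roughly like this (the values below are placeholders, not the real ones):

# Minimal sketch with literal option keys; url, query, and tempdir values are placeholders.
df = sqlcontext.read \
               .format('com.databricks.spark.redshift') \
               .option('url', 'jdbc:redshift://host:5439/db?user=username&password=password') \
               .option('query', 'SELECT * FROM some_table') \
               .option('tempdir', 's3n://some-bucket/tmp/') \
               .load()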

Here is the spark-submit call:

spark-submit --packages com.databricks:spark-avro_2.11:3.0.0,com.databricks:spark-redshift_2.11:2.0.1,com.databricks:spark-csv_2.11:1.5.0 ...

And here’s my spark-defaults.conf:

spark.hadoop.fs.s3n.impl           org.apache.hadoop.fs.s3native.NativeS3FileSystem
spark.hadoop.fs.s3.impl            org.apache.hadoop.fs.s3a.S3AFileSystem
spark.driver.extraClassPath        /usr/local/opt/hadoop/libexec/share/hadoop/tools/lib/aws-java-sdk-1.7.4.jar:/usr/local/opt/hadoop/libexec/share/hadoop/tools/lib/hadoop-aws-2.7.2.jar:/usr/local/opt/redshift-jdbc/libexec/RedshiftJDBC4-1.1.7.1007.jar
spark.executor.extraClassPath      /usr/local/opt/hadoop/libexec/share/hadoop/tools/lib/aws-java-sdk-1.7.4.jar:/usr/local/opt/hadoop/libexec/share/hadoop/tools/lib/hadoop-aws-2.7.2.jar:/usr/local/opt/redshift-jdbc/libexec/RedshiftJDBC4-1.1.7.1007.jar

Here’s the error:

  File "/usr/local/opt/apache-spark/libexec/python/lib/pyspark.zip/pyspark/sql/readwriter.py", line 153, in load
  File "/usr/local/opt/apache-spark/libexec/python/lib/py4j-0.10.1-src.zip/py4j/java_gateway.py", line 933, in __call__
  File "/usr/local/opt/apache-spark/libexec/python/lib/pyspark.zip/pyspark/sql/utils.py", line 63, in deco
  File "/usr/local/opt/apache-spark/libexec/python/lib/py4j-0.10.1-src.zip/py4j/protocol.py", line 312, in get_return_value
py4j.protocol.Py4JJavaError: An error occurred while calling o38.load.
: java.lang.ClassNotFoundException: Failed to find data source: com.databricks.spark.redshift. Please find packages at http://spark-packages.org
    at org.apache.spark.sql.execution.datasources.DataSource.lookupDataSource(DataSource.scala:145)
    at org.apache.spark.sql.execution.datasources.DataSource.providingClass$lzycompute(DataSource.scala:78)
    at org.apache.spark.sql.execution.datasources.DataSource.providingClass(DataSource.scala:78)
    at org.apache.spark.sql.execution.datasources.DataSource.resolveRelation(DataSource.scala:310)
    at org.apache.spark.sql.DataFrameReader.load(DataFrameReader.scala:149)
    at org.apache.spark.sql.DataFrameReader.load(DataFrameReader.scala:122)
    at sun.reflect.NativeMethodAccessorImpl.invoke0(Native Method)
    at sun.reflect.NativeMethodAccessorImpl.invoke(NativeMethodAccessorImpl.java:62)
    at sun.reflect.DelegatingMethodAccessorImpl.invoke(DelegatingMethodAccessorImpl.java:43)
    at java.lang.reflect.Method.invoke(Method.java:498)
    at py4j.reflection.MethodInvoker.invoke(MethodInvoker.java:237)
    at py4j.reflection.ReflectionEngine.invoke(ReflectionEngine.java:357)
    at py4j.Gateway.invoke(Gateway.java:280)
    at py4j.commands.AbstractCommand.invokeMethod(AbstractCommand.java:128)
    at py4j.commands.CallCommand.execute(CallCommand.java:79)
    at py4j.GatewayConnection.run(GatewayConnection.java:211)
    at java.lang.Thread.run(Thread.java:745)
Caused by: java.lang.ClassNotFoundException: com.databricks.spark.redshift.DefaultSource
    at java.net.URLClassLoader.findClass(URLClassLoader.java:381)
    at java.lang.ClassLoader.loadClass(ClassLoader.java:424)
    at java.lang.ClassLoader.loadClass(ClassLoader.java:357)
    at org.apache.spark.sql.execution.datasources.DataSource$$anonfun$5$$anonfun$apply$1.apply(DataSource.scala:130)
    at org.apache.spark.sql.execution.datasources.DataSource$$anonfun$5$$anonfun$apply$1.apply(DataSource.scala:130)
    at scala.util.Try$.apply(Try.scala:192)
    at org.apache.spark.sql.execution.datasources.DataSource$$anonfun$5.apply(DataSource.scala:130)
    at org.apache.spark.sql.execution.datasources.DataSource$$anonfun$5.apply(DataSource.scala:130)
    at scala.util.Try.orElse(Try.scala:84)
    at org.apache.spark.sql.execution.datasources.DataSource.lookupDataSource(DataSource.scala:130)
    ... 16 more

Issue Analytics

  • State: closed
  • Created 7 years ago
  • Comments: 5 (5 by maintainers)

Top GitHub Comments

1 reaction
JoshRosen commented, Oct 19, 2016

Are you sure that spark-submit received the --packages argument? The following example didn’t error out in my environment:

# in test.py
from pyspark import SparkContext, SparkConf
from pyspark.sql import SQLContext

df = SQLContext.getOrCreate(SparkContext.getOrCreate(SparkConf().setMaster("local[*]"))).read \
               .format('com.databricks.spark.redshift') \
               .load()

./bin/spark-submit --packages com.databricks:spark-avro_2.11:3.0.0,com.databricks:spark-redshift_2.11:2.0.1,com.databricks:spark-csv_2.11:1.5.0 test.py

Make sure that --packages precedes the name of the main file / class, since any arguments after that point are passed to your program rather than to spark-submit.
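
Reusing the test.py example above, the ordering matters roughly like this:

# Correct: --packages comes before the application file, so spark-submit resolves the package.
./bin/spark-submit --packages com.databricks:spark-redshift_2.11:2.0.1 test.py

# Incorrect: everything after test.py is treated as an argument to test.py itself,
# so the package is never added and the data source class cannot be found.
./bin/spark-submit test.py --packages com.databricks:spark-redshift_2.11:2.0.1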

0 reactions
proinsias commented, Oct 21, 2016

Your example works for me using Spark 2.0.1, so I’m closing this issue. Thanks!


Top Results From Across the Web

  • Error while Connecting PySpark to AWS Redshift: I had to include 4 jar files in the EMR spark-submit options to get this working. List of jar files: 1. RedshiftJDBC41-1.2.12.1017.jar ...
  • Redshift Data Source for Apache Spark: A library to load data into Spark SQL DataFrames from Amazon Redshift, ... Get some data from a Redshift table val df: DataFrame ...
  • Query Amazon Redshift with Databricks: Once you have configured your AWS credentials, you can use the data source with the Spark data source API in Python, SQL, R, ...
  • Connecting to Redshift Data Source from Spark: Spark on Qubole supports the Spark Redshift connector, which is a library that lets you load data from Amazon Redshift tables into Spark ...
  • spark-redshift: Redshift Data Source for Apache Spark ... A library to load data into Spark SQL DataFrames from Amazon Redshift, and write them back ...
