
Failed to find data source: com.databricks.spark.redshift

See original GitHub issue

I’m attempting to use spark-redshift on my Mac, and I’m getting the error at the end of the post. I found #230, which is related but unresolved, so I’ve opened a new issue. Any idea what I’m doing wrong?

Here’s the call:

df = sqlcontext.read \
               .format('com.databricks.spark.redshift') \
               .option(URL_STR, REDSHIFT_JDBC_URL) \
               .option(QUERY_STR, query) \
               .option(TEMPDIR_STR, AWS_S3_BUCKET_TEMP_DIR) \
               .load()
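
URL_STR, QUERY_STR, and TEMPDIR_STR are constants defined elsewhere in the script. Assuming they hold the standard spark-redshift option keys ('url', 'query', 'tempdir'), the same read with literal keys would look roughly like this (the values below are placeholders, not the real ones):

# Minimal sketch with literal option keys; url, query, and tempdir values are placeholders.
df = sqlcontext.read \
               .format('com.databricks.spark.redshift') \
               .option('url', 'jdbc:redshift://host:5439/db?user=username&password=password') \
               .option('query', 'SELECT * FROM some_table') \
               .option('tempdir', 's3n://some-bucket/tmp/') \
               .load()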

Here is the spark-submit call:

spark-submit --packages com.databricks:spark-avro_2.11:3.0.0,com.databricks:spark-redshift_2.11:2.0.1,com.databricks:spark-csv_2.11:1.5.0 ...

And here’s my spark-defaults.conf:

spark.hadoop.fs.s3n.impl           org.apache.hadoop.fs.s3native.NativeS3FileSystem
spark.hadoop.fs.s3.impl            org.apache.hadoop.fs.s3a.S3AFileSystem
spark.driver.extraClassPath        /usr/local/opt/hadoop/libexec/share/hadoop/tools/lib/aws-java-sdk-1.7.4.jar:/usr/local/opt/hadoop/libexec/share/hadoop/tools/lib/hadoop-aws-2.7.2.jar:/usr/local/opt/redshift-jdbc/libexec/RedshiftJDBC4-1.1.7.1007.jar
spark.executor.extraClassPath      /usr/local/opt/hadoop/libexec/share/hadoop/tools/lib/aws-java-sdk-1.7.4.jar:/usr/local/opt/hadoop/libexec/share/hadoop/tools/lib/hadoop-aws-2.7.2.jar:/usr/local/opt/redshift-jdbc/libexec/RedshiftJDBC4-1.1.7.1007.jar

Here’s the error:

  File "/usr/local/opt/apache-spark/libexec/python/lib/pyspark.zip/pyspark/sql/readwriter.py", line 153, in load
  File "/usr/local/opt/apache-spark/libexec/python/lib/py4j-0.10.1-src.zip/py4j/java_gateway.py", line 933, in __call__
  File "/usr/local/opt/apache-spark/libexec/python/lib/pyspark.zip/pyspark/sql/utils.py", line 63, in deco
  File "/usr/local/opt/apache-spark/libexec/python/lib/py4j-0.10.1-src.zip/py4j/protocol.py", line 312, in get_return_value
py4j.protocol.Py4JJavaError: An error occurred while calling o38.load.
: java.lang.ClassNotFoundException: Failed to find data source: com.databricks.spark.redshift. Please find packages at http://spark-packages.org
    at org.apache.spark.sql.execution.datasources.DataSource.lookupDataSource(DataSource.scala:145)
    at org.apache.spark.sql.execution.datasources.DataSource.providingClass$lzycompute(DataSource.scala:78)
    at org.apache.spark.sql.execution.datasources.DataSource.providingClass(DataSource.scala:78)
    at org.apache.spark.sql.execution.datasources.DataSource.resolveRelation(DataSource.scala:310)
    at org.apache.spark.sql.DataFrameReader.load(DataFrameReader.scala:149)
    at org.apache.spark.sql.DataFrameReader.load(DataFrameReader.scala:122)
    at sun.reflect.NativeMethodAccessorImpl.invoke0(Native Method)
    at sun.reflect.NativeMethodAccessorImpl.invoke(NativeMethodAccessorImpl.java:62)
    at sun.reflect.DelegatingMethodAccessorImpl.invoke(DelegatingMethodAccessorImpl.java:43)
    at java.lang.reflect.Method.invoke(Method.java:498)
    at py4j.reflection.MethodInvoker.invoke(MethodInvoker.java:237)
    at py4j.reflection.ReflectionEngine.invoke(ReflectionEngine.java:357)
    at py4j.Gateway.invoke(Gateway.java:280)
    at py4j.commands.AbstractCommand.invokeMethod(AbstractCommand.java:128)
    at py4j.commands.CallCommand.execute(CallCommand.java:79)
    at py4j.GatewayConnection.run(GatewayConnection.java:211)
    at java.lang.Thread.run(Thread.java:745)
Caused by: java.lang.ClassNotFoundException: com.databricks.spark.redshift.DefaultSource
    at java.net.URLClassLoader.findClass(URLClassLoader.java:381)
    at java.lang.ClassLoader.loadClass(ClassLoader.java:424)
    at java.lang.ClassLoader.loadClass(ClassLoader.java:357)
    at org.apache.spark.sql.execution.datasources.DataSource$$anonfun$5$$anonfun$apply$1.apply(DataSource.scala:130)
    at org.apache.spark.sql.execution.datasources.DataSource$$anonfun$5$$anonfun$apply$1.apply(DataSource.scala:130)
    at scala.util.Try$.apply(Try.scala:192)
    at org.apache.spark.sql.execution.datasources.DataSource$$anonfun$5.apply(DataSource.scala:130)
    at org.apache.spark.sql.execution.datasources.DataSource$$anonfun$5.apply(DataSource.scala:130)
    at scala.util.Try.orElse(Try.scala:84)
    at org.apache.spark.sql.execution.datasources.DataSource.lookupDataSource(DataSource.scala:130)
    ... 16 more

Issue Analytics

  • State: closed
  • Created 7 years ago
  • Comments: 5 (5 by maintainers)

Top GitHub Comments

1 reaction
JoshRosen commented, Oct 19, 2016

Are you sure that spark-submit received the --packages argument? The following example didn’t error out in my environment:

# in test.py
from pyspark import SparkContext, SparkConf
from pyspark.sql import SQLContext

df = SQLContext.getOrCreate(SparkContext.getOrCreate(SparkConf().setMaster("local[*]"))).read \
               .format('com.databricks.spark.redshift') \
               .load()

./bin/spark-submit --packages com.databricks:spark-avro_2.11:3.0.0,com.databricks:spark-redshift_2.11:2.0.1,com.databricks:spark-csv_2.11:1.5.0 test.py

Make sure that --packages precedes the name of the main file / class, since any arguments after that point are passed to your program rather than to spark-submit.
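
Reusing the test.py example above, the ordering matters roughly like this:

# Correct: --packages comes before the application file, so spark-submit resolves the package.
./bin/spark-submit --packages com.databricks:spark-redshift_2.11:2.0.1 test.py

# Incorrect: everything after test.py is treated as an argument to test.py itself,
# so the package is never added and the data source class cannot be found.
./bin/spark-submit test.py --packages com.databricks:spark-redshift_2.11:2.0.1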

0 reactions
proinsias commented, Oct 21, 2016

Your example works for me using Spark 2.0.1, so I’m closing this issue. Thanks!


Top Results From Across the Web

  • Error while Connecting PySpark to AWS Redshift: I had to include 4 jar files in the EMR spark-submit options to get this working. List of jar files: 1. RedshiftJDBC41-1.2.12.1017.jar ...
  • Redshift Data Source for Apache Spark: A library to load data into Spark SQL DataFrames from Amazon Redshift, ... Get some data from a Redshift table val df: DataFrame ...
  • Query Amazon Redshift with Databricks: Once you have configured your AWS credentials, you can use the data source with the Spark data source API in Python, SQL, R, ...
  • Connecting to Redshift Data Source from Spark: Spark on Qubole supports the Spark Redshift connector, which is a library that lets you load data from Amazon Redshift tables into Spark ...
  • spark-redshift: Redshift Data Source for Apache Spark ... A library to load data into Spark SQL DataFrames from Amazon Redshift, and write them back ...
