Failed to find data source: com.databricks.spark.redshift
I’m attempting to use spark-redshift on my Mac, and I’m getting the error at the end of this post. I found #230, which is related but not resolved, so I’ve opened a new issue. Any idea what I’m doing wrong?
Here’s the call:
# URL_STR, QUERY_STR, TEMPDIR_STR: option-key constants defined elsewhere in the script
df = sqlcontext.read \
    .format('com.databricks.spark.redshift') \
    .option(URL_STR, REDSHIFT_JDBC_URL) \
    .option(QUERY_STR, query) \
    .option(TEMPDIR_STR, AWS_S3_BUCKET_TEMP_DIR) \
    .load()
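For context, the spark-redshift reader is configured with the option keys url, query, and tempdir, so the constants above presumably hold those strings. A self-contained sketch of the same call, with the keys written out literally and placeholder values for the JDBC URL, query, and S3 path:

# Hypothetical literal version of the call above; all values are placeholders.
df = sqlcontext.read \
    .format('com.databricks.spark.redshift') \
    .option('url', 'jdbc:redshift://example-cluster:5439/dev?user=...&password=...') \
    .option('query', 'SELECT * FROM my_table') \
    .option('tempdir', 's3n://my-bucket/tmp/') \
    .load()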
Here is the spark-submit call:
spark-submit --packages com.databricks:spark-avro_2.11:3.0.0,com.databricks:spark-redshift_2.11:2.0.1,com.databricks:spark-csv_2.11:1.5.0 ...
And here’s my spark-defaults.conf:
spark.hadoop.fs.s3n.impl org.apache.hadoop.fs.s3native.NativeS3FileSystem
spark.hadoop.fs.s3.impl org.apache.hadoop.fs.s3a.S3AFileSystem
spark.driver.extraClassPath /usr/local/opt/hadoop/libexec/share/hadoop/tools/lib/aws-java-sdk-1.7.4.jar:/usr/local/opt/hadoop/libexec/share/hadoop/tools/lib/hadoop-aws-2.7.2.jar:/usr/local/opt/redshift-jdbc/libexec/RedshiftJDBC4-1.1.7.1007.jar
spark.executor.extraClassPath /usr/local/opt/hadoop/libexec/share/hadoop/tools/lib/aws-java-sdk-1.7.4.jar:/usr/local/opt/hadoop/libexec/share/hadoop/tools/lib/hadoop-aws-2.7.2.jar:/usr/local/opt/redshift-jdbc/libexec/RedshiftJDBC4-1.1.7.1007.jar
Here’s the error:
File "/usr/local/opt/apache-spark/libexec/python/lib/pyspark.zip/pyspark/sql/readwriter.py", line 153, in load
File "/usr/local/opt/apache-spark/libexec/python/lib/py4j-0.10.1-src.zip/py4j/java_gateway.py", line 933, in __call__
File "/usr/local/opt/apache-spark/libexec/python/lib/pyspark.zip/pyspark/sql/utils.py", line 63, in deco
File "/usr/local/opt/apache-spark/libexec/python/lib/py4j-0.10.1-src.zip/py4j/protocol.py", line 312, in get_return_value
py4j.protocol.Py4JJavaError: An error occurred while calling o38.load.
: java.lang.ClassNotFoundException: Failed to find data source: com.databricks.spark.redshift. Please find packages at http://spark-packages.org
at org.apache.spark.sql.execution.datasources.DataSource.lookupDataSource(DataSource.scala:145)
at org.apache.spark.sql.execution.datasources.DataSource.providingClass$lzycompute(DataSource.scala:78)
at org.apache.spark.sql.execution.datasources.DataSource.providingClass(DataSource.scala:78)
at org.apache.spark.sql.execution.datasources.DataSource.resolveRelation(DataSource.scala:310)
at org.apache.spark.sql.DataFrameReader.load(DataFrameReader.scala:149)
at org.apache.spark.sql.DataFrameReader.load(DataFrameReader.scala:122)
at sun.reflect.NativeMethodAccessorImpl.invoke0(Native Method)
at sun.reflect.NativeMethodAccessorImpl.invoke(NativeMethodAccessorImpl.java:62)
at sun.reflect.DelegatingMethodAccessorImpl.invoke(DelegatingMethodAccessorImpl.java:43)
at java.lang.reflect.Method.invoke(Method.java:498)
at py4j.reflection.MethodInvoker.invoke(MethodInvoker.java:237)
at py4j.reflection.ReflectionEngine.invoke(ReflectionEngine.java:357)
at py4j.Gateway.invoke(Gateway.java:280)
at py4j.commands.AbstractCommand.invokeMethod(AbstractCommand.java:128)
at py4j.commands.CallCommand.execute(CallCommand.java:79)
at py4j.GatewayConnection.run(GatewayConnection.java:211)
at java.lang.Thread.run(Thread.java:745)
Caused by: java.lang.ClassNotFoundException: com.databricks.spark.redshift.DefaultSource
at java.net.URLClassLoader.findClass(URLClassLoader.java:381)
at java.lang.ClassLoader.loadClass(ClassLoader.java:424)
at java.lang.ClassLoader.loadClass(ClassLoader.java:357)
at org.apache.spark.sql.execution.datasources.DataSource$$anonfun$5$$anonfun$apply$1.apply(DataSource.scala:130)
at org.apache.spark.sql.execution.datasources.DataSource$$anonfun$5$$anonfun$apply$1.apply(DataSource.scala:130)
at scala.util.Try$.apply(Try.scala:192)
at org.apache.spark.sql.execution.datasources.DataSource$$anonfun$5.apply(DataSource.scala:130)
at org.apache.spark.sql.execution.datasources.DataSource$$anonfun$5.apply(DataSource.scala:130)
at scala.util.Try.orElse(Try.scala:84)
at org.apache.spark.sql.execution.datasources.DataSource.lookupDataSource(DataSource.scala:130)
... 16 more
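The root cause line shows that the JVM cannot load com.databricks.spark.redshift.DefaultSource, i.e. the spark-redshift jar never reached the driver classpath. One way to confirm this directly from PySpark — a sketch, assuming an active sqlcontext — is to ask the JVM for the class by name via the py4j gateway:

# Returns the class object if the connector jar is on the driver classpath;
# otherwise it raises the same ClassNotFoundException seen in the trace above.
sqlcontext._sc._jvm.java.lang.Class.forName('com.databricks.spark.redshift.DefaultSource')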
Top GitHub Comments
Are you sure that spark-submit received the --packages argument? The following example didn’t error out in my environment. Make sure that the --packages flag precedes the name of the main file / class, since arguments after that will be passed to your program and not to spark-submit.
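A minimal sketch of an invocation with that ordering (the script name and package list here are placeholders, not the original example):

# Correct: --packages comes before the application file, so spark-submit sees it.
spark-submit \
  --packages com.databricks:spark-redshift_2.11:2.0.1,com.databricks:spark-avro_2.11:3.0.0 \
  my_job.py

# Incorrect: anything after the application file is passed to the program itself,
# so spark-submit never resolves the package and the data source is not found.
spark-submit my_job.py --packages com.databricks:spark-redshift_2.11:2.0.1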
Your example works for me using Spark 2.0.1, so I’m closing this issue. Thanks!