PySpark on EMR not able to import spark-nlp
I am attempting to run this on an AWS EMR cluster with PySpark. It runs just fine in spark-shell, but I cannot import the package via pyspark.
Description
I run pyspark from the CLI via:
pyspark --packages JohnSnowLabs:spark-nlp:1.4.0
which gets the shell started with a few warning lines that are not usually present:
18/02/15 17:28:57 WARN Client: Same path resource file:/home/hadoop/.ivy2/jars/JohnSnowLabs_spark-nlp-1.4.0.jar added multiple times to distributed cache.
18/02/15 17:28:57 WARN Client: Same path resource file:/home/hadoop/.ivy2/jars/com.typesafe_config-1.3.0.jar added multiple times to distributed cache.
18/02/15 17:28:57 WARN Client: Same path resource file:/home/hadoop/.ivy2/jars/org.rocksdb_rocksdbjni-5.8.0.jar added multiple times to distributed cache.
18/02/15 17:28:57 WARN Client: Same path resource file:/home/hadoop/.ivy2/jars/org.slf4j_slf4j-api-1.7.25.jar added multiple times to distributed cache.
18/02/15 17:28:57 WARN Client: Same path resource file:/home/hadoop/.ivy2/jars/org.apache.commons_commons-compress-1.15.jar added multiple times to distributed cache.
18/02/15 17:28:57 WARN Client: Same path resource file:/home/hadoop/.ivy2/jars/org.objenesis_objenesis-2.6.jar added multiple times to distributed cache.
I immediately try to do from sparknlp.annotator import * and get:
Traceback (most recent call last):
File "<stdin>", line 1, in <module>
ImportError: No module named sparknlp.annotator
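As a quick sanity check from the same pyspark shell, one can look at whether anything from the --packages download ever landed on the driver's Python path (a diagnostic sketch, not part of the original report):
>>> import sys
>>> [p for p in sys.path if 'spark-nlp' in p.lower()]   # an empty list means the jar was never added to the Python path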
Possible Solution
More of a question, really: does your package work in Python 2 as well as Python 3? I didn't see that in the docs anywhere, but maybe I missed it.
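If the package's Python side turns out to require Python 3, one way to test would be to point pyspark at Python 3 before launching (a sketch, assuming python3 is installed on the master node; PYSPARK_PYTHON and PYSPARK_DRIVER_PYTHON are standard Spark environment variables):
$ export PYSPARK_PYTHON=python3
$ export PYSPARK_DRIVER_PYTHON=python3
$ pyspark --packages JohnSnowLabs:spark-nlp:1.4.0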
Context
AWS Spark EMR cluster which uses the Amazon Linux AMI.
Your Environment
AWS Spark EMR cluster which uses the Amazon Linux AMI. For versioning info, here is my full stream for when I start up pyspark with the package…
$ pyspark --packages JohnSnowLabs:spark-nlp:1.4.0
Python 2.7.13 (default, Jan 31 2018, 00:17:36)
[GCC 4.8.5 20150623 (Red Hat 4.8.5-11)] on linux2
Type "help", "copyright", "credits" or "license" for more information.
Ivy Default Cache set to: /home/hadoop/.ivy2/cache
The jars for the packages stored in: /home/hadoop/.ivy2/jars
:: loading settings :: url = jar:file:/usr/lib/spark/jars/ivy-2.4.0.jar!/org/apache/ivy/core/settings/ivysettings.xml
JohnSnowLabs#spark-nlp added as a dependency
:: resolving dependencies :: org.apache.spark#spark-submit-parent;1.0
confs: [default]
found JohnSnowLabs#spark-nlp;1.4.0 in spark-packages
found com.typesafe#config;1.3.0 in central
found org.rocksdb#rocksdbjni;5.8.0 in central
found org.slf4j#slf4j-api;1.7.25 in spark-list
found org.apache.commons#commons-compress;1.15 in central
found org.objenesis#objenesis;2.6 in central
:: resolution report :: resolve 282ms :: artifacts dl 5ms
:: modules in use:
JohnSnowLabs#spark-nlp;1.4.0 from spark-packages in [default]
com.typesafe#config;1.3.0 from central in [default]
org.apache.commons#commons-compress;1.15 from central in [default]
org.objenesis#objenesis;2.6 from central in [default]
org.rocksdb#rocksdbjni;5.8.0 from central in [default]
org.slf4j#slf4j-api;1.7.25 from spark-list in [default]
---------------------------------------------------------------------
| | modules || artifacts |
| conf | number| search|dwnlded|evicted|| number|dwnlded|
---------------------------------------------------------------------
| default | 6 | 0 | 0 | 0 || 6 | 0 |
---------------------------------------------------------------------
:: retrieving :: org.apache.spark#spark-submit-parent
confs: [default]
0 artifacts copied, 6 already retrieved (0kB/7ms)
Setting default log level to "WARN".
To adjust logging level use sc.setLogLevel(newLevel). For SparkR, use setLogLevel(newLevel).
18/02/15 17:41:39 WARN Client: Neither spark.yarn.jars nor spark.yarn.archive is set, falling back to uploading libraries under SPARK_HOME.
18/02/15 17:41:42 WARN Client: Same path resource file:/home/hadoop/.ivy2/jars/JohnSnowLabs_spark-nlp-1.4.0.jar added multiple times to distributed cache.
18/02/15 17:41:42 WARN Client: Same path resource file:/home/hadoop/.ivy2/jars/com.typesafe_config-1.3.0.jar added multiple times to distributed cache.
18/02/15 17:41:42 WARN Client: Same path resource file:/home/hadoop/.ivy2/jars/org.rocksdb_rocksdbjni-5.8.0.jar added multiple times to distributed cache.
18/02/15 17:41:42 WARN Client: Same path resource file:/home/hadoop/.ivy2/jars/org.slf4j_slf4j-api-1.7.25.jar added multiple times to distributed cache.
18/02/15 17:41:42 WARN Client: Same path resource file:/home/hadoop/.ivy2/jars/org.apache.commons_commons-compress-1.15.jar added multiple times to distributed cache.
18/02/15 17:41:42 WARN Client: Same path resource file:/home/hadoop/.ivy2/jars/org.objenesis_objenesis-2.6.jar added multiple times to distributed cache.
Welcome to
____ __
/ __/__ ___ _____/ /__
_\ \/ _ \/ _ `/ __/ '_/
/__ / .__/\_,_/_/ /_/\_\ version 2.2.0
/_/
Using Python version 2.7.13 (default, Jan 31 2018 00:17:36)
SparkSession available as 'spark'.
>>> from sparknlp.annotator import *
Traceback (most recent call last):
File "<stdin>", line 1, in <module>
ImportError: No module named sparknlp.annotator
>>>

For anyone still having an import problem on AWS EMR using Python 3: make sure to install spark-nlp with the correct version of pip (/usr/bin/pip-3.6 currently). Then it should work when specifying --packages as usual, without any PYTHONPATH tinkering.
EMR instructions: https://github.com/JohnSnowLabs/spark-nlp#emr-cluster
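Putting that together, the working sequence would look roughly like this (a sketch: /usr/bin/pip-3.6 comes from the comment above, while the PyPI package name spark-nlp and the PYSPARK_PYTHON export are assumptions):
$ sudo /usr/bin/pip-3.6 install spark-nlp
$ export PYSPARK_PYTHON=python3
$ pyspark --packages JohnSnowLabs:spark-nlp:1.4.0
>>> from sparknlp.annotator import *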