
pyspark on EMR not able to import


I am attempting to run this on an AWS EMR cluster with PySpark. It runs just fine in spark-shell, but I cannot import the package via pyspark.

Description

I run pyspark from the CLI via:

pyspark --packages JohnSnowLabs:spark-nlp:1.4.0

which gets the shell started with a few warning lines that are not usually present:

18/02/15 17:28:57 WARN Client: Same path resource file:/home/hadoop/.ivy2/jars/JohnSnowLabs_spark-nlp-1.4.0.jar added multiple times to distributed cache.
18/02/15 17:28:57 WARN Client: Same path resource file:/home/hadoop/.ivy2/jars/com.typesafe_config-1.3.0.jar added multiple times to distributed cache.
18/02/15 17:28:57 WARN Client: Same path resource file:/home/hadoop/.ivy2/jars/org.rocksdb_rocksdbjni-5.8.0.jar added multiple times to distributed cache.
18/02/15 17:28:57 WARN Client: Same path resource file:/home/hadoop/.ivy2/jars/org.slf4j_slf4j-api-1.7.25.jar added multiple times to distributed cache.
18/02/15 17:28:57 WARN Client: Same path resource file:/home/hadoop/.ivy2/jars/org.apache.commons_commons-compress-1.15.jar added multiple times to distributed cache.
18/02/15 17:28:57 WARN Client: Same path resource file:/home/hadoop/.ivy2/jars/org.objenesis_objenesis-2.6.jar added multiple times to distributed cache.

I immediately try to do from sparknlp.annotator import * and get:

Traceback (most recent call last):
  File "<stdin>", line 1, in <module>
ImportError: No module named sparknlp.annotator
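
A likely cause here: --packages puts the spark-nlp jar on the JVM classpath (which is why spark-shell works), but the Python interpreter also needs to find the sparknlp module. A minimal diagnostic to run inside the failing pyspark shell; the sys.path workaround is a sketch that assumes, as is typical for spark-packages builds, that the jar ships its Python sources at the jar root:

import sys

# When the import fails, nothing spark-nlp-related is usually on the path:
print([p for p in sys.path if 'spark-nlp' in p.lower()])

# Workaround sketch (assumption): the jar is zip-importable, so it can be
# placed on sys.path directly; the path comes from the warnings above.
sys.path.insert(0, '/home/hadoop/.ivy2/jars/JohnSnowLabs_spark-nlp-1.4.0.jar')
from sparknlp.annotator import *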

Possible Solution

More of a question, really: does your package work in Python 2 as well as Python 3? I didn't see that in the docs anywhere, but maybe I missed it.
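
One quick way to rule the interpreter version in or out is to point pyspark at Python 3 before launching and retry the import. A sketch, assuming EMR provides Python 3 at /usr/bin/python3 (the exact path varies by AMI):

export PYSPARK_PYTHON=/usr/bin/python3
export PYSPARK_DRIVER_PYTHON=/usr/bin/python3
pyspark --packages JohnSnowLabs:spark-nlp:1.4.0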

Context

AWS EMR Spark cluster running the Amazon Linux AMI.

Your Environment

AWS EMR Spark cluster running the Amazon Linux AMI. For versioning info, here is the full console output from starting pyspark with the package:

$ pyspark --packages JohnSnowLabs:spark-nlp:1.4.0
Python 2.7.13 (default, Jan 31 2018, 00:17:36) 
[GCC 4.8.5 20150623 (Red Hat 4.8.5-11)] on linux2
Type "help", "copyright", "credits" or "license" for more information.
Ivy Default Cache set to: /home/hadoop/.ivy2/cache
The jars for the packages stored in: /home/hadoop/.ivy2/jars
:: loading settings :: url = jar:file:/usr/lib/spark/jars/ivy-2.4.0.jar!/org/apache/ivy/core/settings/ivysettings.xml
JohnSnowLabs#spark-nlp added as a dependency
:: resolving dependencies :: org.apache.spark#spark-submit-parent;1.0
	confs: [default]
	found JohnSnowLabs#spark-nlp;1.4.0 in spark-packages
	found com.typesafe#config;1.3.0 in central
	found org.rocksdb#rocksdbjni;5.8.0 in central
	found org.slf4j#slf4j-api;1.7.25 in spark-list
	found org.apache.commons#commons-compress;1.15 in central
	found org.objenesis#objenesis;2.6 in central
:: resolution report :: resolve 282ms :: artifacts dl 5ms
	:: modules in use:
	JohnSnowLabs#spark-nlp;1.4.0 from spark-packages in [default]
	com.typesafe#config;1.3.0 from central in [default]
	org.apache.commons#commons-compress;1.15 from central in [default]
	org.objenesis#objenesis;2.6 from central in [default]
	org.rocksdb#rocksdbjni;5.8.0 from central in [default]
	org.slf4j#slf4j-api;1.7.25 from spark-list in [default]
	---------------------------------------------------------------------
	|                  |            modules            ||   artifacts   |
	|       conf       | number| search|dwnlded|evicted|| number|dwnlded|
	---------------------------------------------------------------------
	|      default     |   6   |   0   |   0   |   0   ||   6   |   0   |
	---------------------------------------------------------------------
:: retrieving :: org.apache.spark#spark-submit-parent
	confs: [default]
	0 artifacts copied, 6 already retrieved (0kB/7ms)
Setting default log level to "WARN".
To adjust logging level use sc.setLogLevel(newLevel). For SparkR, use setLogLevel(newLevel).
18/02/15 17:41:39 WARN Client: Neither spark.yarn.jars nor spark.yarn.archive is set, falling back to uploading libraries under SPARK_HOME.
18/02/15 17:41:42 WARN Client: Same path resource file:/home/hadoop/.ivy2/jars/JohnSnowLabs_spark-nlp-1.4.0.jar added multiple times to distributed cache.
18/02/15 17:41:42 WARN Client: Same path resource file:/home/hadoop/.ivy2/jars/com.typesafe_config-1.3.0.jar added multiple times to distributed cache.
18/02/15 17:41:42 WARN Client: Same path resource file:/home/hadoop/.ivy2/jars/org.rocksdb_rocksdbjni-5.8.0.jar added multiple times to distributed cache.
18/02/15 17:41:42 WARN Client: Same path resource file:/home/hadoop/.ivy2/jars/org.slf4j_slf4j-api-1.7.25.jar added multiple times to distributed cache.
18/02/15 17:41:42 WARN Client: Same path resource file:/home/hadoop/.ivy2/jars/org.apache.commons_commons-compress-1.15.jar added multiple times to distributed cache.
18/02/15 17:41:42 WARN Client: Same path resource file:/home/hadoop/.ivy2/jars/org.objenesis_objenesis-2.6.jar added multiple times to distributed cache.
Welcome to
      ____              __
     / __/__  ___ _____/ /__
    _\ \/ _ \/ _ `/ __/  '_/
   /__ / .__/\_,_/_/ /_/\_\   version 2.2.0
      /_/

Using Python version 2.7.13 (default, Jan 31 2018 00:17:36)
SparkSession available as 'spark'.
>>> from sparknlp.annotator import *
Traceback (most recent call last):
  File "<stdin>", line 1, in <module>
ImportError: No module named sparknlp.annotator
>>> 

Issue Analytics

  • State: closed
  • Created: 6 years ago
  • Comments: 9 (5 by maintainers)

Top GitHub Comments

3 reactions
bkestelman commented, Apr 4, 2020

For anyone still having an import problem on AWS EMR with Python 3: make sure to install sparknlp with the correct version of pip (/usr/bin/pip-3.6 currently). Then it should work with --packages as usual, without any PYTHONPATH tinkering.
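
A sketch of that fix; the pip path is taken from the comment above, while the PyPI package name (spark-nlp) and the reuse of the issue's original --packages coordinates are assumptions that may need updating for newer releases:

sudo /usr/bin/pip-3.6 install spark-nlp   # installs the sparknlp Python module
pyspark --packages JohnSnowLabs:spark-nlp:1.4.0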

0 reactions
maziyarpanahi commented, Oct 26, 2021

