Spark NLP pipeline runs much faster on PySpark 3.0.x compared to PySpark 3.1.x
Description
I have a dataset of around 2 million Amazon reviews and I want to count the most frequent words. For that I am tokenizing the text and removing stop words. I wanted to use spark-nlp to build a more sophisticated pipeline for later stages, but even this simple one is not working for me, while an equivalent (?) pipeline in plain Spark works correctly. Note that when I call out.show() on the spark-nlp pipeline output, it shows correctly tokenized lists of words.
Expected Behavior
Pipeline should clean the dataset and count most frequent words
Current Behavior
Pipeline freezes
Possible Solution
No idea
Steps to Reproduce
Plain Spark pipeline (working)
import time

import pyspark
from pyspark.sql import SparkSession
from pyspark.sql.functions import explode, count, col
from pyspark.ml.feature import RegexTokenizer, StopWordsRemover
conf = pyspark.SparkConf().setMaster("local[*]").setAll([
    ('spark.executor.memory', '12g'),
    ('spark.driver.memory', '4g'),
    ('spark.driver.maxResultSize', '2G')
])
# create the session
spark = SparkSession.builder.config(conf=conf).getOrCreate()
# create the context
sc = spark.sparkContext
# FIX for Spark 2.x
locale = sc._jvm.java.util.Locale
locale.setDefault(locale.forLanguageTag("en-US"))
data_path = "../data/toys-cleaned.csv.gz"
Toys = spark.read.options(header=True, delimiter=',', inferSchema=True).csv(data_path)
# tokenize the text
regexTokenizer = RegexTokenizer(inputCol="reviewText", outputCol="all_words", pattern="\\W")
toys_with_words = regexTokenizer.transform(Toys)
# remove stop words
remover = StopWordsRemover(inputCol="all_words", outputCol="words")
toys_with_tokens = remover.transform(toys_with_words).drop("all_words")
toys_with_tokens.show(5)
# get all words in a single dataframe
start = time.time()
all_words = toys_with_tokens.select(explode("words").alias("word"))
# group by, sort and limit to 50k
top50k = all_words.groupBy("word").agg(count("*").alias("total")).sort(col("total").desc()).limit(50000)
top50k.show()
print(time.time() - start)
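One way to make the two timings more directly comparable might be to cache and materialize the tokenized output first, so the timed region covers only the word count and not the tokenization itself; a sketch using the same toys_with_tokens as above:
# Materialize the cleaned tokens once, so the timing below measures only the aggregation
toys_with_tokens = toys_with_tokens.cache()
toys_with_tokens.count()  # force the cache to be filled before timing

start = time.time()
all_words = toys_with_tokens.select(explode("words").alias("word"))
top50k = (all_words.groupBy("word")
          .agg(count("*").alias("total"))
          .sort(col("total").desc())
          .limit(50000))
top50k.show()
print(time.time() - start)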
Spark NLP pipeline (not working)
import sparknlp
from sparknlp.base import Finisher, DocumentAssembler
from sparknlp.annotator import (Tokenizer, Normalizer,
                                LemmatizerModel, StopWordsCleaner)
from pyspark.ml import Pipeline
# explode/count/col are needed again if this block is run as a separate script
from pyspark.sql.functions import explode, count, col

spark = sparknlp.start()
documentAssembler = DocumentAssembler() \
    .setInputCol('reviewText') \
    .setOutputCol('document')

tokenizer = Tokenizer() \
    .setInputCols(['document']) \
    .setOutputCol('token')

# note normalizer defaults to changing all words to lowercase.
# Use .setLowercase(False) to maintain input case.
normalizer = Normalizer() \
    .setInputCols(['token']) \
    .setOutputCol('normalized') \
    .setLowercase(True)
# note that the lemmatizer needs a dictionary, so I used the pre-trained
# model (it defaults to English)
# lemmatizer = LemmatizerModel.pretrained() \
#     .setInputCols(['normalized']) \
#     .setOutputCol('lemma')

stopwords_cleaner = StopWordsCleaner.pretrained("stopwords_en", "en") \
    .setInputCols(['normalized']) \
    .setOutputCol('clean') \
    .setCaseSensitive(False)
# finisher converts tokens to human-readable output
finisher = Finisher() \
    .setInputCols(['clean']) \
    .setCleanAnnotations(False)

pipeline = Pipeline() \
    .setStages([
        documentAssembler,
        tokenizer,
        normalizer,
        stopwords_cleaner,
        finisher
    ])
data_path = "../data/toys-cleaned.csv.gz"
Toys = spark.read.options(header=True, delimiter=',', inferSchema=True).csv(data_path)
out = pipeline.fit(Toys).transform(Toys)
# get all words in a single dataframe
import time
start = time.time()
all_words = out.select(explode("finished_clean").alias("word"))
# group by, sort and limit to 50k
top50k = all_words.groupBy("word").agg(count("*").alias("total")).sort(col("total").desc()).limit(50000)
top50k.show()
print(time.time() - start)
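One thing that may matter for the comparison: sparknlp.start() builds its own SparkSession with default settings, so the spark-nlp run above does not use the executor/driver memory configured for the plain-Spark run. A sketch of creating the session manually instead of calling sparknlp.start(), so both pipelines run under comparable settings (the package coordinates assume Spark NLP 3.0.1 on Scala 2.12; the Kryo settings follow the Spark NLP docs):
from pyspark.sql import SparkSession

spark = SparkSession.builder \
    .appName("Spark NLP") \
    .master("local[*]") \
    .config("spark.driver.memory", "4g") \
    .config("spark.executor.memory", "12g") \
    .config("spark.driver.maxResultSize", "2G") \
    .config("spark.serializer", "org.apache.spark.serializer.KryoSerializer") \
    .config("spark.kryoserializer.buffer.max", "2000M") \
    .config("spark.jars.packages", "com.johnsnowlabs.nlp:spark-nlp_2.12:3.0.1") \
    .getOrCreate()
With the session created this way, the rest of the spark-nlp pipeline above can be used unchanged.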
Context
Trying to clean data with spark-nlp and perform some analysis; at a later stage I would like to use spark-nlp to process the data for a classification task.
Your Environment
- Spark NLP version (sparknlp.version()): spark-nlp==3.0.1
- Apache Spark version (spark.version): pyspark==3.1.1
- Java version (java -version): openjdk version "1.8.0_282", OpenJDK Runtime Environment (build 1.8.0_282-8u282-b08-0ubuntu1~18.04-b08), OpenJDK 64-Bit Server VM (build 25.282-b08, mixed mode)
- Setup and installation (Pypi, Conda, Maven, etc.): virtualenv==16.3.0, virtualenv-clone==0.5.1, virtualenvwrapper==4.8.2
- Operating System and version: Ubuntu 18.04
- Link to your project (if any): Link to the dataset: https://drive.google.com/file/d/1ltokXXtsmXiBkUMGDyt9GAqnz6LBtx8s/view?usp=sharing
Hi @maziyarpanahi,
thanks so much for the tips and investigation!