Spark NLP pipeline runs much faster on PySpark 3.0.x compared to PySpark 3.1.x
Description
I have a dataset of around 2 million Amazon reviews and I want to count the most frequent words. For that I am tokenizing the text and removing stop words. I wanted to use spark-nlp to build a more sophisticated pipeline for later stages, but even this simple one is not working for me, while an equivalent (?) pipeline in plain Spark works correctly. Note that when I call out.show() on the spark-nlp pipeline output, it shows correctly tokenized lists of words.
Expected Behavior
Pipeline should clean the dataset and count most frequent words
Current Behavior
Pipeline freezes
Possible Solution
No idea
Steps to Reproduce
Plain Spark pipeline (working)
import time

import pyspark
from pyspark.sql import SparkSession
from pyspark.sql.functions import explode, count, col
from pyspark.ml.feature import RegexTokenizer, StopWordsRemover
conf = pyspark.SparkConf().setMaster("local[*]").setAll([
    ('spark.executor.memory', '12g'),
    ('spark.driver.memory', '4g'),
    ('spark.driver.maxResultSize', '2G')
])
# create the session
spark = SparkSession.builder.config(conf=conf).getOrCreate()
# create the context
sc = spark.sparkContext
# FIX for Spark 2.x
locale = sc._jvm.java.util.Locale
locale.setDefault(locale.forLanguageTag("en-US"))
data_path = "../data/toys-cleaned.csv.gz"
Toys = spark.read.options(header=True, delimiter=',', inferSchema=True).csv(data_path)
# tokenize the text
regexTokenizer = RegexTokenizer(inputCol="reviewText", outputCol="all_words", pattern="\\W")
toys_with_words = regexTokenizer.transform(Toys)
# remove stop words
remover = StopWordsRemover(inputCol="all_words", outputCol="words")
toys_with_tokens = remover.transform(toys_with_words).drop("all_words")
toys_with_tokens.show(5)
# get all words in a single dataframe
start = time.time()
all_words = toys_with_tokens.select(explode("words").alias("word"))
# group by, sort and limit to 50k
top50k = all_words.groupBy("word").agg(count("*").alias("total")).sort(col("total").desc()).limit(50000)
top50k.show()
print(time.time() - start)
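One way to make the two timings more directly comparable might be to cache and materialize the tokenized output first, so the timed region covers only the word count and not the tokenization itself; a sketch using the same toys_with_tokens as above:
# Materialize the cleaned tokens once, so the timing below measures only the aggregation
toys_with_tokens = toys_with_tokens.cache()
toys_with_tokens.count()  # force the cache to be filled before timing

start = time.time()
all_words = toys_with_tokens.select(explode("words").alias("word"))
top50k = (all_words.groupBy("word")
          .agg(count("*").alias("total"))
          .sort(col("total").desc())
          .limit(50000))
top50k.show()
print(time.time() - start)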
Spark NLP pipeline (not working)
import sparknlp
from sparknlp.base import Finisher, DocumentAssembler
from sparknlp.annotator import (Tokenizer, Normalizer,
                                LemmatizerModel, StopWordsCleaner)
from pyspark.ml import Pipeline
# explode/count/col are needed again if this block is run as a separate script
from pyspark.sql.functions import explode, count, col

spark = sparknlp.start()
documentAssembler = DocumentAssembler() \
    .setInputCol('reviewText') \
    .setOutputCol('document')

tokenizer = Tokenizer() \
    .setInputCols(['document']) \
    .setOutputCol('token')

# note normalizer defaults to changing all words to lowercase.
# Use .setLowercase(False) to maintain input case.
normalizer = Normalizer() \
    .setInputCols(['token']) \
    .setOutputCol('normalized') \
    .setLowercase(True)
# note that the lemmatizer needs a dictionary, so I used the pre-trained
# model (it defaults to English)
# lemmatizer = LemmatizerModel.pretrained() \
#     .setInputCols(['normalized']) \
#     .setOutputCol('lemma')

stopwords_cleaner = StopWordsCleaner.pretrained("stopwords_en", "en") \
    .setInputCols(['normalized']) \
    .setOutputCol('clean') \
    .setCaseSensitive(False)
# finisher converts tokens to human-readable output
finisher = Finisher() \
    .setInputCols(['clean']) \
    .setCleanAnnotations(False)

pipeline = Pipeline() \
    .setStages([
        documentAssembler,
        tokenizer,
        normalizer,
        stopwords_cleaner,
        finisher
    ])
data_path = "../data/toys-cleaned.csv.gz"
Toys = spark.read.options(header=True, delimiter=',', inferSchema=True).csv(data_path)
out = pipeline.fit(Toys).transform(Toys)
# get all words in a single dataframe
import time
start = time.time()
all_words = out.select(explode("finished_clean").alias("word"))
# group by, sort and limit to 50k
top50k = all_words.groupBy("word").agg(count("*").alias("total")).sort(col("total").desc()).limit(50000)
top50k.show()
print(time.time() - start)
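One thing that may matter for the comparison: sparknlp.start() builds its own SparkSession with default settings, so the spark-nlp run above does not use the executor/driver memory configured for the plain-Spark run. A sketch of creating the session manually instead of calling sparknlp.start(), so both pipelines run under comparable settings (the package coordinates assume Spark NLP 3.0.1 on Scala 2.12; the Kryo settings follow the Spark NLP docs):
from pyspark.sql import SparkSession

spark = SparkSession.builder \
    .appName("Spark NLP") \
    .master("local[*]") \
    .config("spark.driver.memory", "4g") \
    .config("spark.executor.memory", "12g") \
    .config("spark.driver.maxResultSize", "2G") \
    .config("spark.serializer", "org.apache.spark.serializer.KryoSerializer") \
    .config("spark.kryoserializer.buffer.max", "2000M") \
    .config("spark.jars.packages", "com.johnsnowlabs.nlp:spark-nlp_2.12:3.0.1") \
    .getOrCreate()
With the session created this way, the rest of the spark-nlp pipeline above can be used unchanged.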
Context
Trying to clean data with spark-nlp and perform some analysis; at a later stage I would like to use spark-nlp to process the data for a classification task.
Your Environment
- Spark NLP version (sparknlp.version()): spark-nlp==3.0.1
- Apache Spark version (spark.version): pyspark==3.1.1
- Java version (java -version): openjdk version "1.8.0_282", OpenJDK Runtime Environment (build 1.8.0_282-8u282-b08-0ubuntu1~18.04-b08), OpenJDK 64-Bit Server VM (build 25.282-b08, mixed mode)
- Setup and installation (Pypi, Conda, Maven, etc.): virtualenv==16.3.0, virtualenv-clone==0.5.1, virtualenvwrapper==4.8.2
- Operating System and version: Ubuntu 18.04
- Link to your project (if any): Link to the dataset: https://drive.google.com/file/d/1ltokXXtsmXiBkUMGDyt9GAqnz6LBtx8s/view?usp=sharing
Hi @maziyarpanahi,
thanks so much for the tips and investigation!