Spark NLP pipeline runs much faster on PySpark 3.0.x compared to PySpark 3.1.x

See original GitHub issue

Description

I have a dataset of around 2 million Amazon reviews and I want to count the most frequent words. For that I am tokenizing the text and removing stop words. I wanted to use spark-nlp to build a more sophisticated pipeline for later stages, but even this simple one is not working for me. On the other hand, an equivalent (?) pipeline in plain Spark works correctly. Note that when I call out.show() on the spark-nlp pipeline output, it shows correctly tokenized lists of words.
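
For a quick sanity check that the spark-nlp output really is tokenized and cleaned, the Finisher column can be inspected directly (a minimal sketch; it assumes out is the transformed DataFrame produced by the spark-nlp pipeline under Steps to Reproduce below):

# inspect the schema and the Finisher output column of the spark-nlp pipeline
out.printSchema()
out.select("finished_clean").show(5, truncate=False)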

Expected Behavior

Pipeline should clean the dataset and count most frequent words

Current Behavior

Pipeline freezes

Possible Solution

No idea

Steps to Reproduce

Plain Spark pipeline (working)

import time

import pyspark
from pyspark.sql import SparkSession
from pyspark.sql.functions import explode, count, col
from pyspark.ml.feature import RegexTokenizer, StopWordsRemover

conf = pyspark.SparkConf().setMaster("local[*]").setAll([
                                   ('spark.executor.memory', '12g'),
                                   ('spark.driver.memory','4g'), 
                                   ('spark.driver.maxResultSize', '2G')
                                  ])
# create the session
spark = SparkSession.builder.config(conf=conf).getOrCreate()

# create the context
sc = spark.sparkContext

# FIX for Spark 2.x
locale = sc._jvm.java.util.Locale
locale.setDefault(locale.forLanguageTag("en-US"))

data_path = "../data/toys-cleaned.csv.gz"
Toys = spark.read.options(header=True, delimiter=',', inferSchema=True).csv(data_path)

# tokenize the text
regexTokenizer = RegexTokenizer(inputCol="reviewText", outputCol="all_words", pattern="\\W")
toys_with_words = regexTokenizer.transform(Toys)

# remove stop words
remover = StopWordsRemover(inputCol="all_words", outputCol="words")
toys_with_tokens = remover.transform(toys_with_words).drop("all_words")

toys_with_tokens.show(5)

# get all words in a single dataframe
start = time.time()
all_words = toys_with_tokens.select(explode("words").alias("word"))
# group by, sort and limit to 50k 
top50k = all_words.groupBy("word").agg(count("*").alias("total")).sort(col("total").desc()).limit(50000)

top50k.show()
print(time.time() - start)
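
A note on the timing: show() only displays the first rows, so if the goal is to time the whole job it may be safer to force a full materialization before stopping the timer (a small sketch, not part of the original report; top50k is the DataFrame built above):

# force the complete aggregation to run, then stop the timer
start = time.time()
top_words = top50k.collect()
print(len(top_words), "distinct words,", time.time() - start, "seconds")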

Spark NLP pipeline (not working)

import sparknlp
from sparknlp.base import Finisher, DocumentAssembler
from sparknlp.annotator import (Tokenizer, Normalizer,
                                LemmatizerModel, StopWordsCleaner)
from pyspark.ml import Pipeline
from pyspark.sql.functions import explode, count, col

spark = sparknlp.start()

documentAssembler = DocumentAssembler() \
     .setInputCol('reviewText') \
     .setOutputCol('document')
tokenizer = Tokenizer() \
     .setInputCols(['document']) \
     .setOutputCol('token')
# note normalizer defaults to changing all words to lowercase.
# Use .setLowercase(False) to maintain input case.
normalizer = Normalizer() \
     .setInputCols(['token']) \
     .setOutputCol('normalized') \
     .setLowercase(True)
# note that the lemmatizer needs a dictionary, so I used the pre-trained
# model (note that it defaults to English)
# lemmatizer = LemmatizerModel.pretrained() \
#      .setInputCols(['normalized']) \
#      .setOutputCol('lemma')
stopwords_cleaner = StopWordsCleaner.pretrained("stopwords_en", "en") \
     .setInputCols(['normalized']) \
     .setOutputCol('clean') \
     .setCaseSensitive(False)

# finisher converts tokens to human-readable output
finisher = Finisher() \
     .setInputCols(['clean']) \
     .setCleanAnnotations(False)

pipeline = Pipeline() \
     .setStages([
           documentAssembler,
           tokenizer,
           normalizer,
           stopwords_cleaner,
           finisher
     ])

data_path = "../data/toys-cleaned.csv.gz"
Toys = spark.read.options(header=True, delimiter=',', inferSchema=True).csv(data_path)

out = pipeline.fit(Toys).transform(Toys)

# get all words in a single dataframe
import time
start = time.time()
all_words = out.select(explode("finished_clean").alias("word"))
# group by, sort and limit to 50k 
top50k = all_words.groupBy("word").agg(count("*").alias("total")).sort(col("total").desc()).limit(50000)

top50k.show()
print(time.time() - start)
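
One difference between the two scripts worth ruling out: the plain Spark session is created with explicit executor/driver memory, while sparknlp.start() uses its own defaults. A hedged sketch of starting Spark NLP through a manually configured SparkSession instead, so both runs use the same resources (the Kryo settings follow the Spark NLP documentation for manual session creation, the appName is arbitrary, and com.johnsnowlabs.nlp:spark-nlp_2.12:3.0.1 matches the version listed in the environment below; adjust as needed):

from pyspark.sql import SparkSession

# build the session by hand, mirroring the memory settings of the plain Spark pipeline
spark = SparkSession.builder \
    .appName("spark-nlp-top-words") \
    .master("local[*]") \
    .config("spark.executor.memory", "12g") \
    .config("spark.driver.memory", "4g") \
    .config("spark.driver.maxResultSize", "2G") \
    .config("spark.serializer", "org.apache.spark.serializer.KryoSerializer") \
    .config("spark.kryoserializer.buffer.max", "2000M") \
    .config("spark.jars.packages", "com.johnsnowlabs.nlp:spark-nlp_2.12:3.0.1") \
    .getOrCreate()

If the slowdown disappears with an identically configured session, that would point at the default configuration rather than at PySpark 3.1.x itself.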

Context

I am trying to clean the data with spark-nlp and perform some analysis; at a later stage I would like to use spark-nlp to process the data for a classification task.

Your Environment

  • Spark NLP version (sparknlp.version()): spark-nlp==3.0.1
  • Apache Spark version (spark.version): pyspark==3.1.1
  • Java version (java -version): openjdk version "1.8.0_282"; OpenJDK Runtime Environment (build 1.8.0_282-8u282-b08-0ubuntu1~18.04-b08); OpenJDK 64-Bit Server VM (build 25.282-b08, mixed mode)
  • Setup and installation (PyPI, Conda, Maven, etc.): virtualenv==16.3.0, virtualenv-clone==0.5.1, virtualenvwrapper==4.8.2
  • Operating System and version: Ubuntu 18.04
  • Link to your project (if any): dataset available at https://drive.google.com/file/d/1ltokXXtsmXiBkUMGDyt9GAqnz6LBtx8s/view?usp=sharing

Issue Analytics

  • State: closed
  • Created: 2 years ago
  • Comments: 5 (2 by maintainers)

Top GitHub Comments

1 reaction
jczestochowska commented, Apr 8, 2021

Hi @maziyarpanahi,

thanks so much for the tips and investigation!

0 reactions
github-actions[bot] commented, Feb 17, 2022

This issue is stale because it has been open 120 days with no activity. Remove the stale label or comment, or this will be closed in 5 days.
