Coverage of pyspark user defined function
See original GitHub issue. Originally reported by Abdeali Kothari (Bitbucket: AbdealiJK, GitHub: AbdealiJK).
I have some PySpark code in my code base and I am trying to test it. While doing that, I find that any Python UDF I register with Spark does not get covered, even though it does run. Note that I am running Spark in local mode.
Reproducible example:
#!python
def get_new_col(spark, df):
    def myadd(x, y):
        import sys, os
        print("sys.version_info =", sys.version_info)
        print({k: v for k, v in os.environ.items() if k.lower().startswith('cov')})
        x1 = x
        y1 = y
        return str(float(x1) + float(y1))
    spark.udf.register('myadd', myadd)
    return df.selectExpr(['*', 'myadd(x, y) as newcol'])

def run():
    try:
        import findspark
        findspark.init()
    except ImportError:
        pass
    import pyspark
    spark = pyspark.sql.SparkSession.Builder().master("local[2]").getOrCreate()
    df = spark.createDataFrame([
        [1.0, 1.0],
        [1.0, 2.0],
        [1.0, 2.0]
    ], ['x', 'y'])
    outdf = get_new_col(spark, df)
    outdf.show()
    outdf.printSchema()
    assert outdf.columns == (df.columns + ['newcol'])
    spark.stop()

if __name__ == '__main__':
    run()
Coverage reports the UDF body as not covered, even though it clearly ran.
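(To double-check that the function really executes, and where, one could print the process id from inside the UDF and from the driver; this is just an illustrative tweak to the repro above, not part of the original report.)

#!python
import os

def myadd(x, y):
    # Executed inside a Python worker process launched for Spark,
    # not in the driver process that `coverage run` started.
    print("worker pid:", os.getpid())
    return str(float(x) + float(y))

print("driver pid:", os.getpid())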
Here are the logs when I run it:
#!python
$ coverage run example.py
2018-05-04 14:58:29 WARN NativeCodeLoader:62 - Unable to load native-hadoop library for your platform... using builtin-java classes where applicable
Setting default log level to "WARN".
To adjust logging level use sc.setLogLevel(newLevel). For SparkR, use setLogLevel(newLevel).
2018-05-04 14:58:30 WARN Utils:66 - Service 'SparkUI' could not bind on port 4040. Attempting port 4041.
[Stage 0:> (0 + 1) / 1]sys.version_info = sys.version_info(major=3, minor=6, micro=4, releaselevel='final', serial=0)
{'COVERAGE_PROCESS_START': ''}
sys.version_info = sys.version_info(major=3, minor=6, micro=4, releaselevel='final', serial=0)
{'COVERAGE_PROCESS_START': ''}
sys.version_info = sys.version_info(major=3, minor=6, micro=4, releaselevel='final', serial=0)
{'COVERAGE_PROCESS_START': ''}
+---+---+------+
| x| y|newcol|
+---+---+------+
|1.0|1.0| 2.0|
|1.0|2.0| 3.0|
|1.0|2.0| 3.0|
+---+---+------+
root
|-- x: double (nullable = true)
|-- y: double (nullable = true)
|-- newcol: string (nullable = true)
Relevant packages: Python 3.6.4 (Anaconda), coverage 4.5.1.
Edit 1: Simplified the reproducible example to remove unittest and pytest.
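One detail visible in the logs: the worker environment contains COVERAGE_PROCESS_START, but it is set to an empty string. Coverage.py's documented mechanism for measuring subprocesses is for each new Python process to call coverage.process_startup() very early (typically from a sitecustomize.py or a .pth file on its path), and that hook only starts measurement when COVERAGE_PROCESS_START points at a config file, so an empty value behaves like "not set". A minimal sketch of that setup (the standard coverage.py subprocess recipe, not something from this issue, and assuming Spark's Python workers import sitecustomize and inherit the variable):

#!python
# sitecustomize.py -- placed on the path of every Python process to be measured.
# coverage.process_startup() does nothing unless COVERAGE_PROCESS_START points
# at a coverage config file; if it does, measurement starts for this process.
import coverage

coverage.process_startup()

The config file that COVERAGE_PROCESS_START points to would normally set parallel = True under [run], so each worker writes its own .coverage.* data file, and the files are merged afterwards with coverage combine.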
Does someone have a workaround? 😢
@AndrewLane experimenting a bit with this, my guess is that the code is running in a subprocess, but that process is started in a way that doesn’t get coverage started on it, perhaps because it’s started from Java.
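If that guess is right, a possible (untested here) workaround is to wire up the subprocess-measurement recipe sketched above before the SparkSession is created, so that whatever Python workers Spark launches start coverage themselves. A rough sketch, assuming the workers inherit the driver's environment and import sitecustomize, and that they live long enough to flush their data files (the .coveragerc name below is just the conventional one, not taken from this issue):

#!python
# Driver-side setup, before pyspark is imported and the session is built.
import os

# Point the subprocess hook at a real config file (one with `parallel = True`
# under [run]); an empty value, as seen in the logs above, disables the hook.
os.environ['COVERAGE_PROCESS_START'] = os.path.abspath('.coveragerc')

# ...then build the SparkSession and run the job as in the repro above, and
# afterwards merge the per-process data files:
#   coverage combine
#   coverage report -m

Whether this actually captures the UDF depends on how the workers are spawned (for example, forked from pyspark's daemon rather than started as fresh interpreters, which would bypass sitecustomize), so treat it as a starting point rather than a confirmed fix.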