Coverage of pyspark user defined function
See original GitHub issue. Originally reported by Abdeali Kothari (Bitbucket: AbdealiJK, GitHub: AbdealiJK).
I have some PySpark code in my code base and I am trying to test it. While doing that, I find that any Python UDF I register with Spark does not get covered, even though it does run. Note that I am running Spark in local mode.
Reproducible example:
#!python
def get_new_col(spark, df):
    def myadd(x, y):
        import sys, os
        print("sys.version_info =", sys.version_info)
        print({k: v for k, v in os.environ.items() if k.lower().startswith('cov')})
        x1 = x
        y1 = y
        return str(float(x1) + float(y1))
    spark.udf.register('myadd', myadd)
    return df.selectExpr(['*', 'myadd(x, y) as newcol'])

def run():
    try:
        import findspark
        findspark.init()
    except ImportError:
        pass
    import pyspark
    spark = pyspark.sql.SparkSession.Builder().master("local[2]").getOrCreate()
    df = spark.createDataFrame([
        [1.0, 1.0],
        [1.0, 2.0],
        [1.0, 2.0]
    ], ['x', 'y'])
    outdf = get_new_col(spark, df)
    outdf.show()
    outdf.printSchema()
    assert outdf.columns == (df.columns + ['newcol'])
    spark.stop()

if __name__ == '__main__':
    run()
Coverage reports the UDF body as not covered, even though it clearly ran.
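(To double-check that the function really executes, and where, one could print the process id from inside the UDF and from the driver; this is just an illustrative tweak to the repro above, not part of the original report.)

#!python
import os

def myadd(x, y):
    # Executed inside a Python worker process launched for Spark,
    # not in the driver process that `coverage run` started.
    print("worker pid:", os.getpid())
    return str(float(x) + float(y))

print("driver pid:", os.getpid())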
Here are the logs when I run it:
#!python
$ coverage run example.py
2018-05-04 14:58:29 WARN NativeCodeLoader:62 - Unable to load native-hadoop library for your platform... using builtin-java classes where applicable
Setting default log level to "WARN".
To adjust logging level use sc.setLogLevel(newLevel). For SparkR, use setLogLevel(newLevel).
2018-05-04 14:58:30 WARN Utils:66 - Service 'SparkUI' could not bind on port 4040. Attempting port 4041.
[Stage 0:> (0 + 1) / 1]sys.version_info = sys.version_info(major=3, minor=6, micro=4, releaselevel='final', serial=0)
{'COVERAGE_PROCESS_START': ''}
sys.version_info = sys.version_info(major=3, minor=6, micro=4, releaselevel='final', serial=0)
{'COVERAGE_PROCESS_START': ''}
sys.version_info = sys.version_info(major=3, minor=6, micro=4, releaselevel='final', serial=0)
{'COVERAGE_PROCESS_START': ''}
+---+---+------+
| x| y|newcol|
+---+---+------+
|1.0|1.0| 2.0|
|1.0|2.0| 3.0|
|1.0|2.0| 3.0|
+---+---+------+
root
|-- x: double (nullable = true)
|-- y: double (nullable = true)
|-- newcol: string (nullable = true)
Relevant packages: Python 3.6.4 (Anaconda), coverage 4.5.1.
Edit 1: Simplified the reproducible example to remove unittest and pytest.
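One detail visible in the logs: the worker environment contains COVERAGE_PROCESS_START, but it is set to an empty string. Coverage.py's documented mechanism for measuring subprocesses is for each new Python process to call coverage.process_startup() very early (typically from a sitecustomize.py or a .pth file on its path), and that hook only starts measurement when COVERAGE_PROCESS_START points at a config file, so an empty value behaves like "not set". A minimal sketch of that setup (the standard coverage.py subprocess recipe, not something from this issue, and assuming Spark's Python workers import sitecustomize and inherit the variable):

#!python
# sitecustomize.py -- placed on the path of every Python process to be measured.
# coverage.process_startup() does nothing unless COVERAGE_PROCESS_START points
# at a coverage config file; if it does, measurement starts for this process.
import coverage

coverage.process_startup()

The config file that COVERAGE_PROCESS_START points to would normally set parallel = True under [run], so each worker writes its own .coverage.* data file, and the files are merged afterwards with coverage combine.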
Does someone have a workaround? 😢
@AndrewLane experimenting a bit with this, my guess is that the code is running in a subprocess, but that process is started in a way that doesn’t get coverage started on it, perhaps because it’s started from Java.
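If that guess is right, a possible (untested here) workaround is to wire up the subprocess-measurement recipe sketched above before the SparkSession is created, so that whatever Python workers Spark launches start coverage themselves. A rough sketch, assuming the workers inherit the driver's environment and import sitecustomize, and that they live long enough to flush their data files (the .coveragerc name below is just the conventional one, not taken from this issue):

#!python
# Driver-side setup, before pyspark is imported and the session is built.
import os

# Point the subprocess hook at a real config file (one with `parallel = True`
# under [run]); an empty value, as seen in the logs above, disables the hook.
os.environ['COVERAGE_PROCESS_START'] = os.path.abspath('.coveragerc')

# ...then build the SparkSession and run the job as in the repro above, and
# afterwards merge the per-process data files:
#   coverage combine
#   coverage report -m

Whether this actually captures the UDF depends on how the workers are spawned (for example, forked from pyspark's daemon rather than started as fresh interpreters, which would bypass sitecustomize), so treat it as a starting point rather than a confirmed fix.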