
How to validate schema with nested Pyspark ArrayType/StructType using Pandera?

See original GitHub issue

I’d like to do schema validation on a PySpark DataFrame that has an existing schema:

# nested data structure
structureData = [
    ([("James", "", "Smith")], "36636", "M", 3100),
    ([("Michael", "Rose", "")], "40288", "M", 4300),
    ([("Robert", "", "Williams")], "42114", "M", 1400),
    ([("Maria", "Anne", "Jones")], "39192", "F", 5500),
    ([("Jen", "Mary", "Brown")], "", "F", -1),
]

# nested name fields
from pyspark.sql.types import (
    StructType, StructField, ArrayType, StringType, IntegerType
)

structureSchema = StructType([
    StructField('name', ArrayType(StructType([
        StructField('firstname', StringType(), True),
        StructField('middlename', StringType(), True),
        StructField('lastname', StringType(), True),
    ]))),
    StructField('id', StringType(), True),
    StructField('gender', StringType(), True),
    StructField('salary', IntegerType(), True),
])

How would I go about specifying the schema in Pandera to match the above Pyspark schema?

import pandera as pa
from pandera import Column, DataFrameSchema

base_schema = pa.DataFrameSchema({
    "name": ???,  # <-- how do I specify a matching schema here?
    "id": pa.Column(str),
    "gender": pa.Column(str),
    "salary": pa.Column(int),
})

Issue Analytics

  • State: closed
  • Created: a year ago
  • Comments: 9

Top GitHub Comments

1 reaction
kvnkho commented, Apr 8, 2022

Considering Pandera doesn’t have support for PySpark DataFrames (at least for now), you can use Fugue and Pandera like this to validate each partition of the DataFrame as a Pandas DataFrame. Fugue handles the data type conversions for you.

0 reactions
WilliamCVan commented, Apr 10, 2022

@goodwanghan thanks for the help and code snippets; I am now able to validate the nested Spark DataFrame as I wanted. It works well after I upgraded fugue to version 0.6.6.dev3.

Read more comments on GitHub >

Top Results From Across the Web

PySpark StructType & StructField Explained with Examples
PySpark StructType & StructField classes are used to programmatically specify the schema to the DataFrame and create complex columns like nested.
Read more >
Defining PySpark Schemas with StructType and StructField
Use the printSchema() method to verify that the DataFrame has the exact schema we specified. df.printSchema() root |-- name: string (nullable = ...
Read more >
Validating Schema of Column with StructType in Pyspark 2.4
The real data has many, many keys, some of them nested, so checking each one with some form of isNan isn't feasible and...
Read more >
Nested Data Types in Spark 3.1. Working with structs in Spark ...
Struct. The StructType is a very important data type that allows representing nested hierarchical data. It can be used to group some fields ......
Read more >
Flattening and renaming Spark Dataframe having a complex ...
A Spark DataFrame can have a simple schema, where every single column is of a simple datatype like ... StructType nested in ArrayType....
Read more >
