How to validate schema with nested Pyspark ArrayType/StructType using Pandera?
I’d like to do schema validation on a PySpark DataFrame that has an existing nested schema:
# nested data structure
structureData = [
    ([("James", "", "Smith")], "36636", "M", 3100),
    ([("Michael", "Rose", "")], "40288", "M", 4300),
    ([("Robert", "", "Williams")], "42114", "M", 1400),
    ([("Maria", "Anne", "Jones")], "39192", "F", 5500),
    ([("Jen", "Mary", "Brown")], "", "F", -1)
]
from pyspark.sql.types import (
    ArrayType, IntegerType, StringType, StructField, StructType
)

# nested name fields
structureSchema = StructType([
    StructField('name', ArrayType(StructType([
        StructField('firstname', StringType(), True),
        StructField('middlename', StringType(), True),
        StructField('lastname', StringType(), True)
    ]))),
    StructField('id', StringType(), True),
    StructField('gender', StringType(), True),
    StructField('salary', IntegerType(), True)
])
How would I go about specifying a matching schema in Pandera for the PySpark schema above?
import pandera as pa

base_schema = pa.DataFrameSchema({
    "name": ???,  # <-- HOW TO SPECIFY MATCHING SCHEMA?
    "id": pa.Column(str),
    "gender": pa.Column(str),
    "salary": pa.Column(int),
})
Issue Analytics
- Created a year ago
- Comments: 9
Top Results From Across the Web
- PySpark StructType & StructField Explained with Examples: the StructType and StructField classes are used to programmatically specify the schema of a DataFrame and to create complex columns such as nested structs.
- Defining PySpark Schemas with StructType and StructField: use the printSchema() method to verify that the DataFrame has the exact schema specified.
- Validating Schema of Column with StructType in Pyspark 2.4: the real data has many keys, some of them nested, so checking each one with some form of isNan isn't feasible.
- Nested Data Types in Spark 3.1: the StructType is an important data type for representing nested hierarchical data; it can be used to group fields together.
- Flattening and renaming Spark Dataframe having a complex schema: a Spark DataFrame can have a simple schema where every column is of a simple datatype, or a complex one such as a StructType nested in an ArrayType.
Read more >Top Related Medium Post
No results found
Top Related StackOverflow Question
No results found
Troubleshoot Live Code
Lightrun enables developers to add logs, metrics and snapshots to live code - no restarts or redeploys required.
Start FreeTop Related Reddit Thread
No results found
Top Related Hackernoon Post
No results found
Top Related Tweet
No results found
Top Related Dev.to Post
No results found
Top Related Hashnode Post
No results found
Top GitHub Comments
Considering Pandera doesn’t support PySpark DataFrames (at least for now), you can use Fugue together with Pandera to validate each partition of the DataFrame as a pandas DataFrame. Fugue handles the data type conversions for you.
@goodwanghan thanks for the help and code snippets, I am now able to validate the nested Spark DataFrame as I wanted. It works well after I upgraded Fugue to version 0.6.6.dev3.