
How to validate schema with nested Pyspark ArrayType/StructType using Pandera?

See original GitHub issue

I’d like to do schema validation on a PySpark DataFrame that has an existing schema:

# nested data structure
structureData = [
    ([("James", "", "Smith")], "36636", "M", 3100),
    ([("Michael", "Rose", "")], "40288", "M", 4300),
    ([("Robert", "", "Williams")], "42114", "M", 1400),
    ([("Maria", "Anne", "Jones")], "39192", "F", 5500),
    ([("Jen", "Mary", "Brown")], "", "F", -1),
]

# nested name fields
from pyspark.sql.types import (
    StructType, StructField, ArrayType, StringType, IntegerType
)

structureSchema = StructType([
    StructField('name', ArrayType(StructType([
        StructField('firstname', StringType(), True),
        StructField('middlename', StringType(), True),
        StructField('lastname', StringType(), True),
    ]))),
    StructField('id', StringType(), True),
    StructField('gender', StringType(), True),
    StructField('salary', IntegerType(), True),
])

How would I go about specifying the schema in Pandera to match the above Pyspark schema?

import pandera as pa
from pandera import Column, DataFrameSchema

base_schema = pa.DataFrameSchema({
    "name": ???,  # <-- how do I specify a matching schema here?
    "id": pa.Column(str),
    "gender": pa.Column(str),
    "salary": pa.Column(int),
})

Issue Analytics

  • State: closed
  • Created: a year ago
  • Comments: 9

Top GitHub Comments

1 reaction
kvnkho commented, Apr 8, 2022

Considering Pandera doesn’t have support for PySpark DataFrames (at least for now), you can use Fugue and Pandera like this to validate each partition of the DataFrame as a Pandas DataFrame. Fugue handles the data type conversions for you.

0 reactions
WilliamCVan commented, Apr 10, 2022

@goodwanghan thanks for the help and code snippets; I am now able to validate the nested Spark DataFrame as I wanted. It works well after I upgraded fugue to version 0.6.6.dev3.

Read more comments on GitHub >

Top Results From Across the Web

PySpark StructType & StructField Explained with Examples
PySpark StructType & StructField classes are used to programmatically specify the schema to the DataFrame and create complex columns like nested.
Read more >
Defining PySpark Schemas with StructType and StructField
Use the printSchema() method to verify that the DataFrame has the exact schema we specified. df.printSchema() root |-- name: string (nullable = ...
Read more >
Validating Schema of Column with StructType in Pyspark 2.4
The real data has many, many keys, some of them nested, so checking each one with some form of isNan isn't feasible and...
Read more >
Nested Data Types in Spark 3.1. Working with structs in Spark ...
Struct. The StructType is a very important data type that allows representing nested hierarchical data. It can be used to group some fields ......
Read more >
Flattening and renaming Spark Dataframe having a complex ...
A Spark DataFrame can have a simple schema, where every single column is of a simple datatype like ... StructType nested in ArrayType....
Read more >
