Writing map datatype
I’m trying to use fastparquet to integrate a Python service into an existing Spark (pySpark) application. I need to be able to read/write parquet files from both fastparquet and Spark.
I’m having trouble writing a map-type column from fastparquet (I’m able to read them no problem).
import fastparquet
# Read in a parquet file generated by pySpark.
pf = fastparquet.ParquetFile("./subset_part-r-02572-bddb92a4-e16d-4e9b-b0fd-4454c0c28525.snappy.parquet")
df = pf.to_pandas(timestamp96=['timestamp'])
assert df.loc[3].verticals == {u'401': u'1.0', u'58': u'0.5027621388435364'}, "Sanity check"
assert df.loc[3].verticals['58'] == u'0.5027621388435364', "Sanity check"
# Attempt to round trip the dataframe back out to Parquet in the same format as the input file.
fastparquet.write('write.parquet', df, compression='SNAPPY', file_scheme='hive', has_nulls=True, times='int96')
Now using parquet-tools to inspect the output.
# Input file generated with pySpark
$ parquet-tools schema subset_part-r-02572-bddb92a4-e16d-4e9b-b0fd-4454c0c28525.snappy.parquet
message spark_schema {
  required binary activity_id (UTF8);
  required int96 timestamp;
  optional binary uid (UTF8);
  optional group verticals (MAP) {
    repeated group key_value {
      required binary key (UTF8);
      required binary value (UTF8);
    }
  }
  optional binary url (UTF8);
  optional binary domain (UTF8);
  optional binary sld (UTF8);
  optional binary zip_code (UTF8);
}
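For readers unfamiliar with the layout above: a MAP column is stored as a group containing a repeated key_value group, so each dict cell becomes a sequence of (key, value) entries. The sketch below only illustrates that shape in plain Python. The helper names are hypothetical and this is not how fastparquet or Spark encode the column internally.

```python
from typing import Dict, List, Optional, Tuple

KeyValue = Tuple[str, str]


def map_to_key_value(cell: Optional[Dict[str, str]]) -> Optional[List[KeyValue]]:
    """Flatten one map cell into the entries of the repeated key_value group."""
    if cell is None:
        # The outer group is optional, so a whole map may be null.
        return None
    return [(key, value) for key, value in cell.items()]


def key_value_to_map(entries: Optional[List[KeyValue]]) -> Optional[Dict[str, str]]:
    """Rebuild the dict from its key_value entries."""
    if entries is None:
        return None
    return dict(entries)


# The cell checked in the sanity assertions above.
cell = {"401": "1.0", "58": "0.5027621388435364"}
entries = map_to_key_value(cell)
assert key_value_to_map(entries) == cell
```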
fastparquet is writing the verticals column as type JSON instead of type MAP.
# Output file generated by fastparquet
$ parquet-tools schema write.parquet
message schema {
  optional binary activity_id (UTF8);
  optional int96 timestamp;
  optional binary uid (UTF8);
  optional binary verticals (JSON);
  optional binary url (UTF8);
  optional binary domain (UTF8);
  optional binary sld (UTF8);
  optional binary zip_code (UTF8);
}
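To make the difference concrete: a binary column annotated as JSON holds one serialized document per cell, so a consumer expecting a MAP group has to parse each cell itself. The sketch below approximates that encoding with the standard library only. The function names are hypothetical and this is not fastparquet's actual implementation.

```python
import json
from typing import Dict, Optional


def encode_json_cell(cell: Optional[Dict[str, str]]) -> Optional[bytes]:
    """Serialize one dict cell to UTF-8 JSON bytes, keeping nulls as nulls."""
    if cell is None:
        return None
    return json.dumps(cell).encode("utf-8")


def decode_json_cell(raw: Optional[bytes]) -> Optional[Dict[str, str]]:
    """Parse the stored bytes back into a dict."""
    if raw is None:
        return None
    return json.loads(raw.decode("utf-8"))


cell = {"401": "1.0", "58": "0.5027621388435364"}
stored = encode_json_cell(cell)
# The stored value is an opaque byte string, not a key_value group.
assert isinstance(stored, bytes)
assert decode_json_cell(stored) == cell
```

This round trip is why reading the file back with fastparquet still yields dicts, while the schema reported by parquet-tools no longer matches the one Spark wrote.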
Here is the schema used by pySpark
from pyspark.sql.types import *
schema = StructType([
    StructField("activity_id", StringType(), False),
    StructField("timestamp", TimestampType(), False),
    StructField("date", StringType(), False),
    StructField("uid", StringType(), True),
    StructField("verticals", MapType(StringType(), StringType(), False), True),
    StructField("url", StringType(), True),
    StructField("domain", StringType(), True),
    StructField("sld", StringType(), True),
    StructField("zip_code", StringType(), True),
])
Also, how do you make fastparquet mark a column as “required” instead of “optional” in the schema?
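One possible approach to the required/optional question, sketched under an assumption: the fastparquet documentation describes has_nulls as accepting a list of column names, in which case only the listed columns are marked optional and the rest are not. If that holds for the installed version, passing has_nulls=True (as in the write call above) would explain why every column came out optional. The helper names below are hypothetical, and any column treated as required must not contain nulls.

```python
from typing import Iterable, List


def optional_columns(all_columns: Iterable[str], required: Iterable[str]) -> List[str]:
    """Return the columns that should stay optional, preserving their order."""
    required_set = set(required)
    return [name for name in all_columns if name not in required_set]


def write_with_required(path, df, required, **write_kwargs):
    """Write df so that only columns outside `required` are marked optional.

    Assumes fastparquet.write treats a list passed as has_nulls as the
    set of optional columns; check this against the installed version.
    """
    import fastparquet  # imported lazily so optional_columns works without it

    fastparquet.write(
        path,
        df,
        has_nulls=optional_columns(df.columns, required),
        **write_kwargs,
    )


columns = ["activity_id", "timestamp", "uid", "verticals"]
assert optional_columns(columns, ["activity_id", "timestamp"]) == ["uid", "verticals"]
```

A call such as write_with_required('write.parquet', df, ['activity_id', 'timestamp'], compression='SNAPPY', file_scheme='hive', times='int96') would then mirror the original write while keeping those two columns required.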
Issue Analytics
- State:
- Created 6 years ago
- Reactions:4
- Comments:11 (6 by maintainers)
Top GitHub Comments
I’ll consider it, but my initial thought is that it would be a lot of effort to implement writing MAP/LIST types for what seems like a niche use-case. I am surprised that Athena can’t make use of the columns as they are.
@yohplala, this is also mentioned in the (old) comment stream above, but the OP did not want to resort to JSON encoding.