question-mark
Stuck on an issue?

Lightrun Answers was designed to reduce the constant googling that comes with debugging 3rd party libraries. It collects links to all the places you might be looking at while hunting down a tough bug.

And, if you’re still stuck at the end, we’re happy to hop on a call to see how we can help out.

Writing map datatype

See original GitHub issue

I’m trying to use fastparquet to integrate a Python service into an existing Spark (pySpark) application. I need to be able to read/write parquet files from both fastparquet and Spark.

I’m having trouble writing a map-type column from fastparquet (I’m able to read them no problem).

import fastparquet

# Read in parquet file generate by pySpark.
pf = fastparquet.ParquetFile("./subset_part-r-02572-bddb92a4-e16d-4e9b-b0fd-4454c0c28525.snappy.parquet")

df = pf.to_pandas(timestamp96=['timestamp'])
assert df.loc[3].verticals == {u'401': u'1.0', u'58': u'0.5027621388435364'}, "Sanity check"
assert df.loc[3].verticals['58'] == u'0.5027621388435364', "Sanity check"

# Attempt to round trip the dataframe back out to Parquet in the same format as the input file.
fastparquet.write('write.parquet', df, compression='SNAPPY', file_scheme='hive', has_nulls=True, times='int96')

Now using parquet-tools to inspect the output.

# Input file generated with pySpark

$ parquet-tools schema subset_part-r-02572-bddb92a4-e16d-4e9b-b0fd-4454c0c28525.snappy.parquet
message spark_schema {
  required binary activity_id (UTF8);
  required int96 timestamp;
  optional binary uid (UTF8);
  optional group verticals (MAP) {
    repeated group key_value {
      required binary key (UTF8);
      required binary value (UTF8);
    }
  }
  optional binary url (UTF8);
  optional binary domain (UTF8);
  optional binary sld (UTF8);
  optional binary zip_code (UTF8);
}

fastparquet is writing the verticals column as type JSON instead of type MAP.

# Output file generate by fastparquet

$ parquet-tools schema write.parquet
message schema {
  optional binary activity_id (UTF8);
  optional int96 timestamp;
  optional binary uid (UTF8);
  optional binary verticals (JSON);
  optional binary url (UTF8);
  optional binary domain (UTF8);
  optional binary sld (UTF8);
  optional binary zip_code (UTF8);
}

Here is the schema used by pySpark

from pyspark.sql.types import *
schema = StructType([
        StructField("activity_id", StringType(), False),
        StructField("timestamp", TimestampType(), False),
        StructField("date", StringType(), False),
        StructField("uid", StringType(), True),
        StructField("verticals", MapType(StringType(), StringType(), False), True),
        StructField("url", StringType(), True),
        StructField("domain", StringType(), True),
        StructField("sld", StringType(), True),
        StructField("zip_code", StringType(), True),
])

Also, how do you make fastparquet mark a column as “required” instead of “optional” in the schema?

Issue Analytics

  • State:closed
  • Created 6 years ago
  • Reactions:4
  • Comments:11 (6 by maintainers)

github_iconTop GitHub Comments

4reactions
martindurantcommented, Jul 12, 2017

I’ll consider it, but my initial thought is that it would be a lot of effort to implement writing MAP/LIST types for what seems like a niche use-case. I am surprised that Athena can’t make use of the columns as they are.

1reaction
martindurantcommented, Jun 14, 2021

@yohplala , this is mentioned in the (old) comment stream above too, but the OP did not want to resort to JSON encoding.

Read more comments on GitHub >

github_iconTop Results From Across the Web

Map - JavaScript - MDN Web Docs - Mozilla
The Map object holds key-value pairs and remembers the original insertion order of the keys. Any value (both objects and primitive values) ...
Read more >
Map and Set - The Modern JavaScript Tutorial
Map. Map is a collection of keyed data items, just like an Object . But the main difference is that Map allows keys...
Read more >
Understanding Map and Set Objects in JavaScript | DigitalOcean
A Map is a collection of key/value pairs that can use any data type as a key and can maintain the order of...
Read more >
Map Abstract Data Type - GeeksforGeeks
The map is an abstract data type that contains a collection of records. It is an interface, that states what all operations can...
Read more >
Map Data Type - Informatica Documentation
A map data type represents an unordered collection of key-value pair elements. A map element is a key and value pair that maps...
Read more >

github_iconTop Related Medium Post

No results found

github_iconTop Related StackOverflow Question

No results found

github_iconTroubleshoot Live Code

Lightrun enables developers to add logs, metrics and snapshots to live code - no restarts or redeploys required.
Start Free

github_iconTop Related Reddit Thread

No results found

github_iconTop Related Hackernoon Post

No results found

github_iconTop Related Tweet

No results found

github_iconTop Related Dev.to Post

No results found

github_iconTop Related Hashnode Post

No results found