Read Parquet directly into string[pyarrow]
I would like to read Parquet data directly into pyarrow-backed string dtypes (`string[pyarrow]`).
So I'm trying this naively on a dataset:
```python
import dask.dataframe as dd

df = dd.read_parquet(
    "s3://nyc-tlc/trip data/fhvhv_tripdata_2022-06.parquet",
    split_row_groups=True,
    use_nullable_dtypes=True,
)
df.dtypes
```

```
hvfhs_license_num               object
dispatching_base_num            object
originating_base_num            object
request_datetime        datetime64[ns]
on_scene_datetime       datetime64[ns]
pickup_datetime         datetime64[ns]
dropoff_datetime        datetime64[ns]
PULocationID                     int64
DOLocationID                     int64
trip_miles                     float64
trip_time                        int64
base_passenger_fare            float64
tolls                          float64
bcf                            float64
sales_tax                      float64
congestion_surcharge           float64
airport_fee                    float64
tips                           float64
driver_pay                     float64
shared_request_flag             object
shared_match_flag               object
access_a_ride_flag              object
wav_request_flag                object
wav_match_flag                  object
dtype: object
```
This is especially important to me because in this case one row group is 10 GB when stored naively. The data is effectively unreadable on modest machines in its current state. I also suspect that I'm spending almost all of my time just creating and then destroying Python objects.
Any thoughts @rjzamora @ian-r-rose ?
Issue Analytics
- State:
- Created: a year ago
- Comments: 30 (23 by maintainers)
Top GitHub Comments
Awesome! I put it on the meeting agenda so would be great to hear a representative pitch from the Dask side
Oh cool. And memory usage is lower in that case too.
OK, I guess then this becomes more a question of how Dask could be leveraging this as well