Read Parquet directly into string[pyarrow]
I would like to read Parquet data directly into pyarrow-backed string dtypes (`string[pyarrow]`).
So I'm trying this naively on a dataset:
```python
import dask.dataframe as dd

df = dd.read_parquet(
    "s3://nyc-tlc/trip data/fhvhv_tripdata_2022-06.parquet",
    split_row_groups=True,
    use_nullable_dtypes=True,
)
df.dtypes
```

```
hvfhs_license_num               object
dispatching_base_num            object
originating_base_num            object
request_datetime        datetime64[ns]
on_scene_datetime       datetime64[ns]
pickup_datetime         datetime64[ns]
dropoff_datetime        datetime64[ns]
PULocationID                     int64
DOLocationID                     int64
trip_miles                     float64
trip_time                        int64
base_passenger_fare            float64
tolls                          float64
bcf                            float64
sales_tax                      float64
congestion_surcharge           float64
airport_fee                    float64
tips                           float64
driver_pay                     float64
shared_request_flag             object
shared_match_flag               object
access_a_ride_flag              object
wav_request_flag                object
wav_match_flag                  object
dtype: object
```
This is especially important to me because in this case one row group is 10 GB when stored naively. The data is effectively unreadable on modest machines in its current state. I also suspect that I'm spending almost all of my time just creating and then destroying Python objects.
Any thoughts @rjzamora @ian-r-rose ?
Issue Analytics
- State:
- Created: a year ago
- Comments: 30 (23 by maintainers)
Top GitHub Comments
Awesome! I put it on the meeting agenda so would be great to hear a representative pitch from the Dask side
Oh cool. And memory usage is lower in that case too.
OK, I guess then this becomes more a question of how Dask could be leveraging this as well