PERF: Consider whether to avoid partition.length() in the parquet dispatcher.
Per @YarShev here, we should probably not call `partition.length()` to get partition sizes in the parquet dispatcher (a code sketch of an alternative follows this list):
- Even if we have already materialized the index objects in `build_index`, `ray.get` for the already computed size may be expensive (we should check this).
- If we haven't materialized the index in `build_index`, the `length()` call may be unnecessarily blocking (maybe something else will block anyway, though?).
Can you give an example? I'm guessing you're referring to `pandas.DatetimeTZDtype`?
@jbrockmendel I don't have a minimal example I could show off the bat, but I was wondering if `pandas.DatetimeTZDtype` could cause some trouble here. I've had some problems before, but maybe the type mappings between Arrow and pandas are better now.
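For reference, a small check (not taken from the issue itself) of whether a tz-aware column survives a pandas → Arrow → pandas round trip; the column name and timezone are arbitrary.

```python
import pandas as pd
import pyarrow as pa

# A tz-aware column, whose dtype is a pandas.DatetimeTZDtype.
df = pd.DataFrame(
    {"ts": pd.date_range("2021-01-01", periods=3, freq="D", tz="US/Eastern")}
)
print(df.dtypes["ts"])  # datetime64[ns, US/Eastern]

# Round-trip through Arrow and compare the resulting dtype.
roundtripped = pa.Table.from_pandas(df).to_pandas()
print(roundtripped.dtypes["ts"])  # should still be the same tz-aware dtype
```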