
PERF: Consider whether to avoid partition.length() in the parquet dispatcher.

See original GitHub issue

Per @YarShev here, we should probably avoid calling partition.length() to get partition sizes in the parquet dispatcher:

  • Even if we have already materialized the index objects in build_index, calling ray.get for an already-computed size may be expensive (we should check this).
  • If we haven’t materialized the index in build_index, the length() call may block unnecessarily (though perhaps something else will block anyway?).

Issue Analytics

  • State: open
  • Created: a year ago
  • Comments: 6 (6 by maintainers)

Top GitHub Comments

1 reaction
jbrockmendel commented, Aug 9, 2022

Can you give an example? I’m guessing you’re referring to pandas.DatetimeTZDtype?

0 reactions
pyrito commented, Aug 11, 2022

@jbrockmendel I don’t have a minimal example I could show off the bat, but I was wondering if pandas.DatetimeTZDtype could cause some trouble here. I’ve had some problems before, but maybe the type mappings between Arrow and pandas are better now.


