
PERF: Consider whether to avoid partition.length() in the parquet dispatcher.

See original GitHub issue

Per @YarShev here, we should probably avoid calling partition.length() to get partition sizes in the parquet dispatcher:

  • Even if we have already materialized the index objects in build_index, calling ray.get for an already-computed size may be expensive (we should check this).
  • If we haven’t materialized the index in build_index, the length() call may block unnecessarily (though perhaps something else will block anyway?).

Issue Analytics

  • State: open
  • Created: a year ago
  • Comments: 6 (6 by maintainers)

Top GitHub Comments

1 reaction
jbrockmendel commented, Aug 9, 2022

Can you give an example? I’m guessing you’re referring to pandas.DatetimeTZDtype?

0 reactions
pyrito commented, Aug 11, 2022

@jbrockmendel I don’t have a minimal example I could show off the bat, but I was wondering if pandas.DatetimeTZDtype could cause some trouble here. I’ve had some problems before, but maybe the type mappings between Arrow and pandas are better now.


