
Reading parquet using smart_open+pandas is 3x slower than pandas


Problem description

Reading a parquet file from S3 with smart_open + pandas + pyarrow is seriously slower (about 3x) than using pandas + pyarrow alone. I independently tried tuning the buffering argument and the buffer_size transport parameter, with no luck; a sketch of those attempts follows.
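
For reference, this is roughly what the tuning looked like (a minimal sketch with an illustrative URI and sizes; buffering is the built-in open()-style argument that smart_open.open mirrors, and buffer_size is an S3 transport parameter in smart_open 2.x, as I understand the API):

import pandas as pd
import smart_open

# Neither a larger buffering hint nor a larger S3 reader buffer made a
# measurable difference for the read below.
with smart_open.open(
    "s3://bucket/key.parquet",  # illustrative URI
    "rb",
    buffering=16 * 1024 * 1024,
    transport_params={"buffer_size": 16 * 1024 * 1024},
) as file:
    df = pd.read_parquet(file, engine="pyarrow")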

Steps/code to reproduce the problem

import datetime
import timeit

import boto3
import pandas as pd
import pyarrow
import s3path
import smart_open

PARQUET_URI_IN = "s3://PLEASE-USE-YOUR/OWN/FILE.parquet"  # CUSTOMIZE! File size must be at least a few MiB.

BOTO3_VER = f"boto3=={boto3.__version__}"
PANDAS_VER = f"pandas=={pd.__version__}"
PYARROW_VER = f"pyarrow=={pyarrow.__version__}"
SMART_OPEN_VER = f"smart_open=={smart_open.__version__}"


class Timer:
    """Measure time used."""

    # Ref: https://stackoverflow.com/a/57931660/
    def __init__(self, round_n_digits: int = 0):
        self._round_n_digits = round_n_digits
        self._start_time = timeit.default_timer()

    def __call__(self) -> float:
        return timeit.default_timer() - self._start_time

    def __str__(self) -> str:
        return str(datetime.timedelta(seconds=round(self(), self._round_n_digits)))


# Warmup using boto:
path = s3path.S3Path.from_uri(PARQUET_URI_IN)
timer = Timer()
boto3.client("s3").get_object(Bucket=str(path.bucket)[1:], Key=str(path.key))["Body"].read()
print(f"Warmed up a parquet file from S3 using {BOTO3_VER} in {timer}.")

# Read without smart_open:
timer = Timer()
df = pd.read_parquet(PARQUET_URI_IN, engine="pyarrow")
print(f"Read a dataframe from a parquet file from S3 using {PANDAS_VER} w/ {PYARROW_VER} in {timer}.")

# Read with smart_open:
timer = Timer()
with smart_open.open(PARQUET_URI_IN, "rb") as file:
    df = pd.read_parquet(file, engine="pyarrow")
print(f"Read a dataframe from a parquet file from S3 using {SMART_OPEN_VER} w/ {PANDAS_VER} w/ {PYARROW_VER} in {timer}.")

Versions

Please provide the output of:

import platform, sys, smart_open
print(platform.platform())
print("Python", sys.version)
print("smart_open", smart_open.__version__)
macOS-10.15.3-x86_64-i386-64bit
Python 3.8.4 | packaged by conda-forge | (default, Jul 17 2020, 14:54:34) 
[Clang 10.0.0 ]
smart_open 2.1.0

Output

Trial 1:

Warmed up a parquet file from S3 using boto3==1.14.3 in 0:00:03.
Read a dataframe from a parquet file from S3 using pandas==1.0.5 w/ pyarrow==0.17.1 in 0:00:06.
Read a dataframe from a parquet file from S3 using smart_open==2.1.0 w/ pandas==1.0.5 w/ pyarrow==0.17.1 in 0:00:18.

Trial 2:

Warmed up a parquet file from S3 using boto3==1.14.3 in 0:00:02.
Read a dataframe from a parquet file from S3 using pandas==1.0.5 w/ pyarrow==0.17.1 in 0:00:05.
Read a dataframe from a parquet file from S3 using smart_open==2.1.0 w/ pandas==1.0.5 w/ pyarrow==0.17.1 in 0:00:16.
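
For what it's worth, a workaround rather than a fix: the boto3 warmup in the script above already fetches the whole object in a single GET, so the same call can feed an in-memory buffer before handing the data to pandas. A sketch, reusing the names from the reproduction script and assuming the file fits comfortably in memory:

import io

# One GET for the whole object (the same call as the warmup), after which
# pyarrow can seek freely within the in-memory buffer at no network cost.
body = boto3.client("s3").get_object(Bucket=str(path.bucket)[1:], Key=str(path.key))["Body"].read()
df = pd.read_parquet(io.BytesIO(body), engine="pyarrow")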

Checklist

Before you create the issue, please make sure you have:

  • Described the problem clearly
  • Provided a minimal reproducible example, including any required data
  • Provided the version numbers of the relevant software

Issue Analytics

  • State: open
  • Created: 3 years ago
  • Reactions: 1
  • Comments: 7 (3 by maintainers)

Top GitHub Comments

piskvorky commented, Jul 24, 2020 (3 reactions)

Yes, that’s what I meant. The issue appears only with smart_open and parquet, not with smart_open and csv (for example). That’s a strong clue.

We’ll look into this, thanks for the clear report. I can’t promise any timeline, though, as we’re all quite busy. If you’re able to check yourself what requests pandas sends via boto3 (vs. what smart_open sends), that’d be great; nothing jumps to mind immediately. One way to do that is sketched below.
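
One way to check that, sketched with an illustrative URI: register a botocore event hook that records every GetObject call, and hand the instrumented session to smart_open via transport_params (which smart_open 2.x accepts, to my understanding). Comparing the request count and Range headers between the smart_open and plain-pandas paths should show where the time goes:

import boto3
import pandas as pd
import smart_open

calls = []
session = boto3.Session()
# botocore emits provide-client-params.s3.GetObject before each GetObject;
# recording the params captures the request count and any Range headers.
session.events.register(
    "provide-client-params.s3.GetObject",
    lambda params, **kwargs: calls.append(dict(params)),
)

with smart_open.open(
    "s3://bucket/key.parquet", "rb", transport_params={"session": session}
) as file:
    pd.read_parquet(file, engine="pyarrow")

print(f"{len(calls)} GetObject requests:")
for call in calls:
    print(call.get("Range"))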

impredicative commented, Jul 23, 2020 (0 reactions)

Enabling DEBUG level logs may or may not help, but I’ll leave this to the developers.
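
For reference, botocore’s DEBUG logging can be switched on with boto3’s documented helper; it prints every request and response (including the S3 Range headers) to the console, which makes the two code paths easy to compare:

import logging

import boto3

# Stream full botocore request/response logs to the console.
boto3.set_stream_logger("botocore", logging.DEBUG)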
