
Reading parquet using smart_open+pandas is 3x slower than pandas


Problem description

Reading a parquet file from S3 with smart_open + pandas + pyarrow is seriously slower (about 3x) than using pandas + pyarrow alone. I independently tried tuning the buffering argument and the buffer_size transport parameter, with no luck; a sketch of those attempts follows.
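
For reference, this is roughly what the tuning looked like (a minimal sketch with an illustrative URI and sizes; buffering is the built-in open()-style argument that smart_open.open mirrors, and buffer_size is an S3 transport parameter in smart_open 2.x, as I understand the API):

import pandas as pd
import smart_open

# Neither a larger buffering hint nor a larger S3 reader buffer made a
# measurable difference for the read below.
with smart_open.open(
    "s3://bucket/key.parquet",  # illustrative URI
    "rb",
    buffering=16 * 1024 * 1024,
    transport_params={"buffer_size": 16 * 1024 * 1024},
) as file:
    df = pd.read_parquet(file, engine="pyarrow")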

Steps/code to reproduce the problem

import datetime
import timeit

import boto3
import pandas as pd
import pyarrow
import s3path
import smart_open

PARQUET_URI_IN = "s3://PLEASE-USE-YOUR/OWN/FILE.parquet"  # CUSTOMIZE! File size must be at least a few MiB.

BOTO3_VER = f"boto3=={boto3.__version__}"
PANDAS_VER = f"pandas=={pd.__version__}"
PYARROW_VER = f"pyarrow=={pyarrow.__version__}"
SMART_OPEN_VER = f"smart_open=={smart_open.__version__}"


class Timer:
    """Measure time used."""

    # Ref: https://stackoverflow.com/a/57931660/
    def __init__(self, round_n_digits: int = 0):
        self._round_n_digits = round_n_digits
        self._start_time = timeit.default_timer()

    def __call__(self) -> float:
        return timeit.default_timer() - self._start_time

    def __str__(self) -> str:
        return str(datetime.timedelta(seconds=round(self(), self._round_n_digits)))


# Warmup using boto:
path = s3path.S3Path.from_uri(PARQUET_URI_IN)
timer = Timer()
boto3.client("s3").get_object(Bucket=str(path.bucket)[1:], Key=str(path.key))["Body"].read()
print(f"Warmed up a parquet file from S3 using {BOTO3_VER} in {timer}.")

# Read without smart_open:
timer = Timer()
df = pd.read_parquet(PARQUET_URI_IN, engine="pyarrow")
print(f"Read a dataframe from a parquet file from S3 using {PANDAS_VER} w/ {PYARROW_VER} in {timer}.")

# Read with smart_open:
timer = Timer()
with smart_open.open(PARQUET_URI_IN, "rb") as file:
    df = pd.read_parquet(file, engine="pyarrow")
print(f"Read a dataframe from a parquet file from S3 using {SMART_OPEN_VER} w/ {PANDAS_VER} w/ {PYARROW_VER} in {timer}.")

Versions

Please provide the output of:

import platform, sys, smart_open
print(platform.platform())
print("Python", sys.version)
print("smart_open", smart_open.__version__)
macOS-10.15.3-x86_64-i386-64bit
Python 3.8.4 | packaged by conda-forge | (default, Jul 17 2020, 14:54:34) 
[Clang 10.0.0 ]
smart_open 2.1.0

Output

Trial 1:

Warmed up a parquet file from S3 using boto3==1.14.3 in 0:00:03.
Read a dataframe from a parquet file from S3 using pandas==1.0.5 w/ pyarrow==0.17.1 in 0:00:06.
Read a dataframe from a parquet file from S3 using smart_open==2.1.0 w/ pandas==1.0.5 w/ pyarrow==0.17.1 in 0:00:18.

Trial 2:

Warmed up a parquet file from S3 using boto3==1.14.3 in 0:00:02.
Read a dataframe from a parquet file from S3 using pandas==1.0.5 w/ pyarrow==0.17.1 in 0:00:05.
Read a dataframe from a parquet file from S3 using smart_open==2.1.0 w/ pandas==1.0.5 w/ pyarrow==0.17.1 in 0:00:16.
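
For what it's worth, a workaround rather than a fix: the boto3 warmup in the script above already fetches the whole object in a single GET, so the same call can feed an in-memory buffer before handing the data to pandas. A sketch, reusing the names from the reproduction script and assuming the file fits comfortably in memory:

import io

# One GET for the whole object (the same call as the warmup), after which
# pyarrow can seek freely within the in-memory buffer at no network cost.
body = boto3.client("s3").get_object(Bucket=str(path.bucket)[1:], Key=str(path.key))["Body"].read()
df = pd.read_parquet(io.BytesIO(body), engine="pyarrow")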

Checklist

Before you create the issue, please make sure you have:

  • Described the problem clearly
  • Provided a minimal reproducible example, including any required data
  • Provided the version numbers of the relevant software

Issue Analytics

  • State: open
  • Created: 3 years ago
  • Reactions: 1
  • Comments: 7 (3 by maintainers)

Top GitHub Comments

piskvorky commented, Jul 24, 2020 (3 reactions)

Yes, that’s what I meant. The issue appears only with smart_open and parquet, not with smart_open and csv (for example). That’s a strong clue.

We’ll look into this, thanks for the clear report. I can’t promise any timeline, though, as we’re all quite busy. If you’re able to check yourself what requests pandas sends via boto3 (vs. what smart_open sends), that’d be great; nothing jumps to mind immediately. One way to do that is sketched below.
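
One way to check that, sketched with an illustrative URI: register a botocore event hook that records every GetObject call, and hand the instrumented session to smart_open via transport_params (which smart_open 2.x accepts, to my understanding). Comparing the request count and Range headers between the smart_open and plain-pandas paths should show where the time goes:

import boto3
import pandas as pd
import smart_open

calls = []
session = boto3.Session()
# botocore emits provide-client-params.s3.GetObject before each GetObject;
# recording the params captures the request count and any Range headers.
session.events.register(
    "provide-client-params.s3.GetObject",
    lambda params, **kwargs: calls.append(dict(params)),
)

with smart_open.open(
    "s3://bucket/key.parquet", "rb", transport_params={"session": session}
) as file:
    pd.read_parquet(file, engine="pyarrow")

print(f"{len(calls)} GetObject requests:")
for call in calls:
    print(call.get("Range"))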

impredicative commented, Jul 23, 2020 (0 reactions)

Enabling DEBUG level logs may or may not help, but I’ll leave this to the developers.
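
For reference, botocore’s DEBUG logging can be switched on with boto3’s documented helper; it prints every request and response (including the S3 Range headers) to the console, which makes the two code paths easy to compare:

import logging

import boto3

# Stream full botocore request/response logs to the console.
boto3.set_stream_logger("botocore", logging.DEBUG)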
