Reading parquet using smart_open+pandas is 3x slower than pandas
Problem description

Reading a parquet file from S3 with smart_open + pandas + pyarrow is seriously slower (3x) than using just pandas + pyarrow. I independently tried optimizing the buffering and buffer_size options, with no luck.
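For reference, the tuning attempts looked roughly like the sketch below. This is a reconstruction, not a verbatim excerpt: the 16 MiB sizes are arbitrary, `buffering` is the standard `open()`-style argument that `smart_open.open` accepts, and `transport_params` is smart_open's mechanism for passing S3-specific options such as `buffer_size`.

```python
import pandas as pd
import smart_open

PARQUET_URI_IN = "s3://PLEASE-USE-YOUR/OWN/FILE.parquet"  # CUSTOMIZE!

# Attempt 1: a larger stdlib-style buffer via the `buffering` argument.
with smart_open.open(PARQUET_URI_IN, "rb", buffering=16 * 1024 * 1024) as file:
    df = pd.read_parquet(file, engine="pyarrow")

# Attempt 2: a larger smart_open-internal S3 read buffer via transport_params.
with smart_open.open(
    PARQUET_URI_IN,
    "rb",
    transport_params={"buffer_size": 16 * 1024 * 1024},
) as file:
    df = pd.read_parquet(file, engine="pyarrow")
```

Neither made a measurable difference.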
Steps/code to reproduce the problem
```python
import datetime
import timeit

import boto3
import pandas as pd
import pyarrow
import s3path
import smart_open

PARQUET_URI_IN = "s3://PLEASE-USE-YOUR/OWN/FILE.parquet"  # CUSTOMIZE! File size must be at least a few MiB.

BOTO3_VER = f"boto3=={boto3.__version__}"
PANDAS_VER = f"pandas=={pd.__version__}"
PYARROW_VER = f"pyarrow=={pyarrow.__version__}"
SMART_OPEN_VER = f"smart_open=={smart_open.__version__}"


class Timer:
    """Measure time used."""

    # Ref: https://stackoverflow.com/a/57931660/
    def __init__(self, round_n_digits: int = 0):
        self._round_n_digits = round_n_digits
        self._start_time = timeit.default_timer()

    def __call__(self) -> float:
        return timeit.default_timer() - self._start_time

    def __str__(self) -> str:
        return str(datetime.timedelta(seconds=round(self(), self._round_n_digits)))


# Warmup using boto3:
path = s3path.S3Path.from_uri(PARQUET_URI_IN)
timer = Timer()
boto3.client("s3").get_object(Bucket=str(path.bucket)[1:], Key=str(path.key))["Body"].read()
print(f"Warmed up a parquet file from S3 using {BOTO3_VER} in {timer}.")

# Read without smart_open:
timer = Timer()
df = pd.read_parquet(PARQUET_URI_IN, engine="pyarrow")
print(f"Read a dataframe from a parquet file from S3 using {PANDAS_VER} w/ {PYARROW_VER} in {timer}.")

# Read with smart_open:
timer = Timer()
with smart_open.open(PARQUET_URI_IN, "rb") as file:
    df = pd.read_parquet(file, engine="pyarrow")
print(f"Read a dataframe from a parquet file from S3 using {SMART_OPEN_VER} w/ {PANDAS_VER} w/ {PYARROW_VER} in {timer}.")
```
Versions
Please provide the output of:
```python
import platform, sys, smart_open
print(platform.platform())
print("Python", sys.version)
print("smart_open", smart_open.__version__)
```

```
macOS-10.15.3-x86_64-i386-64bit
Python 3.8.4 | packaged by conda-forge | (default, Jul 17 2020, 14:54:34)
[Clang 10.0.0 ]
smart_open 2.1.0
```
Output
Trial 1:

```
Warmed up a parquet file from S3 using boto3==1.14.3 in 0:00:03.
Read a dataframe from a parquet file from S3 using pandas==1.0.5 w/ pyarrow==0.17.1 in 0:00:06.
Read a dataframe from a parquet file from S3 using smart_open==2.1.0 w/ pandas==1.0.5 w/ pyarrow==0.17.1 in 0:00:18.
```

Trial 2:

```
Warmed up a parquet file from S3 using boto3==1.14.3 in 0:00:02.
Read a dataframe from a parquet file from S3 using pandas==1.0.5 w/ pyarrow==0.17.1 in 0:00:05.
Read a dataframe from a parquet file from S3 using smart_open==2.1.0 w/ pandas==1.0.5 w/ pyarrow==0.17.1 in 0:00:16.
```
Checklist
Before you create the issue, please make sure you have:
- Described the problem clearly
- Provided a minimal reproducible example, including any required data
- Provided the version numbers of the relevant software
Comments

Yes, that’s what I meant. The issue appears only with smart_open and parquet, not with smart_open and CSV, for example. That’s a strong clue.
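One way to make that clue concrete (a hypothetical diagnostic, not from the report): wrap the smart_open file object and count how often pyarrow calls seek() and read(). Parquet is read non-sequentially – the footer at the end of the file first, then individual column chunks – so a high seek count would point at random access over the S3 stream as the cost.

```python
import pandas as pd
import smart_open

class CallCounter:
    """Delegate to a wrapped file object, counting seek() and read() calls."""

    def __init__(self, wrapped):
        self._wrapped = wrapped
        self.seeks = 0
        self.reads = 0

    def seek(self, *args, **kwargs):
        self.seeks += 1
        return self._wrapped.seek(*args, **kwargs)

    def read(self, *args, **kwargs):
        self.reads += 1
        return self._wrapped.read(*args, **kwargs)

    def __getattr__(self, name):  # everything else (tell, closed, ...) passes through
        return getattr(self._wrapped, name)

with smart_open.open("s3://PLEASE-USE-YOUR/OWN/FILE.parquet", "rb") as file:  # CUSTOMIZE!
    counter = CallCounter(file)
    pd.read_parquet(counter, engine="pyarrow")
    print(f"seeks={counter.seeks} reads={counter.reads}")
```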
We’ll look into this, thanks for the clear report, although I cannot promise any timeline; we’re all quite busy. If you’re able to check for yourself what requests pandas sends via boto3, versus what smart_open sends, that’d be great – nothing jumps to mind immediately. Enabling DEBUG-level logs may or may not help, but I’ll leave that to the developers.
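For whoever picks this up, a minimal way to capture those requests (standard boto3 logging, nothing specific to this issue): enable botocore’s DEBUG logging and run each variant; every HTTP call, including any Range headers, then appears in the log. The plain-pandas path goes through s3fs rather than smart_open, but s3fs also sits on botocore, so both variants should show up.

```python
import logging

import boto3
import pandas as pd
import smart_open

# Route botocore's DEBUG logs (request params, headers, responses) to stderr.
boto3.set_stream_logger("botocore", logging.DEBUG)

URI = "s3://PLEASE-USE-YOUR/OWN/FILE.parquet"  # CUSTOMIZE!

pd.read_parquet(URI, engine="pyarrow")  # variant 1: plain pandas

with smart_open.open(URI, "rb") as file:  # variant 2: via smart_open
    pd.read_parquet(file, engine="pyarrow")
```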