MISC. Comparing fastparquet writing speed to vaex writing speed.
Hi there,
I don’t mean to troll. I am testing out vaex, and I am now wondering how to write a parquet file while still benefiting from fastparquet features (like append='overwrite') with data coming from a vaex DataFrame.
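(As an aside, this is roughly what I mean by that feature; a minimal sketch, assuming a recent fastparquet version and if I read the docs correctly, where re-writing with append='overwrite' replaces only the partitions present in the new data. The path and values here are just made up for illustration:)

import pandas as pd
import fastparquet as fp

path = 'example_overwrite'  # hypothetical output directory
df1 = pd.DataFrame({'part': ['a', 'a', 'b'], 'val': [1, 2, 3]})
fp.write(path, df1, file_scheme='hive', partition_on=['part'])

# New data for partition 'b' only; partition 'a' should be left untouched.
df2 = pd.DataFrame({'part': ['b', 'b'], 'val': [30, 40]})
fp.write(path, df2, file_scheme='hive', partition_on=['part'],
         append='overwrite')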
I enclose here a comparison of writing times, using either the parquet writing provided by vaex (arrow under the hood) or fastparquet.
The comparison is not exactly apples to apples, but fastparquet appears to give a slight speedup in this case.
(I have used the same logic for directory creation and removal in both functions, so that equal time is spent on these tasks when using %timeit.)
I thought this comparison might be of interest to other people. I will close the ticket in a few days.
import vaex as vx
import pandas as pd
import fastparquet as fp
from os import path as os_path
import os
import shutil
# Output directories.
file_v = os_path.expanduser('~/Documents/code/data/vaex/test_v')
file_f = os_path.expanduser('~/Documents/code/data/vaex/test_f')

# Test data.
n_val = 600000
ts = pd.date_range(start='2021/01/01 08:00', periods=n_val, freq='1T')
df = pd.DataFrame({'val': range(n_val), 'vol': range(n_val, 2*n_val),
                   'timestamp': ts})
vdf = vx.from_pandas(df)

# vaex: export the DataFrame to one parquet file per chunk.
def write_v(vdf):
    # Remove any previous output so each timed run starts from scratch.
    try:
        shutil.rmtree(file_v)
    except FileNotFoundError:
        pass
    try:
        vdf.export_many(os_path.join(file_v, 'output_chunk-{i:06}.parquet'),
                        chunk_size=100_000)
    except FileNotFoundError:
        # Target directory does not exist yet: create it and retry.
        os.mkdir(file_v)
        vdf.export_many(os_path.join(file_v, 'output_chunk-{i:06}.parquet'),
                        chunk_size=100_000)

# fastparquet: append each pandas chunk as a row group of a hive dataset.
def write_f(vdf):
    # Remove any previous output so each timed run starts from scratch.
    try:
        shutil.rmtree(file_f)
    except FileNotFoundError:
        pass
    gen_df = vdf.to_pandas_df(chunk_size=100_000)
    for _, _, chunk in gen_df:
        try:
            fp.write(file_f, chunk, file_scheme='hive', append=True)
        except FileNotFoundError:
            # First chunk: the dataset does not exist yet, so create it.
            fp.write(file_f, chunk, file_scheme='hive', append=False)
# test
# 118 ms ± 3.96 ms per loop (mean ± std. dev. of 7 runs, 10 loops each)
%timeit write_v(vdf)
# 91.6 ms ± 5.13 ms per loop (mean ± std. dev. of 7 runs, 10 loops each)
%timeit write_f(vdf)
# read with fastparquet & compare
df_v = fp.ParquetFile(file_v).to_pandas()
df_f = fp.ParquetFile(file_f).to_pandas()
assert df_v.equals(df)
assert df_f.equals(df)
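For anyone who wants to reproduce the timings outside IPython, here is a rough equivalent of the %timeit cells using the standard timeit module (the repeat/number values are arbitrary choices on my side):

import timeit

# Best mean time per call over 7 repeats of 10 calls each,
# roughly mirroring what %timeit reports.
t_v = min(timeit.repeat(lambda: write_v(vdf), number=10, repeat=7)) / 10
t_f = min(timeit.repeat(lambda: write_f(vdf), number=10, repeat=7)) / 10
print(f"vaex export_many:  {t_v * 1000:.1f} ms per loop")
print(f"fastparquet write: {t_f * 1000:.1f} ms per loop")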
Best,
It is totally fine to make comparisons, but benchmarks are hard, which may be why I consistently see fastparquet outperform everything else 😃
@martindurant I am sorry, my previous post was inappropriate here and I have removed it. I appreciate your offer to help, Martin. It is possible I am making a mistake, but I would like to investigate further the approach I am currently pursuing.