
MISC. Comparing fastparquet writing speed to vaex writing speed.

See original GitHub issue

Hi there, I don’t mean to troll. I am testing out vaex, and I am now wondering how to write a parquet file from a vaex DataFrame while still benefiting from fastparquet features (like append='overwrite')?

I enclose here a comparison of writing times, either using the parquet writing provided by vaex (arrow under the hood) or using fastparquet. The comparison is not exactly apples-to-apples, but in this case fastparquet appears to show a slight speedup. (I used the same logic for directory creation and removal in both functions, so that equal time is spent on these tasks when using %timeit.)

I thought this comparison could be of interest to other people. I will close the ticket in a few days.
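As a side note, the shared setup step in both timed functions below (wipe the previous output directory, then recreate it) can be sketched with the standard library alone. This is a minimal sketch with a hypothetical stand-in path; `rmtree(ignore_errors=True)` plus `makedirs(exist_ok=True)` avoids the try/except dance entirely:

```python
import os
import shutil
import tempfile

# hypothetical stand-in for the output directories used below
out_dir = os.path.join(tempfile.gettempdir(), 'parquet_bench_demo')

# wipe any previous run and recreate the directory in one pass,
# so every timed call pays the same setup cost
shutil.rmtree(out_dir, ignore_errors=True)
os.makedirs(out_dir, exist_ok=True)

print(os.path.isdir(out_dir))
```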

import vaex as vx
import pandas as pd
import fastparquet as fp
from os import path as os_path
import os
import shutil

# Writing data.
file_v=os_path.expanduser('~/Documents/code/data/vaex/test_v')
file_f=os_path.expanduser('~/Documents/code/data/vaex/test_f')

# test data
n_val=600000
ts = pd.date_range(start='2021/01/01 08:00', periods=n_val, freq='1T')
df = pd.DataFrame({'val': range(n_val), 'vol': range(n_val, 2*n_val),
                   'timestamp': ts})
vdf = vx.from_pandas(df)

# vaex
def write_v(vdf):
    # remove any previous output directory
    try:
        shutil.rmtree(file_v)
    except FileNotFoundError:
        pass
    try:
        vdf.export_many(os_path.join(file_v, 'output_chunk-{i:06}.parquet'),
                        chunk_size=100_000)
    except FileNotFoundError:
        # target directory did not exist yet: create it and retry
        os.mkdir(file_v)
        vdf.export_many(os_path.join(file_v, 'output_chunk-{i:06}.parquet'),
                        chunk_size=100_000)


# fastparquet
def write_f(vdf):
    # remove any previous output directory
    try:
        shutil.rmtree(file_f)
    except FileNotFoundError:
        pass
    # vaex yields (i1, i2, chunk) tuples; only the chunk is needed here
    gen_df = vdf.to_pandas_df(chunk_size=100_000)
    for _, _, chunk in gen_df:
        try:
            fp.write(file_f, chunk, file_scheme='hive', append=True)
        except FileNotFoundError:
            # first chunk: dataset does not exist yet, create it
            fp.write(file_f, chunk, file_scheme='hive', append=False)

# test
# 118 ms ± 3.96 ms per loop (mean ± std. dev. of 7 runs, 10 loops each)
%timeit write_v(vdf)
# 91.6 ms ± 5.13 ms per loop (mean ± std. dev. of 7 runs, 10 loops each)
%timeit write_f(vdf)

# read with fastparquet & compare
df_v = fp.ParquetFile(file_v).to_pandas()
df_f = fp.ParquetFile(file_f).to_pandas()
assert df_v.equals(df)
assert df_f.equals(df)
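For anyone re-running this outside IPython, %timeit can be approximated with the standard-library timeit module. A minimal sketch; the lambda below is a hypothetical cheap workload standing in for write_v(vdf) or write_f(vdf):

```python
import timeit

def bench(func, repeat=7, number=10):
    # mirrors %timeit's report: mean ± std dev over `repeat` runs,
    # each run timing `number` calls of `func`
    runs = timeit.repeat(func, repeat=repeat, number=number)
    per_loop = [t / number * 1e3 for t in runs]  # ms per call
    mean = sum(per_loop) / len(per_loop)
    std = (sum((x - mean) ** 2 for x in per_loop) / len(per_loop)) ** 0.5
    return mean, std

# hypothetical workload; replace with e.g. lambda: write_v(vdf)
mean_ms, std_ms = bench(lambda: sum(range(10_000)))
print(f"{mean_ms:.3f} ms ± {std_ms:.3f} ms per loop")
```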

Best regards

Issue Analytics

  • State: closed
  • Created 2 years ago
  • Comments: 6 (4 by maintainers)

Top GitHub Comments

1 reaction
martindurant commented, Aug 9, 2021

It is totally fine to make comparisons, but benchmarks are hard, which may be why I consistently see fastparquet outperform everything else 😃

0 reactions
yohplala commented, Aug 4, 2021

@martindurant I am sorry, my previous post was inappropriate here and I removed it. I appreciate your offer to help, Martin. It is possible I am making a mistake, but I wish to investigate further the approach I am currently pursuing.
