
MISC. Comparing fastparquet writing speed to vaex writing speed.

See original GitHub issue

Hi there, I don’t mean to troll. I am testing out vaex, and I am now wondering how to write a parquet file from a vaex DataFrame while still benefiting from fastparquet features (like append='overwrite')?

I enclose here a comparison of writing times, either using the parquet writing provided by vaex (arrow under the hood) or using fastparquet. The comparison is not exactly apples-to-apples, but in this case fastparquet appears to show a slight speedup. (I used the same logic for directory creation and removal in both functions, so that equal time is spent on these tasks when using %timeit.)

I thought this comparison could be of interest to other people. I will close the ticket in a few days.
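As a side note, the shared setup step in both timed functions below (wipe the previous output directory, then recreate it) can be sketched with the standard library alone. This is a minimal sketch with a hypothetical stand-in path; `rmtree(ignore_errors=True)` plus `makedirs(exist_ok=True)` avoids the try/except dance entirely:

```python
import os
import shutil
import tempfile

# hypothetical stand-in for the output directories used below
out_dir = os.path.join(tempfile.gettempdir(), 'parquet_bench_demo')

# wipe any previous run and recreate the directory in one pass,
# so every timed call pays the same setup cost
shutil.rmtree(out_dir, ignore_errors=True)
os.makedirs(out_dir, exist_ok=True)

print(os.path.isdir(out_dir))
```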

import vaex as vx
import pandas as pd
import fastparquet as fp
from os import path as os_path
import os
import shutil

# Writing data.
file_v=os_path.expanduser('~/Documents/code/data/vaex/test_v')
file_f=os_path.expanduser('~/Documents/code/data/vaex/test_f')

# test data
n_val=600000
ts = pd.date_range(start='2021/01/01 08:00', periods=n_val, freq='1T')
df = pd.DataFrame({'val': range(n_val), 'vol': range(n_val, 2*n_val),
                   'timestamp': ts})
vdf = vx.from_pandas(df)

# vaex
def write_v(vdf):
    # remove any previous output directory
    try:
        shutil.rmtree(file_v)
    except FileNotFoundError:
        pass
    try:
        vdf.export_many(os_path.join(file_v, 'output_chunk-{i:06}.parquet'),
                        chunk_size=100_000)
    except FileNotFoundError:
        # target directory did not exist yet: create it and retry
        os.mkdir(file_v)
        vdf.export_many(os_path.join(file_v, 'output_chunk-{i:06}.parquet'),
                        chunk_size=100_000)


# fastparquet
def write_f(vdf):
    # remove any previous output directory
    try:
        shutil.rmtree(file_f)
    except FileNotFoundError:
        pass
    # vaex yields (i1, i2, chunk) tuples; only the chunk is needed here
    gen_df = vdf.to_pandas_df(chunk_size=100_000)
    for _, _, chunk in gen_df:
        try:
            fp.write(file_f, chunk, file_scheme='hive', append=True)
        except FileNotFoundError:
            # first chunk: dataset does not exist yet, create it
            fp.write(file_f, chunk, file_scheme='hive', append=False)

# test
# 118 ms ± 3.96 ms per loop (mean ± std. dev. of 7 runs, 10 loops each)
%timeit write_v(vdf)
# 91.6 ms ± 5.13 ms per loop (mean ± std. dev. of 7 runs, 10 loops each)
%timeit write_f(vdf)

# read with fastparquet & compare
df_v = fp.ParquetFile(file_v).to_pandas()
df_f = fp.ParquetFile(file_f).to_pandas()
assert df_v.equals(df)
assert df_f.equals(df)
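For anyone re-running this outside IPython, %timeit can be approximated with the standard-library timeit module. A minimal sketch; the lambda below is a hypothetical cheap workload standing in for write_v(vdf) or write_f(vdf):

```python
import timeit

def bench(func, repeat=7, number=10):
    # mirrors %timeit's report: mean ± std dev over `repeat` runs,
    # each run timing `number` calls of `func`
    runs = timeit.repeat(func, repeat=repeat, number=number)
    per_loop = [t / number * 1e3 for t in runs]  # ms per call
    mean = sum(per_loop) / len(per_loop)
    std = (sum((x - mean) ** 2 for x in per_loop) / len(per_loop)) ** 0.5
    return mean, std

# hypothetical workload; replace with e.g. lambda: write_v(vdf)
mean_ms, std_ms = bench(lambda: sum(range(10_000)))
print(f"{mean_ms:.3f} ms ± {std_ms:.3f} ms per loop")
```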

Best regards

Issue Analytics

  • State: closed
  • Created 2 years ago
  • Comments: 6 (4 by maintainers)

Top GitHub Comments

1 reaction
martindurant commented, Aug 9, 2021

It is totally fine to make comparisons, but benchmarks are hard, which may be why I consistently see fastparquet outperform everything else 😃

0 reactions
yohplala commented, Aug 4, 2021

@martindurant I am sorry, my previous post was inappropriate here and I removed it. I appreciate your offer to help, Martin. It is possible I am making a mistake, but I wish to investigate further the approach I am currently pursuing.
