Joblib pickling performance for pandas DataFrame
Related to https://github.com/joblib/joblib/issues/467.
Below are a few fairly limited benchmarks that illustrate the serialization / deserialization performance of pandas.DataFrame with joblib:
import pickle

import pandas as pd
from sklearn.externals import joblib  # scikit-learn's vendored joblib (0.11 here); on recent setups, plain `import joblib`

df = pd.read_csv('http://www.gagolewski.com/resources/data/titanic3.csv',
                 comment='#')
print(df.shape)

# writing to tmpfs to ignore disk I/O cost
%timeit joblib.dump(df, '/dev/shm/df_2.pkl')
%timeit df.to_pickle('/dev/shm/df_1.pkl')
%timeit with open('/dev/shm/df_3.pkl', 'wb') as fh: pickle.dump(df, fh)
produces,
3.17 ms ± 275 µs per loop (mean ± std. dev. of 7 runs, 100 loops each)    # joblib.dump
1.22 ms ± 48.3 µs per loop (mean ± std. dev. of 7 runs, 1000 loops each)  # df.to_pickle
855 µs ± 36.8 µs per loop (mean ± std. dev. of 7 runs, 1000 loops each)   # pickle.dump
While for read access,
%timeit joblib.load('/dev/shm/df_2.pkl')
%timeit pd.read_pickle('/dev/shm/df_1.pkl')
%timeit with open('/dev/shm/df_3.pkl', 'rb') as fh: pickle.load(fh)
produces,
3.47 ms ± 344 µs per loop (mean ± std. dev. of 7 runs, 100 loops each)    # joblib.load
1.4 ms ± 38 µs per loop (mean ± std. dev. of 7 runs, 1000 loops each)     # pd.read_pickle
1.11 ms ± 9.04 µs per loop (mean ± std. dev. of 7 runs, 1000 loops each)  # pickle.load
Run on Linux with Python 3.6 and joblib 0.11. I can run more careful / extensive benchmarks if needed.
So in this particular case, joblib serialization is roughly 3x slower than plain pickle, and about 2.5x slower than pandas.to_pickle. Since pandas has a pickle serialization function, I’m wondering how hard it would be to use it in joblib.
This is particularly relevant when serializing complex Python objects (e.g. scikit-learn pipelines) that can contain numpy arrays and DataFrames. cc @lesteve @jorisvandenbossche
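For flat objects, one hypothetical workaround is to special-case DataFrames before falling back to joblib. The dump_any / load_any helpers below are illustrative names, not joblib API; this is a minimal sketch of the idea under those assumptions, not a proposed implementation:

import joblib
import pandas as pd

def dump_any(obj, path):
    # Hypothetical helper: route DataFrames through pandas' own pickle
    # path (faster in the benchmark above), everything else through joblib.
    if isinstance(obj, pd.DataFrame):
        obj.to_pickle(path)
    else:
        joblib.dump(obj, path)

def load_any(path, is_dataframe):
    # The caller must remember which writer was used; this sketch embeds
    # no format marker in the file.
    return pd.read_pickle(path) if is_dataframe else joblib.load(path)

This does nothing for DataFrames nested inside other objects, which is exactly where hooking into joblib's own pickler would pay off.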
Top GitHub Comments
Writing pandas/numpy arrays to parquet sounds like an extremely useful piece of functionality. Any chance of cleaning this up and getting it into joblib proper?
Why not just use parquet or, better yet, pyarrow record batches? I can add this in pretty quickly if someone points to where it should go. I think I have a custom backend somewhere that does this, but it really should be the default.
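For reference, a minimal Arrow IPC (record batch) round trip along the lines the commenter suggests; it assumes pyarrow is installed, and the tmpfs path and sample DataFrame mirror the benchmark above:

import pandas as pd
import pyarrow as pa

df = pd.DataFrame({'a': range(1000), 'b': ['x'] * 1000})

# Write: convert to an Arrow table and store it in the record-batch file format.
table = pa.Table.from_pandas(df)
with pa.OSFile('/dev/shm/df.arrow', 'wb') as sink:
    with pa.ipc.new_file(sink, table.schema) as writer:
        writer.write_table(table)

# Read: memory-map the file and rebuild the DataFrame
# (zero-copy where the column types allow it).
with pa.memory_map('/dev/shm/df.arrow', 'r') as source:
    df_back = pa.ipc.open_file(source).read_all().to_pandas()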
I tend to inspect the outputs of functions and allow for depth-0 or depth-1 nesting, so a returned pd.DataFrame is detected, as is a list of DataFrames, a dict of DataFrames, etc.
If this dict_of_things serialization were in joblib, I would probably start using it for most small projects.
Would like to get some positive feedback from the joblib comptrollers before doing any work though.
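A rough sketch of the depth-0 / depth-1 dispatch described above; dump_depth1 and the file-naming scheme are hypothetical, not joblib API:

import joblib
import pandas as pd

def dump_depth1(obj, prefix):
    # Depth 0: a bare DataFrame takes the DataFrame-specific fast path.
    if isinstance(obj, pd.DataFrame):
        obj.to_pickle(f'{prefix}.df.pkl')
    # Depth 1: a dict whose values are all DataFrames is unpacked so that
    # each frame can take the fast path too.
    elif isinstance(obj, dict) and obj and all(
            isinstance(v, pd.DataFrame) for v in obj.values()):
        for key, frame in obj.items():
            frame.to_pickle(f'{prefix}.{key}.df.pkl')
    # Anything else falls back to generic joblib pickling.
    else:
        joblib.dump(obj, f'{prefix}.joblib')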
And as of 2020, pyarrow record batches are the fastest way to deserialize pandas DataFrames.