
Joblib pickling performance for pandas DataFrame

See original GitHub issue

Related to https://github.com/joblib/joblib/issues/467

Below are a few fairly limited benchmarks that illustrate the serialization / deserialization performance of pandas.DataFrame with joblib,

import pandas as pd
from sklearn.externals import joblib
import pickle

df = pd.read_csv('http://www.gagolewski.com/resources/data/titanic3.csv',
                 comment='#')
print(df.shape)

# writing to tmpfs to ignore disk I/O cost
%timeit joblib.dump(df, '/dev/shm/df_2.pkl')
%timeit df.to_pickle('/dev/shm/df_1.pkl')
%timeit with open('/dev/shm/df_3.pkl', 'wb') as fh: pickle.dump(df, fh)

produces,

3.17 ms ± 275 µs per loop (mean ± std. dev. of 7 runs, 100 loops each)
1.22 ms ± 48.3 µs per loop (mean ± std. dev. of 7 runs, 1000 loops each)
855 µs ± 36.8 µs per loop (mean ± std. dev. of 7 runs, 1000 loops each)

While for read access,

%timeit joblib.load('/dev/shm/df_2.pkl')
%timeit pd.read_pickle('/dev/shm/df_1.pkl')
%timeit with open('/dev/shm/df_3.pkl', 'rb') as fh: pickle.load(fh)

produces,

3.47 ms ± 344 µs per loop (mean ± std. dev. of 7 runs, 100 loops each)
1.4 ms ± 38 µs per loop (mean ± std. dev. of 7 runs, 1000 loops each)
1.11 ms ± 9.04 µs per loop (mean ± std. dev. of 7 runs, 1000 loops each)

Run on Linux with Python 3.6 and joblib 0.11. I can run more careful / extensive benchmarks if needed.
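For anyone without IPython at hand, roughly the same comparison can be written with the standard-library timeit module. The sketch below substitutes a small synthetic DataFrame for the Titanic CSV so it runs offline; absolute numbers will therefore differ from those above, and joblib is left out so the snippet only needs pandas.

```python
import os
import pickle
import tempfile
import timeit

import pandas as pd

# Small synthetic stand-in for the Titanic CSV used above, so the
# snippet runs offline; absolute timings will differ from the issue.
df = pd.DataFrame({
    'a': range(10_000),
    'b': [float(i) for i in range(10_000)],
    'c': ['row-%d' % i for i in range(10_000)],
})

tmpdir = tempfile.mkdtemp()
p1 = os.path.join(tmpdir, 'df_pickle.pkl')
p2 = os.path.join(tmpdir, 'df_to_pickle.pkl')

def plain_pickle_dump():
    with open(p1, 'wb') as fh:
        pickle.dump(df, fh)

def pandas_to_pickle():
    df.to_pickle(p2)

for name, fn in [('pickle.dump', plain_pickle_dump),
                 ('DataFrame.to_pickle', pandas_to_pickle)]:
    elapsed = timeit.timeit(fn, number=100)
    print('%-20s %.3f ms per dump' % (name, elapsed / 100 * 1e3))
```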

So in this particular case, joblib serialization appears to be almost 3x slower. Since pandas has a pickle serialization function of its own, I’m wondering how hard it would be to use it in joblib.

This is particularly relevant when serializing complex Python objects (e.g. scikit-learn pipelines) that can contain both numpy arrays and DataFrames. cc @lesteve @jorisvandenbossche
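One way to picture the suggestion is a small dump/load wrapper that dispatches DataFrames to pandas' own serializer and falls back to the generic pickler for everything else. This is a hypothetical sketch, not joblib API: the function names are invented, and plain pickle stands in for joblib's pickler.

```python
import os
import pickle
import tempfile

import pandas as pd

def smart_dump(obj, path):
    """Hypothetical dispatcher (not joblib API): route DataFrames to
    pandas' own serializer, everything else to the generic pickler."""
    if isinstance(obj, pd.DataFrame):
        obj.to_pickle(path)
    else:
        with open(path, 'wb') as fh:
            pickle.dump(obj, fh)

def smart_load(path):
    # read_pickle is documented to load "any object", so one reader
    # covers both branches above.
    return pd.read_pickle(path)

path = os.path.join(tempfile.mkdtemp(), 'obj.pkl')
smart_dump(pd.DataFrame({'x': [1, 2, 3]}), path)
print(smart_load(path))
```

A real joblib integration would presumably hook into its existing pickler subclass rather than branch at the top level like this, but the dispatch idea is the same.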

Issue Analytics

  • State: open
  • Created: 6 years ago
  • Reactions: 2
  • Comments: 11 (6 by maintainers)

Top GitHub Comments

5 reactions
RoyalTS commented, Nov 17, 2020

Writing pandas/numpy arrays to parquet sounds like an extremely useful piece of functionality. Any chance of cleaning this up and getting it into joblib proper?

2 reactions
cottrell commented, Apr 3, 2020

Why not just use parquet or better yet pyarrow record batch? I can add this in pretty quickly if someone points to where this should go. I think I have a custom backend somewhere that does this but it really should be the default.

I tend to inspect outputs of functions and allow for depth-0 or depth-1. So a returned pd.DataFrame is detected, as is a returned list of dataframes or dict of dataframes, etc.

If this dict_of_things serialization was in joblib I would probably start using it for most small projects.

Would like to get some positive feedback from the joblib comptrollers before doing any work though.

And as of 2020, pyarrow record batches are the fastest deserialization option for pandas dataframes.
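The depth-0/depth-1 detection described above can be sketched as follows. All names here are hypothetical (nothing in joblib), and per-frame pickling stands in for the parquet / record-batch backend, which would require pyarrow:

```python
import os
import pickle

import pandas as pd

def classify(obj):
    """Detect DataFrame results at depth 0 or depth 1."""
    if isinstance(obj, pd.DataFrame):
        return 'frame'
    if isinstance(obj, dict) and obj and all(
            isinstance(v, pd.DataFrame) for v in obj.values()):
        return 'dict_of_frames'
    if isinstance(obj, list) and obj and all(
            isinstance(v, pd.DataFrame) for v in obj):
        return 'list_of_frames'
    return 'other'

def dump_result(obj, dirname):
    """Write each detected frame to its own file; a parquet or
    record-batch writer would slot in where to_pickle is called."""
    os.makedirs(dirname, exist_ok=True)
    kind = classify(obj)
    with open(os.path.join(dirname, '_kind'), 'w') as fh:
        fh.write(kind)
    if kind == 'frame':
        obj.to_pickle(os.path.join(dirname, 'frame.pkl'))
    elif kind == 'dict_of_frames':
        for key, frame in obj.items():
            frame.to_pickle(os.path.join(dirname, '%s.pkl' % key))
    elif kind == 'list_of_frames':
        for i, frame in enumerate(obj):
            frame.to_pickle(os.path.join(dirname, '%d.pkl' % i))
    else:
        with open(os.path.join(dirname, 'other.pkl'), 'wb') as fh:
            pickle.dump(obj, fh)
```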
