
Joblib pickling performance for pandas DataFrame

See original GitHub issue

Related to https://github.com/joblib/joblib/issues/467

Below are a few fairly limited benchmarks that illustrate the serialization / deserialization performance of pandas.DataFrame with joblib,

import pandas as pd
from sklearn.externals import joblib
import pickle

df = pd.read_csv('http://www.gagolewski.com/resources/data/titanic3.csv',
                 comment='#')
print(df.shape)

# writing to tmpfs to ignore disk I/O cost
%timeit joblib.dump(df, '/dev/shm/df_2.pkl')
%timeit df.to_pickle('/dev/shm/df_1.pkl')
%timeit with open('/dev/shm/df_3.pkl', 'wb') as fh: pickle.dump(df, fh)

produces,

3.17 ms ± 275 µs per loop (mean ± std. dev. of 7 runs, 100 loops each)
1.22 ms ± 48.3 µs per loop (mean ± std. dev. of 7 runs, 1000 loops each)
855 µs ± 36.8 µs per loop (mean ± std. dev. of 7 runs, 1000 loops each)

While for read access,

%timeit joblib.load('/dev/shm/df_2.pkl')
%timeit pd.read_pickle('/dev/shm/df_1.pkl')
%timeit with open('/dev/shm/df_3.pkl', 'rb') as fh: pickle.load(fh)

produces,

3.47 ms ± 344 µs per loop (mean ± std. dev. of 7 runs, 100 loops each)
1.4 ms ± 38 µs per loop (mean ± std. dev. of 7 runs, 1000 loops each)
1.11 ms ± 9.04 µs per loop (mean ± std. dev. of 7 runs, 1000 loops each)

Run on Linux with Python 3.6 and joblib 0.11. I can run more careful / extensive benchmarks if needed.
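For anyone without IPython at hand, roughly the same comparison can be written with the standard-library timeit module. The sketch below substitutes a small synthetic DataFrame for the Titanic CSV so it runs offline; absolute numbers will therefore differ from those above, and joblib is left out so the snippet only needs pandas.

```python
import os
import pickle
import tempfile
import timeit

import pandas as pd

# Small synthetic stand-in for the Titanic CSV used above, so the
# snippet runs offline; absolute timings will differ from the issue.
df = pd.DataFrame({
    'a': range(10_000),
    'b': [float(i) for i in range(10_000)],
    'c': ['row-%d' % i for i in range(10_000)],
})

tmpdir = tempfile.mkdtemp()
p1 = os.path.join(tmpdir, 'df_pickle.pkl')
p2 = os.path.join(tmpdir, 'df_to_pickle.pkl')

def plain_pickle_dump():
    with open(p1, 'wb') as fh:
        pickle.dump(df, fh)

def pandas_to_pickle():
    df.to_pickle(p2)

for name, fn in [('pickle.dump', plain_pickle_dump),
                 ('DataFrame.to_pickle', pandas_to_pickle)]:
    elapsed = timeit.timeit(fn, number=100)
    print('%-20s %.3f ms per dump' % (name, elapsed / 100 * 1e3))
```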

So in this particular case, joblib serialization appears to be almost 3x slower. Since pandas has a pickle serialization function of its own, I’m wondering how hard it would be to use it in joblib.

This is particularly relevant when serializing complex Python objects (e.g. scikit-learn pipelines) that can contain both numpy arrays and DataFrames. cc @lesteve @jorisvandenbossche
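One way to picture the suggestion is a small dump/load wrapper that dispatches DataFrames to pandas' own serializer and falls back to the generic pickler for everything else. This is a hypothetical sketch, not joblib API: the function names are invented, and plain pickle stands in for joblib's pickler.

```python
import os
import pickle
import tempfile

import pandas as pd

def smart_dump(obj, path):
    """Hypothetical dispatcher (not joblib API): route DataFrames to
    pandas' own serializer, everything else to the generic pickler."""
    if isinstance(obj, pd.DataFrame):
        obj.to_pickle(path)
    else:
        with open(path, 'wb') as fh:
            pickle.dump(obj, fh)

def smart_load(path):
    # read_pickle is documented to load "any object", so one reader
    # covers both branches above.
    return pd.read_pickle(path)

path = os.path.join(tempfile.mkdtemp(), 'obj.pkl')
smart_dump(pd.DataFrame({'x': [1, 2, 3]}), path)
print(smart_load(path))
```

A real joblib integration would presumably hook into its existing pickler subclass rather than branch at the top level like this, but the dispatch idea is the same.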

Issue Analytics

  • State: open
  • Created: 6 years ago
  • Reactions: 2
  • Comments: 11 (6 by maintainers)

Top GitHub Comments

5 reactions
RoyalTS commented, Nov 17, 2020

Writing pandas/numpy arrays to parquet sounds like an extremely useful piece of functionality. Any chance of cleaning this up and getting it into joblib proper?

2 reactions
cottrell commented, Apr 3, 2020

Why not just use parquet or better yet pyarrow record batch? I can add this in pretty quickly if someone points to where this should go. I think I have a custom backend somewhere that does this but it really should be the default.

I tend to inspect outputs of functions and allow for depth-0 or depth-1. So a returned pd.DataFrame is detected, as is a returned list of dataframes or dict of dataframes, etc.

If this dict_of_things serialization was in joblib I would probably start using it for most small projects.

Would like to get some positive feedback from the joblib comptrollers before doing any work though.

And as of 2020, pyarrow record batches are the fastest deserialization option for pandas dataframes.
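The depth-0/depth-1 detection described above can be sketched as follows. All names here are hypothetical (nothing in joblib), and per-frame pickling stands in for the parquet / record-batch backend, which would require pyarrow:

```python
import os
import pickle

import pandas as pd

def classify(obj):
    """Detect DataFrame results at depth 0 or depth 1."""
    if isinstance(obj, pd.DataFrame):
        return 'frame'
    if isinstance(obj, dict) and obj and all(
            isinstance(v, pd.DataFrame) for v in obj.values()):
        return 'dict_of_frames'
    if isinstance(obj, list) and obj and all(
            isinstance(v, pd.DataFrame) for v in obj):
        return 'list_of_frames'
    return 'other'

def dump_result(obj, dirname):
    """Write each detected frame to its own file; a parquet or
    record-batch writer would slot in where to_pickle is called."""
    os.makedirs(dirname, exist_ok=True)
    kind = classify(obj)
    with open(os.path.join(dirname, '_kind'), 'w') as fh:
        fh.write(kind)
    if kind == 'frame':
        obj.to_pickle(os.path.join(dirname, 'frame.pkl'))
    elif kind == 'dict_of_frames':
        for key, frame in obj.items():
            frame.to_pickle(os.path.join(dirname, '%s.pkl' % key))
    elif kind == 'list_of_frames':
        for i, frame in enumerate(obj):
            frame.to_pickle(os.path.join(dirname, '%d.pkl' % i))
    else:
        with open(os.path.join(dirname, 'other.pkl'), 'wb') as fh:
            pickle.dump(obj, fh)
```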
