Stuck on an issue?

Lightrun Answers was designed to reduce the constant googling that comes with debugging 3rd party libraries. It collects links to all the places you might be looking at while hunting down a tough bug.

And, if you’re still stuck at the end, we’re happy to hop on a call to see how we can help out.

Pickling Tokenizers fails due to use of lambdas

See original GitHub issue

Description

Cannot pickle a CountVectorizer using the builtin python pickle module, likely due to the use of lambdas in https://github.com/scikit-learn/scikit-learn/blob/master/sklearn/feature_extraction/text.py

Steps/Code to Reproduce

Example:

import pickle
from sklearn.feature_extraction.text import CountVectorizer
raw_texts = ["this is a text", "oh look, here's another", "including my full model vocab is...well, a lot"]
vectorizer = CountVectorizer(max_features=20000, token_pattern=r"\b\w+\b")
vectorizer.fit(raw_texts)
tokenizer = vectorizer.build_tokenizer()
output_file = 'foo.pkl'
with open(output_file, 'wb') as out:
    pickle.dump(tokenizer, out)
with open(output_file, 'rb') as infile:
    pickle.load(infile)

Expected Results

Program runs without error

Actual Results

Traceback:

Traceback (most recent call last):
  File "tst.py", line 14, in <module>
    pickle.dump(tokenizer, out)
AttributeError: Can't pickle local object 'VectorizerMixin.build_tokenizer.<locals>.<lambda>'

Workaround:

Instead of the builtin pickle, use cloudpickle, which can capture the lambda expression.

Versions

Version information:

>>> import sklearn
>>> print(sklearn.show_versions())
/home/jay/Documents/projects/evidence-inference/venv/lib/python3.6/site-packages/numpy/distutils/system_info.py:625: UserWarning:
    Atlas (http://math-atlas.sourceforge.net/) libraries not found.
    Directories to search for the libraries can be specified in the
    numpy/distutils/site.cfg file (section [atlas]) or by setting
    the ATLAS environment variable.
  self.calc_info()
/usr/bin/ld: cannot find -lcblas
collect2: error: ld returned 1 exit status
/usr/bin/ld: cannot find -lcblas
collect2: error: ld returned 1 exit status

System:
    python: 3.6.5 (default, Apr  1 2018, 05:46:30)  [GCC 7.3.0]
executable: /home/jay/Documents/projects/evidence-inference/venv/bin/python
   machine: Linux-4.15.0-39-generic-x86_64-with-Ubuntu-18.04-bionic

BLAS:
    macros: NO_ATLAS_INFO=1, HAVE_CBLAS=None
  lib_dirs: /usr/lib/x86_64-linux-gnu
cblas_libs: cblas

Python deps:
       pip: 18.1
setuptools: 39.1.0
   sklearn: 0.20.2
     numpy: 1.15.1
     scipy: 1.1.0
    Cython: None
    pandas: 0.23.4
None

Similar Issues

I think this is similar to issues:

https://github.com/scikit-learn/scikit-learn/issues/10807
https://github.com/scikit-learn/scikit-learn/issues/9467 (looking at the stackoverflow thread at https://stackoverflow.com/questions/25348532/can-python-pickle-lambda-functions/25353243#25353243 , it suggests using dill which also seems to work for the toy example)

Proposed fix

Naively, I would make one of the two changes below, but I am not familiar with the scikit-learn codebase, so they might not be appropriate:

Update the FAQ to direct people to other serialization libraries (perhaps I missed this recommendation?), e.g. cloudpickle at https://github.com/cloudpipe/cloudpickle or dill
Remove the use of the lambdas in the vectorizer and replace them with locally def’d functions. I suspect that this solution is flawed because it doesn’t account for other uses of lambdas elsewhere in the codebase, and the only complete solution would be to stop using lambdas, but these are a useful language feature.