Pickling Tokenizers fails due to use of lambdas
Description
Cannot pickle a CountVectorizer using the built-in Python pickle module, likely due to the use of lambdas in https://github.com/scikit-learn/scikit-learn/blob/master/sklearn/feature_extraction/text.py
Steps/Code to Reproduce
Example:
import pickle
from sklearn.feature_extraction.text import CountVectorizer
raw_texts = ["this is a text", "oh look, here's another", "including my full model vocab is...well, a lot"]
vectorizer = CountVectorizer(max_features=20000, token_pattern=r"\b\w+\b")
vectorizer.fit(raw_texts)
tokenizer = vectorizer.build_tokenizer()
output_file = 'foo.pkl'
with open(output_file, 'wb') as out:
    pickle.dump(tokenizer, out)
with open(output_file, 'rb') as infile:
    pickle.load(infile)
Expected Results
Program runs without error
Actual Results
Traceback:
Traceback (most recent call last):
  File "tst.py", line 14, in <module>
    pickle.dump(tokenizer, out)
AttributeError: Can't pickle local object 'VectorizerMixin.build_tokenizer.<locals>.<lambda>'
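The same failure can be reproduced without scikit-learn, which suggests the cause is how pickle handles functions rather than anything specific to the vectorizer. The sketch below is only an illustration: `build_tokenizer` here is a hypothetical stand-in that mimics the shape of the failing code path, not the library's implementation. pickle stores functions by their importable name, and a lambda created inside another function has none.

```python
import pickle
import re


def build_tokenizer(token_pattern=r"\b\w+\b"):
    """Hypothetical stand-in, not scikit-learn's actual code."""
    compiled = re.compile(token_pattern)
    # The lambda is created inside a function, so pickle cannot look it up by name.
    return lambda doc: compiled.findall(doc)


tokenizer = build_tokenizer()
tokens = tokenizer("oh look, here's another")  # calling the tokenizer works fine

try:
    pickle.dumps(tokenizer)
    pickle_error = None
except (AttributeError, pickle.PicklingError) as exc:
    # The exception type depends on the Python version; both are caught here.
    pickle_error = exc

print(type(pickle_error).__name__, pickle_error)
```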
Workaround:
Instead of the builtin pickle, use cloudpickle, which can capture the lambda expression.
Versions
Version information:
>>> import sklearn
>>> print(sklearn.show_versions())
/home/jay/Documents/projects/evidence-inference/venv/lib/python3.6/site-packages/numpy/distutils/system_info.py:625: UserWarning:
Atlas (http://math-atlas.sourceforge.net/) libraries not found.
Directories to search for the libraries can be specified in the
numpy/distutils/site.cfg file (section [atlas]) or by setting
the ATLAS environment variable.
self.calc_info()
/usr/bin/ld: cannot find -lcblas
collect2: error: ld returned 1 exit status
/usr/bin/ld: cannot find -lcblas
collect2: error: ld returned 1 exit status
System:
python: 3.6.5 (default, Apr 1 2018, 05:46:30) [GCC 7.3.0]
executable: /home/jay/Documents/projects/evidence-inference/venv/bin/python
machine: Linux-4.15.0-39-generic-x86_64-with-Ubuntu-18.04-bionic
BLAS:
macros: NO_ATLAS_INFO=1, HAVE_CBLAS=None
lib_dirs: /usr/lib/x86_64-linux-gnu
cblas_libs: cblas
Python deps:
pip: 18.1
setuptools: 39.1.0
sklearn: 0.20.2
numpy: 1.15.1
scipy: 1.1.0
Cython: None
pandas: 0.23.4
None
Similar Issues
I think this is similar to issues:
- https://github.com/scikit-learn/scikit-learn/issues/10807
- https://github.com/scikit-learn/scikit-learn/issues/9467 (looking at the stackoverflow thread at https://stackoverflow.com/questions/25348532/can-python-pickle-lambda-functions/25353243#25353243 , it suggests using dill, which also seems to work for the toy example)
Proposed fix
Naively, I would make one of the two changes below, but I am not familiar with the scikit-learn codebase, so they might not be appropriate:
- Update the FAQ to direct people to other serialization libraries (perhaps I missed this recommendation?), e.g. cloudpickle at https://github.com/cloudpipe/cloudpickle or dill
- Remove the use of the lambdas in the vectorizer and replace them with locally def’d functions. I suspect that this solution is flawed because it doesn’t account for other uses of lambdas elsewhere in the codebase, and the only complete solution would be to stop using lambdas, but these are a useful language feature.
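One caveat on the second option: a function defined with `def` inside another function fails to pickle for the same reason as a lambda, so the replacement would have to live at module level. A minimal sketch of that idea follows, assuming a module-level helper bound to the compiled pattern with `functools.partial`. The names `_tokenize` and `build_tokenizer` are hypothetical and this is not presented as the change made in scikit-learn.

```python
import pickle
import re
from functools import partial


def _tokenize(compiled_pattern, doc):
    # Module-level function: pickle can store it by its importable name.
    return compiled_pattern.findall(doc)


def build_tokenizer(token_pattern=r"\b\w+\b"):
    """Hypothetical replacement that returns a picklable callable."""
    # partial objects pickle as long as the wrapped function and arguments do;
    # compiled regular expressions are picklable.
    return partial(_tokenize, re.compile(token_pattern))


tokenizer = build_tokenizer()
restored = pickle.loads(pickle.dumps(tokenizer))
print(restored("this is a text"))  # ['this', 'is', 'a', 'text']
```

The round trip works with the built-in pickle module alone, so no extra serialization library is needed for this particular callable.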
Issue Analytics
- State:
- Created 5 years ago
- Comments:5 (3 by maintainers)
Top GitHub Comments
+1 particularly that some of those are assigned to a named variable, which is not PEP8 compatible.
I made a PR that fixes the issue but I did not add a test case - where would be appropriate?