Pickling Tokenizers fails due to use of lambdas


Description

Cannot pickle the tokenizer returned by CountVectorizer.build_tokenizer() using the built-in Python pickle module, likely due to the use of lambdas in https://github.com/scikit-learn/scikit-learn/blob/master/sklearn/feature_extraction/text.py

Steps/Code to Reproduce

Example:

import pickle
from sklearn.feature_extraction.text import CountVectorizer
raw_texts = ["this is a text", "oh look, here's another", "including my full model vocab is...well, a lot"]
vectorizer = CountVectorizer(max_features=20000, token_pattern=r"\b\w+\b")
vectorizer.fit(raw_texts)
tokenizer = vectorizer.build_tokenizer()
output_file = 'foo.pkl'
with open(output_file, 'wb') as out:
    pickle.dump(tokenizer, out)
with open(output_file, 'rb') as infile:
    pickle.load(infile)

Expected Results

Program runs without error

Actual Results

Traceback:

Traceback (most recent call last):
  File "tst.py", line 14, in <module>
    pickle.dump(tokenizer, out)
AttributeError: Can't pickle local object 'VectorizerMixin.build_tokenizer.<locals>.<lambda>'
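
The same failure can be reproduced without scikit-learn. The sketch below is only an illustration: build_tokenizer here is a hypothetical stand-in that returns a lambda from inside a function, which is an assumption about the shape of the library code rather than a copy of it. Because pickle stores functions by reference to an importable name, a lambda created inside another function has nothing for it to point at.

```python
import pickle
import re


def build_tokenizer(token_pattern=r"\b\w+\b"):
    # Hypothetical stand-in, not scikit-learn's code: the returned lambda is
    # defined inside a function, so its qualified name contains "<locals>".
    compiled = re.compile(token_pattern)
    return lambda doc: compiled.findall(doc)


tokenizer = build_tokenizer()
tokens = tokenizer("this is a text")  # calling the lambda works fine

try:
    pickle.dumps(tokenizer)
    pickled_ok = True
except (AttributeError, pickle.PicklingError) as exc:
    # pickle cannot look the lambda up by name, so serialization fails.
    pickled_ok = False
    error_message = str(exc)
```

The exception type is caught broadly because it may differ between Python versions; on the Python 3.6 setup reported here it is AttributeError.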

Workaround:

Instead of the builtin pickle, use cloudpickle, which can capture the lambda expression.

Versions

Version information:

>>> import sklearn
>>> print(sklearn.show_versions())
/home/jay/Documents/projects/evidence-inference/venv/lib/python3.6/site-packages/numpy/distutils/system_info.py:625: UserWarning:
    Atlas (http://math-atlas.sourceforge.net/) libraries not found.
    Directories to search for the libraries can be specified in the
    numpy/distutils/site.cfg file (section [atlas]) or by setting
    the ATLAS environment variable.
  self.calc_info()
/usr/bin/ld: cannot find -lcblas
collect2: error: ld returned 1 exit status
/usr/bin/ld: cannot find -lcblas
collect2: error: ld returned 1 exit status

System:
    python: 3.6.5 (default, Apr  1 2018, 05:46:30)  [GCC 7.3.0]
executable: /home/jay/Documents/projects/evidence-inference/venv/bin/python
   machine: Linux-4.15.0-39-generic-x86_64-with-Ubuntu-18.04-bionic

BLAS:
    macros: NO_ATLAS_INFO=1, HAVE_CBLAS=None
  lib_dirs: /usr/lib/x86_64-linux-gnu
cblas_libs: cblas

Python deps:
       pip: 18.1
setuptools: 39.1.0
   sklearn: 0.20.2
     numpy: 1.15.1
     scipy: 1.1.0
    Cython: None
    pandas: 0.23.4
None

Similar Issues

I think this is similar to issues:

Proposed fix

Naively, I would make one of the two changes below, but I am not familiar with the scikit-learn codebase, so they might not be appropriate:

  1. Update the FAQ to direct people to other serialization libraries (perhaps I missed this recommendation?), e.g. cloudpickle at https://github.com/cloudpipe/cloudpickle or dill
  2. Remove the use of the lambdas in the vectorizer and replace them with locally def’d functions. I suspect that this solution is flawed because it doesn’t account for other uses of lambdas elsewhere in the codebase, and the only complete solution would be to stop using lambdas, but these are a useful language feature.
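
As a rough sketch of the second option, a tokenizer can be built without any lambda by binding the pattern to a function that already lives at module level. This is not the change made in scikit-learn, only one possible approach; it uses functools.partial with re.findall from the standard library. Note that a function def'd inside another function would fail to pickle in the same way as a lambda, so the callable has to be reachable by an importable name.

```python
import pickle
import re
from functools import partial

raw_text = "oh look, here's another"

# re.findall is a module-level function, so pickle can store it by name,
# and a functools.partial is picklable when its function and arguments are.
tokenizer = partial(re.findall, r"\b\w+\b")

# Round-trip through the built-in pickle module, then tokenize.
restored = pickle.loads(pickle.dumps(tokenizer))
tokens = restored(raw_text)
```

The trade-off is that re.findall looks the pattern up in the re module's cache on each call instead of holding a precompiled pattern object.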

Issue Analytics

  • State: closed
  • Created 5 years ago
  • Comments: 5 (3 by maintainers)

Top GitHub Comments

rth commented, Dec 20, 2018 (1 reaction)

Remove the use of the lambdas in the vectorizer and replace them with locally def’d functions.

+1, particularly since some of those are assigned to a named variable, which is not PEP 8 compliant.

jayded commented, Dec 19, 2018 (0 reactions)

I made a PR that fixes the issue but I did not add a test case - where would be appropriate?


