Memory overflow using dict type features
I am trying to use word vector features to train my CRF model for named entity recognition. I am using sklearn-crfsuite to train my model, which follows the same convention for creating features as this library and is a wrapper around python-crfsuite.
The word vector features add 300 dense features per token in the sequence. Because python-crfsuite uses dictionaries to specify the features, my training data ends up taking about 6 times more memory than the native CRFsuite binary needs with the same training data in text files (35 GB vs. 6.5 GB).
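For concreteness, here is a hedged sketch of how such dense features typically end up encoded as per-token dicts (the `word2features` helper and the `v<i>` feature names are illustrative, not part of any library API):

```python
import numpy as np

def word2features(vec):
    # vec: a 300-dim word vector from a pretrained embedding (assumption).
    # python-crfsuite interprets numeric dict values as feature weights,
    # so every token carries ~300 float entries.
    feats = {'bias': 1.0}
    feats.update({'v{}'.format(i): float(x) for i, x in enumerate(vec)})
    return feats

# A 20-token sentence becomes 20 dicts of ~300 keys each; the Python-side
# dict overhead (string keys, hashing, per-object pointers) is what
# multiplies memory use relative to CRFsuite's compact text format.
sent_vectors = np.random.rand(20, 300)
xseq = [word2features(v) for v in sent_vectors]
```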
Is there a more memory-efficient way to specify the features than using dicts? I believe ItemSequence is a way to pass sequences instead of a list of dictionaries. Can I also pass generators for my training features instead of a list, to keep the memory overhead low?
I know sklearn-crfsuite uses the six.moves.zip function, which returns an iterator instead of a list of zipped X, y pairs, so I believe it should be possible.
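Since Trainer.append takes one sequence at a time, one workaround along these lines is to feed it from a generator, so that only the current sequence's feature dicts are alive on the Python side. A minimal sketch, using a toy in-memory corpus as a stand-in for a real lazy reader (note that crfsuite still copies each appended sequence into its own internal storage):

```python
import pycrfsuite

# Toy corpus standing in for data that would, in practice, be read
# lazily from disk one sentence at a time (assumption).
CORPUS = [
    ([{'w': 'stockholm'}, {'w': 'rocks'}], ['B-LOC', 'O']),
    ([{'w': 'hello'}, {'w': 'world'}], ['O', 'O']),
]

def training_pairs():
    # Generator: yields one (xseq, yseq) pair at a time, so the full
    # list of feature dicts never exists in Python memory at once.
    for xseq, yseq in CORPUS:
        yield xseq, yseq

trainer = pycrfsuite.Trainer(verbose=False)
for xseq, yseq in training_pairs():
    trainer.append(xseq, yseq)
trainer.train('toy.crfsuite')
```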
Could there be an example of using the ItemSequence class to provide the input data?
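For reference, a minimal sketch of what ItemSequence usage could look like (the `emb0`/`emb1` embedding feature names are illustrative):

```python
import pycrfsuite

# ItemSequence wraps a sequence of per-token features; numeric dict
# values are treated as feature weights, strings as key=value flags.
xseq = pycrfsuite.ItemSequence([
    {'word.lower': 'stockholm', 'emb0': 0.25, 'emb1': -0.10},
    {'word.lower': 'rocks',     'emb0': 0.05, 'emb1':  0.30},
])
yseq = ['B-LOC', 'O']

print(xseq.items())  # the normalized items, as plain dicts

trainer = pycrfsuite.Trainer(verbose=False)
trainer.append(xseq, yseq)  # Trainer.append accepts an ItemSequence directly
```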
Issue Analytics
- Created: 7 years ago
- Comments: 10 (4 by maintainers)
Top GitHub Comments
@tpeng, @kmike and @foxinfotech I have made a pull request with an API change that allows easy addition of word-embedding or other float-list features. Please have a look and let me know if it is compatible with the overall API of python-crfsuite.
I have created a separate issue for the word embedding feature at #39; please continue the discussion there. I am closing this issue as resolved.