Memory overflow using dict type features
I am trying to use word vector features to train my CRF model for named entity recognition. I am using sklearn-crfsuite to train my model, which follows the same convention for creating features as this library and is a wrapper around python-crfsuite.
The word vector features add 300 dense features per token in the sequence. Because python-crfsuite uses dictionaries to specify the features, my training data ends up taking about 6 times more memory than the native CRFsuite binary needs with the same training data in text files (35 GB vs. 6.5 GB).
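For concreteness, here is a hedged sketch of how such dense features typically end up encoded as per-token dicts (the `word2features` helper and the `v<i>` feature names are illustrative, not part of any library API):

```python
import numpy as np

def word2features(vec):
    # vec: a 300-dim word vector from a pretrained embedding (assumption).
    # python-crfsuite interprets numeric dict values as feature weights,
    # so every token carries ~300 float entries.
    feats = {'bias': 1.0}
    feats.update({'v{}'.format(i): float(x) for i, x in enumerate(vec)})
    return feats

# A 20-token sentence becomes 20 dicts of ~300 keys each; the Python-side
# dict overhead (string keys, hashing, per-object pointers) is what
# multiplies memory use relative to CRFsuite's compact text format.
sent_vectors = np.random.rand(20, 300)
xseq = [word2features(v) for v in sent_vectors]
```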
Is there a more memory-efficient way to specify the features than using dicts? I believe ItemSequence is a way to pass sequences instead of a list of dictionaries. Can I also pass generators for my training features instead of a list, to keep the memory overhead low?
I know sklearn-crfsuite uses the six.moves.zip function, which returns an iterator instead of a list of zipped X, y pairs, so I believe it should be possible.
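Since Trainer.append takes one sequence at a time, one workaround along these lines is to feed it from a generator, so that only the current sequence's feature dicts are alive on the Python side. A minimal sketch, using a toy in-memory corpus as a stand-in for a real lazy reader (note that crfsuite still copies each appended sequence into its own internal storage):

```python
import pycrfsuite

# Toy corpus standing in for data that would, in practice, be read
# lazily from disk one sentence at a time (assumption).
CORPUS = [
    ([{'w': 'stockholm'}, {'w': 'rocks'}], ['B-LOC', 'O']),
    ([{'w': 'hello'}, {'w': 'world'}], ['O', 'O']),
]

def training_pairs():
    # Generator: yields one (xseq, yseq) pair at a time, so the full
    # list of feature dicts never exists in Python memory at once.
    for xseq, yseq in CORPUS:
        yield xseq, yseq

trainer = pycrfsuite.Trainer(verbose=False)
for xseq, yseq in training_pairs():
    trainer.append(xseq, yseq)
trainer.train('toy.crfsuite')
```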
Could there be an example of using the ItemSequence class to provide the input data?
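For reference, a minimal sketch of what ItemSequence usage could look like (the `emb0`/`emb1` embedding feature names are illustrative):

```python
import pycrfsuite

# ItemSequence wraps a sequence of per-token features; numeric dict
# values are treated as feature weights, strings as key=value flags.
xseq = pycrfsuite.ItemSequence([
    {'word.lower': 'stockholm', 'emb0': 0.25, 'emb1': -0.10},
    {'word.lower': 'rocks',     'emb0': 0.05, 'emb1':  0.30},
])
yseq = ['B-LOC', 'O']

print(xseq.items())  # the normalized items, as plain dicts

trainer = pycrfsuite.Trainer(verbose=False)
trainer.append(xseq, yseq)  # Trainer.append accepts an ItemSequence directly
```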
Issue Analytics
- Created: 7 years ago
- Comments: 10 (4 by maintainers)
Top GitHub Comments
@tpeng, @kmike and @foxinfotech I have made a pull request with an API change that allows easy addition of word-embedding or other float-list features. Please have a look and let me know if it is compatible with the overall API of python-crfsuite.
I have created a separate issue for the word embedding feature at #39; please continue the discussion there. I am closing this issue as resolved.