Memory overflow using dict type features

I am trying to use word vector features to train a CRF model for named entity recognition. I am training with sklearn-crfsuite, which is a wrapper around python-crfsuite and follows the same convention for creating features as this library.

The word vector features add 300 dense features per token in the sequence. Because python-crfsuite uses dictionaries to specify the features, my training data alone ends up taking more than 5 times the memory used by the native CRFsuite binary with the training data in text files (35 GB vs 6.5 GB).
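To give a rough sense of where that overhead comes from, here is a minimal sketch that compares the in-memory size of one token's 300 embedding values stored as a Python dict against the same values in a packed `array`. The `emb_<i>` key naming and the `dict_footprint` helper are assumptions made for illustration. The numbers are specific to CPython and say nothing about how CRFsuite stores features internally, so treat them as an order-of-magnitude hint rather than an explanation of the 35 GB figure.

```python
import sys
from array import array

DIM = 300  # embedding size mentioned in the question

# One token's embedding, with made-up values.
vector = [i / DIM for i in range(DIM)]

# The same values as a feature dict (hypothetical key names) and as packed doubles.
as_dict = {"emb_%d" % i: v for i, v in enumerate(vector)}
as_array = array("d", vector)


def dict_footprint(d):
    """Approximate size in bytes: the dict itself plus every key and value object."""
    return sys.getsizeof(d) + sum(
        sys.getsizeof(k) + sys.getsizeof(v) for k, v in d.items()
    )


dict_bytes = dict_footprint(as_dict)
array_bytes = sys.getsizeof(as_array)

print("dict of features:", dict_bytes, "bytes per token")
print("packed doubles:  ", array_bytes, "bytes per token")
```

The dict pays for a hash table plus a separate string object and float object per feature, which is why keeping every token's dict alive at once gets expensive.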

Is there a more memory-efficient way to specify the features than using only dicts? I believe ItemSequence is a way to pass sequences instead of a list of dictionaries. Can I also pass generators for my training features instead of a list of training features, to keep the memory overhead low?

I know sklearn-crfsuite uses the six.moves.zip function which returns an iterator instead of a list of zipped X,y, so I believe it should be possible.

Can there be an example for using the ItemSequence class for giving the input data?
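One possible shape for this, as a sketch rather than a confirmed recipe: build the feature dicts for one sentence at a time inside a generator, wrap them in `pycrfsuite.ItemSequence`, and hand each sentence to `Trainer.append` before building the next. The helper names (`token_features`, `iter_sequences`, `feed`), the `emb_<i>` key naming, the toy data, and the toy embedding function are all assumptions made for illustration. The sketch assumes the trainer keeps its own copy of each appended sequence so the Python dicts can be garbage collected afterwards. The import is guarded so the streaming logic can be read and run even where python-crfsuite is not installed.

```python
try:
    import pycrfsuite  # optional here; only used to wrap features in ItemSequence
except ImportError:
    pycrfsuite = None

# Toy data (hypothetical): each sentence is a list of (word, label) pairs.
TOY_SENTENCES = [
    [("I", "O"), ("love", "O"), ("Paris", "B-LOC")],
    [("Hello", "O"), ("Berlin", "B-LOC")],
]


def toy_embed(word):
    """Stand-in for a real 300-dimensional word vector lookup."""
    return [len(word) / 10.0, 1.0 if word.istitle() else 0.0]


def token_features(word, vector):
    """One feature dict: a string feature plus one float feature per dimension."""
    feats = {"bias": 1.0, "word.lower": word.lower()}
    for i, value in enumerate(vector):
        feats["emb_%d" % i] = float(value)
    return feats


def iter_sequences(sentences, embed):
    """Lazily yield (xseq, yseq) for one sentence at a time."""
    for sentence in sentences:
        xseq = [token_features(word, embed(word)) for word, _ in sentence]
        yseq = [label for _, label in sentence]
        if pycrfsuite is not None:
            # Convert the list of dicts into CRFsuite's own sequence type.
            xseq = pycrfsuite.ItemSequence(xseq)
        yield xseq, yseq


def feed(trainer, sequences):
    """Append sequences to the trainer one by one; return how many were added."""
    count = 0
    for xseq, yseq in sequences:
        trainer.append(xseq, yseq)
        count += 1
    return count


# Usage with python-crfsuite installed:
#   trainer = pycrfsuite.Trainer(verbose=False)
#   feed(trainer, iter_sequences(TOY_SENTENCES, toy_embed))
#   trainer.train("model.crfsuite")
```

With this layout only one sentence's worth of feature dicts exists in Python at any moment. Whether the same works through sklearn-crfsuite's `fit` is not something this sketch verifies.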

Issue Analytics

  • State: closed
  • Created 7 years ago
  • Comments: 10 (4 by maintainers)

Top GitHub Comments

napsternxg commented, Jun 10, 2016 (4 reactions)

@tpeng, @kmike and @foxinfotech I have made a pull request with an API change which allows for easy addition of word embedding or other float list features. Do have a look and let me know if this is compatible with the overall API of python-crfsuite.

napsternxg commented, Jun 10, 2016 (0 reactions)

I have created a separate issue for the word embedding feature at #39. Please continue the discussion there. I am closing this issue as it is resolved.

