Story: Serializing datasets
Hi Torchtext,
It would be great to have a story for saving datasets. Things are currently not in a great place, and I would like to know where it might head.
- Things are not serializable. In opennmt-py, we are hacking around this issue by serializing Dataset/Field objects. This doesn't really work out of the box because of the usage of defaultdict. However, we can get around that issue by monkeypatching the `__getstate__` of Vocab (see the sketch after this list). Maybe this could be built in.
- Datasets take a ton of memory. I like that datasets are so clean, but their internal storage is not cheap: they store the field names as strings along with all of the string data itself. It's cute that conversion/batching happens on the fly, but it would be nice to be able to turn that off, i.e. convert to tensors up front if you want.
- Everything has to be loaded into memory. Dataset objects are currently monolithic: they assume the whole universe is stored directly in them. Ideally, large datasets could be stored on disk as shards, and the loading and usage of these shards would be invisible to the user.
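For the first point, here is a minimal sketch of the monkeypatch idea (roughly what we do in opennmt-py), assuming the legacy `torchtext.vocab.Vocab` layout where `stoi` is a `defaultdict` whose default factory (often a lambda) breaks pickling, and where unknown tokens map to index 0:

```python
from collections import defaultdict

import torchtext.vocab


def _vocab_getstate(self):
    # Replace the stoi defaultdict (whose default_factory may be a lambda,
    # which pickle cannot serialize) with a plain dict before pickling.
    return dict(self.__dict__, stoi=dict(self.stoi))


def _vocab_setstate(self, state):
    self.__dict__.update(state)
    # Restore defaultdict behaviour; assumes <unk> sits at index 0.
    self.stoi = defaultdict(lambda: 0, self.stoi)


# Monkeypatch so that pickling a Dataset/Field that holds a Vocab works.
torchtext.vocab.Vocab.__getstate__ = _vocab_getstate
torchtext.vocab.Vocab.__setstate__ = _vocab_setstate
```

With something like this in place, `torch.save(dataset, "dataset.pt")` (which pickles under the hood) can round-trip a Dataset together with its Fields and Vocabs.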
Thanks guys. As always great work. Cheers! Sasha
Top GitHub Comments
Hmm. My concern is that many NLP datasets are just too large to be kept in memory at all times, particularly since Python representations of dicts and strings are large memory-wise. I don't feel like torchtext acknowledges this. We can hack our own thing, but I would prefer for the design of the library to reflect the following use case:
Not sure if this is relevant, but MXNet has a nice binarization method for large data: https://mxnet.incubator.apache.org/faq/recordio.html The problem is that it is C/C++, but it might be helpful to port it to Python (?)
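RecordIO itself is C/C++, but the core idea — length-prefixed binary records plus an offset index, so shards can be read lazily with `seek` — is easy to sketch in pure Python. This is just an illustration of that idea, not MXNet's or torchtext's API; the names `write_records` and `read_record` are hypothetical:

```python
import pickle
import struct


def write_records(path, examples):
    """Write pickled examples as length-prefixed binary records.

    Returns the byte offset of every record so readers can seek to an
    arbitrary example without loading the whole file into memory.
    """
    offsets = []
    with open(path, "wb") as f:
        for example in examples:
            payload = pickle.dumps(example)
            offsets.append(f.tell())
            f.write(struct.pack("<I", len(payload)))  # 4-byte length header
            f.write(payload)
    return offsets


def read_record(path, offset):
    """Seek to a single record and deserialize it."""
    with open(path, "rb") as f:
        f.seek(offset)
        (length,) = struct.unpack("<I", f.read(4))
        return pickle.loads(f.read(length))
```

A sharded Dataset could keep only the offset index in memory and pull examples from disk on demand.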