Story: Serializing datasets
Hi Torchtext,
It would be great to have a story for saving datasets. Things are currently not in a great place, and I would like to know where it might head.
- Things are not serializable. In opennmt-py, we are hacking around this issue by serializing Dataset/Field objects. This doesn't really work out of the box because of the usage of defaultdict. However, we can get around that issue by monkeypatching the `__getstate__` of Vocab (see the sketch after this list). Maybe this could be built in.
- Datasets take a ton of memory. I like that datasets are so clean, but their internal storage is not cheap: they store the field names as strings along with all of the string data itself. It's cute that conversion/batching happens on the fly, but it would be nice to be able to turn that off, i.e. convert to tensors up front if you want.
- Everything has to be loaded into memory. Dataset objects are currently monolithic: they assume the whole universe is stored directly in them. Ideally, large datasets could be stored on disk as shards, and the loading and usage of these shards would be invisible to the user.
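For the first point, here is a minimal sketch of the monkeypatch idea (roughly what we do in opennmt-py), assuming the legacy `torchtext.vocab.Vocab` layout where `stoi` is a `defaultdict` whose default factory (often a lambda) breaks pickling, and where unknown tokens map to index 0:

```python
from collections import defaultdict

import torchtext.vocab


def _vocab_getstate(self):
    # Replace the stoi defaultdict (whose default_factory may be a lambda,
    # which pickle cannot serialize) with a plain dict before pickling.
    return dict(self.__dict__, stoi=dict(self.stoi))


def _vocab_setstate(self, state):
    self.__dict__.update(state)
    # Restore defaultdict behaviour; assumes <unk> sits at index 0.
    self.stoi = defaultdict(lambda: 0, self.stoi)


# Monkeypatch so that pickling a Dataset/Field that holds a Vocab works.
torchtext.vocab.Vocab.__getstate__ = _vocab_getstate
torchtext.vocab.Vocab.__setstate__ = _vocab_setstate
```

With something like this in place, `torch.save(dataset, "dataset.pt")` (which pickles under the hood) can round-trip a Dataset together with its Fields and Vocabs.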
Thanks guys. As always great work. Cheers! Sasha
Top GitHub Comments
Hmm. My concern is that many NLP datasets are just too large to be kept in memory at all times, particularly since Python representations of dicts and strings are large memory-wise. I don't feel like torchtext acknowledges this. We can hack our own thing, but I would prefer for the design of the library to reflect the following use case:
Not sure if this is relevant, but MXNet has a nice binarization method for large data: https://mxnet.incubator.apache.org/faq/recordio.html The problem is that it is C/C++, but it might be helpful to port it to Python (?)
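RecordIO itself is C/C++, but the core idea — length-prefixed binary records plus an offset index, so shards can be read lazily with `seek` — is easy to sketch in pure Python. This is just an illustration of that idea, not MXNet's or torchtext's API; the names `write_records` and `read_record` are hypothetical:

```python
import pickle
import struct


def write_records(path, examples):
    """Write pickled examples as length-prefixed binary records.

    Returns the byte offset of every record so readers can seek to an
    arbitrary example without loading the whole file into memory.
    """
    offsets = []
    with open(path, "wb") as f:
        for example in examples:
            payload = pickle.dumps(example)
            offsets.append(f.tell())
            f.write(struct.pack("<I", len(payload)))  # 4-byte length header
            f.write(payload)
    return offsets


def read_record(path, offset):
    """Seek to a single record and deserialize it."""
    with open(path, "rb") as f:
        f.seek(offset)
        (length,) = struct.unpack("<I", f.read(4))
        return pickle.loads(f.read(length))
```

A sharded Dataset could keep only the offset index in memory and pull examples from disk on demand.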