Story: Serializing datasets

Hi Torchtext,

It would be great to have a story for saving datasets. Serialization is currently not in a great place, and I would like to know where it might be headed.

  1. Datasets and fields are not serializable out of the box. In opennmt-py, we are hacking around this by serializing Dataset/Field objects ourselves, which fails by default because Vocab uses a defaultdict; we can get around that by monkeypatching the __getstate__ of Vocab (a sketch of this workaround follows the list). Maybe this could be built in.

  2. Datasets take a ton of memory. I like that datasets are so clean, but their internal storage is not cheap: they keep the field names as strings alongside all of the string data itself. On-the-fly conversion/batching is neat, but it would be nice to be able to turn it off, i.e. convert everything to tensors up front if you want.

  3. It requires loading everything into memory. Dataset objects are currently monolithic: they assume the entire corpus is stored directly in them. Ideally, large datasets could be stored on disk as shards, and the loading and use of these shards would be invisible to the user.
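
For point 1, here is a minimal sketch of the kind of monkeypatch opennmt-py applies. It assumes the legacy torchtext Vocab keeps its string-to-index map in a defaultdict attribute named stoi whose lambda default_factory is what breaks pickling, and that index 0 is the unknown token; treat both as assumptions rather than documented API.

```python
# Minimal sketch of the pickling workaround described in point 1 above.
# Assumption: torchtext's (legacy) Vocab keeps its string-to-index map in a
# defaultdict called `stoi`; its lambda default_factory is what breaks pickle.
from collections import defaultdict

import torchtext


def _vocab_getstate(self):
    # Swap the unpicklable defaultdict for a plain dict before pickling.
    return dict(self.__dict__, stoi=dict(self.stoi))


def _vocab_setstate(self, state):
    self.__dict__.update(state)
    # Restore defaultdict behaviour: unknown tokens fall back to index 0.
    self.stoi = defaultdict(lambda: 0, self.stoi)


torchtext.vocab.Vocab.__getstate__ = _vocab_getstate
torchtext.vocab.Vocab.__setstate__ = _vocab_setstate
```

With the patch applied, pickle.dump(field.vocab, f) works, at the cost of hard-coding the unknown-token index.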

Thanks, guys. As always, great work. Cheers! Sasha

Issue Analytics

  • State: open
  • Created: 6 years ago
  • Reactions: 18
  • Comments: 16 (8 by maintainers)

Top GitHub Comments

5 reactions
srush commented, Nov 5, 2017

Hmm. My concern is that many NLP datasets are just too large to keep in memory at all times, particularly since the Python representations of dicts and strings are large memory-wise. I don’t feel like torchtext acknowledges this. We can hack our own thing, but I would prefer for the design of the library to reflect the following use case (a minimal sketch of such a pipeline follows the list):

  1. Stream data to construct fields.
  2. Write data to disk in some compressed form.
  3. Stream data out again to convert to batches.
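
For concreteness, here is a minimal sketch of that three-step pipeline, assuming examples have already been tokenized and numericalized into lists of token indices; the shard size, file layout, and helper names (write_shards, iter_batches) are illustrative only, not torchtext API.

```python
# Illustrative sketch of the three steps above, not torchtext API.
import glob
import os

import torch


def write_shards(examples, out_dir, shard_size=100_000):
    """Steps 1-2: consume a stream of numericalized examples, save tensor shards."""
    os.makedirs(out_dir, exist_ok=True)
    shard, shard_id = [], 0
    for ex in examples:  # `examples` can be a generator, so nothing is held in full
        shard.append(torch.tensor(ex, dtype=torch.long))
        if len(shard) == shard_size:
            torch.save(shard, os.path.join(out_dir, f"shard_{shard_id:05d}.pt"))
            shard, shard_id = [], shard_id + 1
    if shard:
        torch.save(shard, os.path.join(out_dir, f"shard_{shard_id:05d}.pt"))


def iter_batches(out_dir, batch_size=64):
    """Step 3: stream one shard at a time and yield lists of example tensors.

    Padding/stacking is left to the caller; the full dataset is never in memory.
    """
    for path in sorted(glob.glob(os.path.join(out_dir, "shard_*.pt"))):
        shard = torch.load(path)
        for i in range(0, len(shard), batch_size):
            yield shard[i:i + batch_size]
```
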
2 reactions
mongoose54 commented, Mar 9, 2018

Not sure if this is relevant, but MXNet has a nice binarization method for large data: https://mxnet.incubator.apache.org/faq/recordio.html The catch is that it is C/C++, but it might be worth porting the idea to Python (?)
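
This is not MXNet's actual RecordIO implementation; the hypothetical sketch below only illustrates the underlying idea (length-prefixed binary records appended to a flat file, streamed back without loading everything) in pure Python.

```python
# Pure-Python sketch of the length-prefixed record idea behind formats like
# RecordIO; not MXNet code. Each record is arbitrary bytes (e.g. a pickled or
# packed example), written after a 4-byte little-endian length header.
import struct


def write_records(path, records):
    with open(path, "wb") as f:
        for rec in records:
            f.write(struct.pack("<I", len(rec)))  # length header (< 4 GiB per record)
            f.write(rec)


def read_records(path):
    with open(path, "rb") as f:
        while True:
            header = f.read(4)
            if not header:
                break  # end of file
            (length,) = struct.unpack("<I", header)
            yield f.read(length)
```
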

Top Results From Across the Web

  • Cutting Edge: Binary Serialization of DataSets (Microsoft Learn): The DataSet serializes to an XML DiffGram—a rich XML schema that contains the current snapshot of the data as well as pending errors...
  • Serialize/Deserialize Large DataSet (Stack Overflow): The communication is done using WCF. The queried data, stored in a DataSet object, is very large and is usually round about 100mb...
  • Serializing Objects to a DataSet in Visual Basic .NET: This article demonstrates how to serialize an object to a stream as XML and read the XML into an ADO.NET DataSet.
  • Serializing a DataSet Object as XML (C# Corner): You can serialize a populated DataSet object to an XML file by executing the DataSet object's WriteXml method.
  • What, Why and How of (De)Serialization in Python: Storing the state of an object in a file or database can save time to process huge datasets in many data science projects...
