Legacy interface for prototype datasets?

The prototype datasets change the interface in two ways:

1. The input parameters are a little different in some cases. For example, in the current API the MNIST dataset is instantiated with datasets.MNIST(..., train=True), whereas now it looks like datasets.load("mnist", split="train").
2. The output is completely different. Before, we returned a tuple (sometimes of varying length), whereas now we always return a dictionary. Furthermore, we previously used PIL images and numpy arrays as return types, whereas now we always use tensor subclasses.

To lower the burden of moving to the new-style datasets a little, we could add a legacy: bool = False keyword argument to datasets.load(). For each dataset that has a legacy variant, we could simply implement two functions: one that maps the legacy input parameters to the new ones, and one that maps the new output back to the legacy format. If a dataset doesn’t support a legacy variant, we could simply raise an error.

cc @pmeier @bjuncek

Issue Analytics

Created 2 years ago
Comments: 7 (4 by maintainers)

Top GitHub Comments

2 reactions

datumbox commented, Dec 7, 2021

Yeah, I think this is a critical detail that needs to be highlighted here. If what we get from the dataset is different, the user still needs to make a bunch of changes to their code to handle this. It’s not just the X part of the training data pair; it might be necessary to handle the Y as well.

Given the above, I’m not sure if writing extra code to make things look different is worth it, given that the APIs need to be handled differently (due to their return types). Instead, one could argue that we should be putting enough good features in the new API, solving common user issues to make the migration worth it.

I would love to hear what others think on this. @NicolasHug @prabhat00155 @fmassa?

0 reactions

NicolasHug commented, Dec 7, 2021

Overall, I share @datumbox's points.

I don’t think it’s worth the hassle and extra complexity of maintaining two sets of APIs. If anything, I feel like this would actually hurt the migration, because some users would just do half of it instead of doing it all. And if/when we remove support for the legacy interface, they would have to apply a second set of changes.

“Instead, one could argue that we should be putting enough good features in the new API, solving common user issues to make the migration worth it.”

I was going to comment something along these lines before reading it.
