
Preprocessing module is very confusing

See original GitHub issue

❓ Questions and Help

There seems to be a bit of code smell cropping up around issue #770, related to to_onnx and how example inputs are used. I have a few questions.

What is your question?

  1. How does the DataPipeline class relate to the preprocess module? The documentation for this class is very sparse.
  2. Why is it the responsibility of the Preprocessor to instantiate the data source when from_data_source("something custom") is called to create a data module? This flow is very unintuitive.
  3. ApplyToKeys operates on dictionaries, but example_input_array can be a tensor, tuple, or dict. Where and how is this handled? In to_onnx or summarize, for example, the example_input_array tensor is passed through the model, which works fine on the model alone. The preprocessing, however, mostly expects a dict, since the docs strongly encourage ApplyToKeys and everything in Flash is supposed to stay a dict until right before inference. This mismatch results in all kinds of strange errors (see the short sketch after this list).
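
To make question 3 concrete, here is a minimal sketch of the mismatch (assuming the flash 0.5.0 ApplyToKeys and DefaultDataKeys names behave as described; the transform and shapes are purely illustrative):

    import torch
    from flash.core.data.data_source import DefaultDataKeys
    from flash.core.data.transforms import ApplyToKeys

    # Transforms written the "Flash way" assume a dict batch keyed by DefaultDataKeys.
    per_batch_transform = ApplyToKeys(DefaultDataKeys.INPUT, torch.nn.Flatten())

    dict_batch = {
        DefaultDataKeys.INPUT: torch.rand(2, 3, 32, 32),
        DefaultDataKeys.TARGET: torch.rand(2, 1),
    }
    dict_batch = per_batch_transform(dict_batch)  # fine: the transform picks out the INPUT entry

    # But to_onnx and summarize push example_input_array (a bare tensor) through the
    # same batch-transfer hooks, so the dict-shaped preprocessing receives a tensor
    # instead of a mapping, and the strange errors start there.
    example_input_array = torch.rand(2, 3, 32, 32)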

Code

The fix for question 3 seems to be to override this on my custom Task:

    def _apply_batch_transfer_handler(
        self, batch: Any, device: Optional[torch.device] = None, dataloader_idx: Optional[int] = None
    ) -> Any:
        # Wrap bare tensors (e.g. example_input_array) in the dict the preprocessing expects, then unwrap.
        if isinstance(batch, torch.Tensor):
            wrapped = {DefaultDataKeys.INPUT: batch}
            out = super()._apply_batch_transfer_handler(wrapped, device=device, dataloader_idx=dataloader_idx)
            return out[DefaultDataKeys.INPUT]
        return super()._apply_batch_transfer_handler(batch, device=device, dataloader_idx=dataloader_idx)

But I have no idea what kinds of side effects this would have, and it feels like a strange fix to a fundamental problem.
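
For reference, a rough sketch of how the override gets exercised; MyCustomTask and the input shape are placeholders, not real Flash classes:

    # Hypothetical usage, assuming a custom Task subclass that carries the override above.
    task = MyCustomTask()  # placeholder name for a custom flash Task
    task.example_input_array = torch.rand(1, 3, 64, 64)
    task.summarize()            # previously crashed: the bare tensor hit the dict-based hooks
    task.to_onnx("model.onnx")  # same story for ONNX export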

What have you tried?

This issue sprang up when I implemented a custom Preprocess class and spiraled from there. to_onnx and summarize (and, by extension, training) are broken due to this bug. The ImageClassificationPreprocess won’t work for me because my outputs are continuous.
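
A minimal custom Preprocess for continuous targets might look roughly like the sketch below. It assumes the flash 0.5.0 Preprocess base class, the default_transforms hook, and the "to_tensor_transform" stage name; the class name is hypothetical.

    from typing import Callable, Dict, Optional

    import torch
    from flash.core.data.data_source import DefaultDataKeys
    from flash.core.data.process import Preprocess
    from flash.core.data.transforms import ApplyToKeys

    class ContinuousTargetPreprocess(Preprocess):
        # Hypothetical minimal Preprocess for continuous (regression-style) targets,
        # keeping everything dict-shaped as the docs encourage.
        def default_transforms(self) -> Optional[Dict[str, Callable]]:
            return {
                "to_tensor_transform": ApplyToKeys(DefaultDataKeys.INPUT, torch.as_tensor),
            }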

What’s your environment?

  • OS: All
  • Packaging: conda/pip
  • Version: lightning 1.4.7, flash 0.5.0, bolts 0.4.0

Issue Analytics

  • State: closed
  • Created 2 years ago
  • Comments: 8 (7 by maintainers)

Top GitHub Comments

1 reaction
dlangerm commented, Sep 29, 2021

0 reactions
stale[bot] commented, Nov 29, 2021

This issue has been automatically marked as stale because it has not had recent activity. It will be closed if no further activity occurs. Thank you for your contributions.

