
Question on DataloaderCreator - How to create test sets

See original GitHub issue

Hello,

Well done on putting together this library; I think it will be extremely useful for many people undertaking domain adaptation projects.

I am wondering how to create a test dataset using the DataloaderCreator class?

Some background on my issue.

I am using the MNISTM example within a PyTorch Lightning DataModule.

Adapting the code from examples/DANNLightning.ipynb, I have the following code.

import os
from typing import Optional

from pytorch_lightning import LightningDataModule
from torch.utils.data import Dataset

# Import paths below follow the pytorch_adapt examples; adjust if your version differs.
from pytorch_adapt.datasets import DataloaderCreator, get_mnist_mnistm
from pytorch_adapt.frameworks.utils import filter_datasets
from pytorch_adapt.validators import IMValidator


class MnistAdaptDataModule(LightningDataModule):
    def __init__(
        self,
        data_dir: str = "data/mnistm/",
        batch_size: int = 4,
        num_workers: int = 0,
        pin_memory: bool = False,
    ):
        super().__init__()

        # this line allows to access init params with 'self.hparams' attribute
        # it also ensures init params will be stored in ckpt
        self.save_hyperparameters(logger=False)

        self.data_train: Optional[Dataset] = None
        self.data_val: Optional[Dataset] = None
        self.data_test: Optional[Dataset] = None
        self.dataloaders = None

    def prepare_data(self):
        if not os.path.exists(self.hparams.data_dir):
            print("downloading dataset")
            get_mnist_mnistm(["mnist"], ["mnistm"], folder=self.hparams.data_dir, download=True)
        return


    def setup(self, stage: Optional[str] = None):
        if not self.data_train and not self.data_val and not self.data_test:
            datasets = get_mnist_mnistm(["mnist"], ["mnistm"], folder=self.hparams.data_dir, download=False)
            dc = DataloaderCreator(batch_size=self.hparams.batch_size, num_workers=self.hparams.num_workers)
            validator = IMValidator()
            self.dataloaders = dc(**filter_datasets(datasets, validator))
            self.data_train = self.dataloaders.pop("train")
            self.data_val = list(self.dataloaders.values())

    def train_dataloader(self):
        return self.data_train

    def val_dataloader(self):
        return self.data_val

    def test_dataloader(self):
        # how to make a test dataset?
        return

After setup() runs, self.dataloaders contains the following object:

{'src_train': SourceDataset(
  domain=0
  (dataset): ConcatDataset(
    len=60000
    (datasets): [Dataset MNIST
        Number of datapoints: 60000
        Root location: /home/eoghan/Code/mnist-domain-adaptation/data/mnist_adapt/
        Split: Train
        StandardTransform
    Transform: Compose(
                   Resize(size=32, interpolation=bilinear, max_size=None, antialias=None)
                   ToTensor()
                   <pytorch_adapt.utils.transforms.GrayscaleToRGB object at 0x7fd1badcbdc0>
                   Normalize(mean=[0.485, 0.456, 0.406], std=[0.229, 0.224, 0.225])
               )]
  )
), 'src_val': SourceDataset(
  domain=0
  (dataset): ConcatDataset(
    len=10000
    (datasets): [Dataset MNIST
        Number of datapoints: 10000
        Root location: /home/eoghan/Code/mnist-domain-adaptation/data/mnist_adapt/
        Split: Test
        StandardTransform
    Transform: Compose(
                   Resize(size=32, interpolation=bilinear, max_size=None, antialias=None)
                   ToTensor()
                   <pytorch_adapt.utils.transforms.GrayscaleToRGB object at 0x7fd1badcb6a0>
                   Normalize(mean=[0.485, 0.456, 0.406], std=[0.229, 0.224, 0.225])
               )]
  )
), 'target_train': TargetDataset(
  domain=1
  (dataset): ConcatDataset(
    len=59001
    (datasets): [MNISTM(
      domain=MNISTM
      len=59001
      (transform): Compose(
          ToTensor()
          Normalize(mean=[0.485, 0.456, 0.406], std=[0.229, 0.224, 0.225])
      )
    )]
  )
), 'target_val': TargetDataset(
  domain=1
  (dataset): ConcatDataset(
    len=9001
    (datasets): [MNISTM(
      domain=MNISTM
      len=9001
      (transform): Compose(
          ToTensor()
          Normalize(mean=[0.485, 0.456, 0.406], std=[0.229, 0.224, 0.225])
      )
    )]
  )
), 'train': CombinedSourceAndTargetDataset(
  (source_dataset): SourceDataset(
    domain=0
    (dataset): ConcatDataset(
      len=60000
      (datasets): [Dataset MNIST
          Number of datapoints: 60000
          Root location: /home/eoghan/Code/mnist-domain-adaptation/data/mnist_adapt/
          Split: Train
          StandardTransform
      Transform: Compose(
                     Resize(size=32, interpolation=bilinear, max_size=None, antialias=None)
                     ToTensor()
                     <pytorch_adapt.utils.transforms.GrayscaleToRGB object at 0x7fd125f69d60>
                     Normalize(mean=[0.485, 0.456, 0.406], std=[0.229, 0.224, 0.225])
                 )]
    )
  )
  (target_dataset): TargetDataset(
    domain=1
    (dataset): ConcatDataset(
      len=59001
      (datasets): [MNISTM(
        domain=MNISTM
        len=59001
        (transform): Compose(
            ToTensor()
            Normalize(mean=[0.485, 0.456, 0.406], std=[0.229, 0.224, 0.225])
        )
      )]
    )
  )
)}

This handles train and val for source and target, as well as creating a combined train dataset.

Going by the example notebook, the combined source-and-target "train" dataset is used as the training dataset for the model.

The validation set is a list of the remaining dataloaders and has the following form:

[
<torch.utils.data.dataloader.DataLoader object at 0x7fd1063e6b80> {
    dataset: TargetDataset(
  domain=1
  (dataset): ConcatDataset(
    len=59001
    (datasets): [MNISTM(
      domain=MNISTM
      len=59001
      (transform): Compose(
          ToTensor()
          Normalize(mean=[0.485, 0.456, 0.406], std=[0.229, 0.224, 0.225])
      )
    )]
  )
)
}
]

I am not sure why this is the validation dataset. Do we validate on only the target domain? How would we handle this validation set if the target domain is unlabelled? If you could explain why this is the case, I would appreciate some insight.

In summary, I am looking for guidance on how to use something like torch.utils.data.random_split to take some of the source and target data and have DataloaderCreator pass back test sets along with train and val. Is this possible within the framework?

Many thanks, Eoghan

Issue Analytics

  • State: closed
  • Created a year ago
  • Comments:6 (6 by maintainers)

Top GitHub Comments

1 reaction
KevinMusgrave commented, Apr 11, 2022

@deepseek-eoghan I updated the docs. Re: contributing to the docs, that would be very helpful, but I think example jupyter notebooks would be even better since I assume that’s where most people look first.

1 reaction
KevinMusgrave commented, Apr 7, 2022

> I am not sure why this is the validation dataset. Do we validate on only the target domain? How would we handle this validation set if the target domain is unlabelled? If you could explain why this is the case, I would appreciate some insight.

This is an unsolved problem in unsupervised domain adaptation. We want high accuracy on the unlabeled target domain, but since it is unlabeled, it is difficult to determine the model’s performance.

Whether or not we validate only on the target domain depends on the type of validator. The IMValidator uses only the target domain to compute a validation score, which is why the validation dataloaders returned by filter_datasets consist of only the target domain:

validator = IMValidator() # uses only the target domain to compute a validation score
self.dataloaders = dc(**filter_datasets(datasets, validator))

You could use a validator that adds source val accuracy plus the IM score:

from pytorch_adapt.validators import MultipleValidators, AccuracyValidator, IMValidator
validator = MultipleValidators([AccuracyValidator(), IMValidator()])
self.dataloaders = dc(**filter_datasets(datasets, validator))

Now the validation dataloaders should consist of the src_val set and the target_train set. Note that the target_train set is used for validation, because it is assumed that the target_val set is reserved for testing. (This is a bit confusing; however, it's the most realistic setting in my opinion. I can expand on this if you want.)

You can make the IMValidator use target_val instead of target_train like this:

validator = IMValidator(key_map={"target_val": "target_train"})

> In summary, I am looking for guidance on how to use something like torch.utils.data.random_split to take some of the source and target data and have DataloaderCreator pass back test sets along with train and val. Is this possible within the framework?

You can split the datasets however you want, as long as DataloaderCreator recognizes the names of the splits you pass in:

dc = DataloaderCreator(val_names=["src_val", "target_val", "src_test", "target_test"])
dataloaders = dc(src_val=dataset1, target_val=dataset2, src_test=dataset3, target_test=dataset4)
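A minimal sketch of that idea, using torch.utils.data.random_split to carve a test subset out of an existing dataset before handing the pieces to DataloaderCreator. The stand-in TensorDataset, the 80/10/10 split sizes, and the seed here are illustrative assumptions, not from the original issue:

```python
import torch
from torch.utils.data import TensorDataset, random_split

# Stand-in dataset; in practice this would be e.g. one of the datasets
# returned by get_mnist_mnistm, before wrapping in dataloaders.
full = TensorDataset(torch.arange(100).float())

# Carve out val and test subsets with a fixed seed for reproducibility.
val_len, test_len = 10, 10
train_len = len(full) - val_len - test_len
train_ds, val_ds, test_ds = random_split(
    full,
    [train_len, val_len, test_len],
    generator=torch.Generator().manual_seed(42),
)

print(len(train_ds), len(val_ds), len(test_ds))  # 80 10 10

# The subsets can then be passed to DataloaderCreator under the names it
# was configured with, following the snippet above, e.g.:
# dc = DataloaderCreator(val_names=["src_val", "target_val", "src_test", "target_test"])
# dataloaders = dc(..., target_val=val_ds, target_test=test_ds)
```

The same pattern applies to both source and target datasets; only the names passed to DataloaderCreator need to match its configured val_names.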

Let me know if you have more questions!
