Question on DataloaderCreator - How to create test sets
Hello,
Well done on putting together this library; I think it will be extremely useful for many people undertaking domain adaptation projects.
I am wondering how to create a test dataset using the DataloaderCreator class?
Some background on my issue.
I am using the MNISTM example within a PyTorch Lightning DataModule. Adapting the code from examples/DANNLightning.ipynb, I have the following:
```python
import os
from typing import Optional

from pytorch_lightning import LightningDataModule
from torch.utils.data import Dataset

from pytorch_adapt.datasets import DataloaderCreator, get_mnist_mnistm
from pytorch_adapt.validators import IMValidator
# filter_datasets is the helper used in examples/DANNLightning.ipynb


class MnistAdaptDataModule(LightningDataModule):
    def __init__(
        self,
        data_dir: str = "data/mnistm/",
        batch_size: int = 4,
        num_workers: int = 0,
        pin_memory: bool = False,
    ):
        super().__init__()
        # this line allows access to init params with the 'self.hparams' attribute
        # it also ensures init params will be stored in the ckpt
        self.save_hyperparameters(logger=False)

        self.data_train: Optional[Dataset] = None
        self.data_val: Optional[Dataset] = None
        self.data_test: Optional[Dataset] = None
        self.dataloaders = None

    def prepare_data(self):
        if not os.path.exists(self.hparams.data_dir):
            print("downloading dataset")
            get_mnist_mnistm(["mnist"], ["mnistm"], folder=self.hparams.data_dir, download=True)

    def setup(self, stage: Optional[str] = None):
        if not self.data_train and not self.data_val and not self.data_test:
            datasets = get_mnist_mnistm(["mnist"], ["mnistm"], folder=self.hparams.data_dir, download=False)
            dc = DataloaderCreator(batch_size=self.hparams.batch_size, num_workers=self.hparams.num_workers)
            validator = IMValidator()
            self.dataloaders = dc(**filter_datasets(datasets, validator))
            self.data_train = self.dataloaders.pop("train")
            self.data_val = list(self.dataloaders.values())

    def train_dataloader(self):
        return self.data_train

    def val_dataloader(self):
        return self.data_val

    def test_dataloader(self):
        # how to make a test dataset?
        return
```
self.dataloaders produces the following object:

```
{'src_train': SourceDataset(
domain=0
(dataset): ConcatDataset(
len=60000
(datasets): [Dataset MNIST
Number of datapoints: 60000
Root location: /home/eoghan/Code/mnist-domain-adaptation/data/mnist_adapt/
Split: Train
StandardTransform
Transform: Compose(
Resize(size=32, interpolation=bilinear, max_size=None, antialias=None)
ToTensor()
<pytorch_adapt.utils.transforms.GrayscaleToRGB object at 0x7fd1badcbdc0>
Normalize(mean=[0.485, 0.456, 0.406], std=[0.229, 0.224, 0.225])
)]
)
), 'src_val': SourceDataset(
domain=0
(dataset): ConcatDataset(
len=10000
(datasets): [Dataset MNIST
Number of datapoints: 10000
Root location: /home/eoghan/Code/mnist-domain-adaptation/data/mnist_adapt/
Split: Test
StandardTransform
Transform: Compose(
Resize(size=32, interpolation=bilinear, max_size=None, antialias=None)
ToTensor()
<pytorch_adapt.utils.transforms.GrayscaleToRGB object at 0x7fd1badcb6a0>
Normalize(mean=[0.485, 0.456, 0.406], std=[0.229, 0.224, 0.225])
)]
)
), 'target_train': TargetDataset(
domain=1
(dataset): ConcatDataset(
len=59001
(datasets): [MNISTM(
domain=MNISTM
len=59001
(transform): Compose(
ToTensor()
Normalize(mean=[0.485, 0.456, 0.406], std=[0.229, 0.224, 0.225])
)
)]
)
), 'target_val': TargetDataset(
domain=1
(dataset): ConcatDataset(
len=9001
(datasets): [MNISTM(
domain=MNISTM
len=9001
(transform): Compose(
ToTensor()
Normalize(mean=[0.485, 0.456, 0.406], std=[0.229, 0.224, 0.225])
)
)]
)
), 'train': CombinedSourceAndTargetDataset(
(source_dataset): SourceDataset(
domain=0
(dataset): ConcatDataset(
len=60000
(datasets): [Dataset MNIST
Number of datapoints: 60000
Root location: /home/eoghan/Code/mnist-domain-adaptation/data/mnist_adapt/
Split: Train
StandardTransform
Transform: Compose(
Resize(size=32, interpolation=bilinear, max_size=None, antialias=None)
ToTensor()
<pytorch_adapt.utils.transforms.GrayscaleToRGB object at 0x7fd125f69d60>
Normalize(mean=[0.485, 0.456, 0.406], std=[0.229, 0.224, 0.225])
)]
)
)
(target_dataset): TargetDataset(
domain=1
(dataset): ConcatDataset(
len=59001
(datasets): [MNISTM(
domain=MNISTM
len=59001
(transform): Compose(
ToTensor()
Normalize(mean=[0.485, 0.456, 0.406], std=[0.229, 0.224, 0.225])
)
)]
)
)
)}
```
This handles train and val for source and target, as well as creating a combined train dataset. Going by the example notebook, the concatenated dataset for train (of source and target) is used as the training dataset for the model. The validation set is a list of the remaining dataloaders and has the following form:

```
[
<torch.utils.data.dataloader.DataLoader object at 0x7fd1063e6b80> {
dataset: TargetDataset(
domain=1
(dataset): ConcatDataset(
len=59001
(datasets): [MNISTM(
domain=MNISTM
len=59001
(transform): Compose(
ToTensor()
Normalize(mean=[0.485, 0.456, 0.406], std=[0.229, 0.224, 0.225])
)
)]
)
)
}
]
```
I am not sure why this is the validation dataset. Do we validate only on the target domain? How would we handle this validation set if the target domain is unlabelled? If you could explain why this is the case, I would appreciate the insight.
In summary, I am looking for guidance on how to use something like torch.utils.data.random_split to take some of the source and target data and have the DataloaderCreator pass back test sets along with train and val. Is this possible within the framework?
Many thanks, Eoghan
@deepseek-eoghan I updated the docs. Re: contributing to the docs, that would be very helpful, but I think example Jupyter notebooks would be even better, since I assume that’s where most people look first.
This is an unsolved problem in unsupervised domain adaptation. We want high accuracy on the unlabeled target domain, but since it is unlabeled, it is difficult to determine the model’s performance.
Whether or not we validate only on the target domain depends on the type of validator. The IMValidator uses only the target domain to compute a validation score, which is why the validation dataloaders returned by filter_datasets consist of only the target domain.

You could use a validator that combines source val accuracy with the IM score:
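For example, a minimal sketch, assuming the AccuracyValidator and MultipleValidators classes from pytorch_adapt.validators:

```python
from pytorch_adapt.validators import AccuracyValidator, IMValidator, MultipleValidators

# AccuracyValidator requires "src_val" by default and IMValidator requires
# "target_train", so filter_datasets(datasets, validator) should now keep
# both splits for validation.
validator = MultipleValidators(
    {"src_accuracy": AccuracyValidator(), "im": IMValidator()}
)
```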
Now the validation dataloaders should consist of the src_val set and the target_train set. Note that the target_train set is used for validation, because it is assumed that the target_val set is reserved for testing. (This is a bit confusing, however it’s the most realistic setting in my opinion. I can expand on this if you want.)

You can make the IMValidator use target_val instead of target_train like this:
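A minimal sketch, assuming the validator accepts a key_map argument for remapping split names:

```python
from pytorch_adapt.validators import IMValidator

# Remap the required splits so that the validator reads "target_val"
# wherever it would normally look for "target_train".
validator = IMValidator(key_map={"target_val": "target_train"})
```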
You can split the datasets however you want, as long as the DataloaderCreator recognizes the names of the splits you pass in:
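For example, a sketch using torch.utils.data.random_split on the dict returned by get_mnist_mnistm; the "target_test" name and the train_names/val_names arguments are illustrative, so check the DataloaderCreator docs for the exact parameters:

```python
import torch
from torch.utils.data import random_split

from pytorch_adapt.datasets import DataloaderCreator

# datasets is the dict returned by get_mnist_mnistm(...).
# Carve a test split out of the existing target_val dataset.
# "target_test" is a name chosen for illustration, not a library default.
val_size = len(datasets["target_val"]) // 2
test_size = len(datasets["target_val"]) - val_size
datasets["target_val"], datasets["target_test"] = random_split(
    datasets["target_val"],
    [val_size, test_size],
    generator=torch.Generator().manual_seed(42),
)

# Tell DataloaderCreator which split names get train-style loaders (shuffled)
# and which get val-style loaders (sequential).
dc = DataloaderCreator(
    batch_size=32,
    num_workers=2,
    train_names=["train"],
    val_names=["src_train", "src_val", "target_train", "target_val", "target_test"],
)
dataloaders = dc(**datasets)
```

In the LightningDataModule above, test_dataloader() could then return dataloaders["target_test"].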
Let me know if you have more questions!