
CropForegroundd does not support the 'ddp_spawn' distributed strategy?

See original GitHub issue

Describe the bug

When I use the following transform for distributed training, I get an error:

train_transforms = Compose([
    LoadImaged(keys=["image", "label"]),
    ScaleIntensityRanged(
        keys=["image"], a_min=45, a_max=167,
        b_min=0.0, b_max=1.0, clip=True,
    ),
    AddChanneld(keys=["image", "label"]),
    CropForegroundd(
        keys=["image", "label"], source_key="label",
        select_fn=self.threshold_lager_one, margin=20,
    ),
    Resized(
        keys=["image", "label"], spatial_size=[256, 256, 256],
        mode=("trilinear", "nearest"), align_corners=(False, None),
    ),
])

The error is: "Default process group has not been initialized, please make sure to call init_process_group".

To Reproduce

Here is my pytorch_lightning trainer setting:

trainer = pytorch_lightning.Trainer(
    gpus=[0, 1],
    strategy='ddp_spawn',
    max_epochs=50,
    logger=tb_logger,
    checkpoint_callback=True,
    num_sanity_val_steps=1,
    check_val_every_n_epoch=5,
    log_every_n_steps=1,
)
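
As a hedged aside (not raised in this excerpt): on PyTorch Lightning releases of this vintage, strategy='ddp' starts each GPU worker by re-launching the script in a subprocess, so the LightningModule is never pickled the way 'ddp_spawn' requires. If the failure is pickling-related, swapping strategies is a quick A/B test; a minimal sketch with the same two-GPU setup:

import pytorch_lightning

# sketch: identical settings, but 'ddp' skips the pickling step of 'ddp_spawn'
trainer = pytorch_lightning.Trainer(
    gpus=[0, 1],
    strategy='ddp',
    max_epochs=50,
)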

When I comment out CropForegroundd, distributed training works, which is strange. (A screenshot dated 2022-03-20 13:47:33 was attached here; omitted.)
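
One plausible cause (an assumption on this page, not a maintainer-confirmed diagnosis): select_fn=self.threshold_lager_one is a bound method, so when ddp_spawn pickles the transforms for its spawned workers, it drags the entire LightningModule, including its distributed hooks, into the child process before init_process_group has run. CropForegroundd itself performs no distributed communication. A minimal sketch of the usual workaround, using a hypothetical module-level function in place of the bound method:

from monai.transforms import CropForegroundd

def threshold_above_zero(x):
    # hypothetical standalone replacement for self.threshold_lager_one;
    # a plain function holds no reference to the model and pickles cleanly
    return x > 0

crop = CropForegroundd(
    keys=["image", "label"],
    source_key="label",
    select_fn=threshold_above_zero,
    margin=20,
)

If that hypothesis holds, the transform pickles cleanly and the process-group error should disappear.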

Issue Analytics

  • State: closed
  • Created 2 years ago
  • Comments:5 (3 by maintainers)

Top GitHub Comments

1 reaction
cvlearn913 commented, Mar 21, 2022

For 2D fake data:

import monai
from monai.utils import set_determinism
from PIL import Image
from monai.transforms import (
    AsDiscrete,
    Compose,
    EnsureType,
    Activations, LoadImaged, ScaleIntensityRanged, AddChanneld, CropForegroundd, Resized
)
from monai.networks.nets import UNet
from monai.networks.layers import Norm
from monai.metrics import DiceMetric
from monai.losses import DiceLoss
from monai.inferers import SimpleInferer
from monai.data import list_data_collate, decollate_batch, create_test_image_2d, CacheDataset, DataLoader, Dataset, \
    create_test_image_3d
import torch
import pytorch_lightning

import glob
import os
import matplotlib
matplotlib.use('agg')

class Net(pytorch_lightning.LightningModule):
    def __init__(self):
        super().__init__()
        self._model = UNet(
            spatial_dims=2,
            in_channels=1,
            out_channels=1,
            channels=(16, 32, 64, 128, 256),
            strides=(2, 2, 2, 2),
            num_res_units=0,
            norm=Norm.BATCH,
        )
        self.loss_function = DiceLoss(to_onehot_y=True, sigmoid=True)
        self.post_pred = Compose([EnsureType(), Activations(sigmoid=True)])
        self.post_label = Compose([EnsureType()])
        self.dice_metric = DiceMetric(include_background=False, reduction="mean_batch", get_not_nans=False)


        self.best_val_dice = 0
        self.best_val_epoch = 0

    def forward(self, x):
        return self._model(x).type_as(x)

    def threshold_lager_one(self, x):
        # threshold at 0
        return x > 0

    def prepare_data(self):

        tempdir_train = "./tempdir/imagesTr"
        tempdir_test = "./tempdir/labelsTr"
        # create the output folders first, otherwise PIL's save() fails
        os.makedirs(tempdir_train, exist_ok=True)
        os.makedirs(tempdir_test, exist_ok=True)
        for i in range(20):
            im, seg = create_test_image_2d(512, 512, num_seg_classes=1)
            Image.fromarray((im * 255).astype("uint8")).save(os.path.join(tempdir_train, f"img{i:d}.png"))
            Image.fromarray((seg * 255).astype("uint8")).save(os.path.join(tempdir_test, f"seg{i:d}.png"))



        train_images = sorted(
            glob.glob(os.path.join(tempdir_train,  "img*.png")))
        train_labels = sorted(
            glob.glob(os.path.join(tempdir_test,  "seg*.png")))
        data_dicts = [
            {"image": image_name, "label": label_name}
            for image_name, label_name in zip(train_images, train_labels)
        ]
        # use the same file list for both training and validation

        train_files, val_files = data_dicts, data_dicts

        # set deterministic training for reproducibility
        set_determinism(seed=0)

        # define the data transforms
        train_transforms = Compose([
            LoadImaged(keys=["image", "label"]),
            ScaleIntensityRanged(
                keys=["image"], a_min=45, a_max=167,
                b_min=0.0, b_max=1.0, clip=True,
            ),
            CropForegroundd(
                keys=["image", "label"], source_key="label",
                select_fn=self.threshold_lager_one, margin=20,
            ),
            AddChanneld(keys=["image", "label"]),
            Resized(
                keys=["image", "label"], spatial_size=[256, 256],
                mode=("bilinear", "nearest"), align_corners=(False, None),
            ),
        ])
        val_transforms = Compose([
            LoadImaged(keys=["image", "label"]),
            ScaleIntensityRanged(
                keys=["image"], a_min=45, a_max=167,
                b_min=0.0, b_max=1.0, clip=True,
            ),
            CropForegroundd(
                keys=["image", "label"], source_key="label",
                select_fn=self.threshold_lager_one, margin=20,
            ),
            AddChanneld(keys=["image", "label"]),
            Resized(
                keys=["image", "label"], spatial_size=[256, 256],
                mode=("bilinear", "nearest"), align_corners=(False, None),
            ),
        ])



        self.train_ds = Dataset(data=train_files, transform=train_transforms)
        self.val_ds = Dataset(data=val_files, transform=val_transforms)



    def train_dataloader(self):
        train_loader = DataLoader(
            self.train_ds, batch_size=1, shuffle=True,
            num_workers=1, collate_fn=list_data_collate,
        )
        return train_loader

    def val_dataloader(self):
        val_loader = DataLoader(
            self.val_ds, batch_size=1, num_workers=1)
        return val_loader



    def configure_optimizers(self):

        optimizer = monai.optimizers.Novograd(self._model.parameters(), 1e-3)
        return optimizer

    def training_step(self, batch, batch_idx):
        images, labels = batch["image"], batch["label"]
        output = self.forward(images)
        loss = self.loss_function(output, labels)
        tensorboard_logs = {"train_loss": loss.item()}
        return {"loss": loss, "log": tensorboard_logs}

    def validation_step(self, batch, batch_idx):
        images, labels = batch["image"], batch["label"]

        sinf = SimpleInferer()
        outputs = sinf(inputs=images,network=self._model)
        loss = self.loss_function(outputs, labels)
        outputs = [self.post_pred(i) for i in decollate_batch(outputs)]
        labels = [self.post_label(i) for i in decollate_batch(labels)]
        self.dice_metric(y_pred=outputs, y=labels)
        file_name = os.path.basename(batch["image_meta_dict"]["filename_or_obj"][0])
        print(file_name + ":", self.dice_metric.aggregate().item())
        return {"val_loss": loss, "val_number": len(outputs)}

    def validation_epoch_end(self, outputs):
        val_loss, num_items = 0, 0
        for output in outputs:
            val_loss += output["val_loss"].sum().item()
            num_items += output["val_number"]
        current_val_dice = self.dice_metric.aggregate().item()
        self.dice_metric.reset()
        mean_val_loss = torch.tensor(val_loss / num_items)
        tensorboard_logs = {
            "val_dice": current_val_dice,
            "val_loss": mean_val_loss,
        }
        if current_val_dice > self.best_val_dice:
            self.best_val_dice = current_val_dice
            self.best_val_epoch = self.current_epoch
        print(
            f"current epoch: {self.current_epoch} "
            f"current mean dice: {current_val_dice:.4f}"
            f"\nbest mean dice: {self.best_val_dice:.4f} "
            f"at epoch: {self.best_val_epoch}"
        )
        return {"log": tensorboard_logs}

def train():

    net = Net()
    # set up loggers and checkpoints
    log_dir = os.path.join('./tempdir', "logs")
    tb_logger = pytorch_lightning.loggers.TensorBoardLogger(
        save_dir=log_dir
    )

    # initialise Lightning's trainer.
    trainer = pytorch_lightning.Trainer(
        gpus=[0, 1],
        strategy="ddp_spawn",
        max_epochs=100,
        logger=tb_logger,
        checkpoint_callback=True,
        num_sanity_val_steps=1,
        check_val_every_n_epoch=5
    )

    # train
    trainer.fit(net)

if __name__ == "__main__":
    train()
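
To isolate the transform chain from the Trainer entirely, it can be run on a single sample in a plain single-process script. A sketch, assuming the file layout that prepare_data() above creates and a hypothetical standalone select_fn:

from monai.transforms import (
    AddChanneld, Compose, CropForegroundd, LoadImaged, ScaleIntensityRanged,
)

def threshold_above_zero(x):
    # hypothetical standalone select_fn, same rule as threshold_lager_one
    return x > 0

check = Compose([
    LoadImaged(keys=["image", "label"]),
    AddChanneld(keys=["image", "label"]),  # make data channel-first before cropping
    ScaleIntensityRanged(keys=["image"], a_min=45, a_max=167,
                         b_min=0.0, b_max=1.0, clip=True),
    CropForegroundd(keys=["image", "label"], source_key="label",
                    select_fn=threshold_above_zero, margin=20),
])

sample = check({"image": "./tempdir/imagesTr/img0.png",
                "label": "./tempdir/labelsTr/seg0.png"})
print(sample["image"].shape, sample["label"].shape)

If this runs without any process group, the transform itself is fine and the failure is specific to how ddp_spawn serializes it.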


1 reaction
cvlearn913 commented, Mar 20, 2022

Of course! Thanks for your help, @Nic-Ma.

def threshold_lager_one(self, x):
    # threshold at 0
    return x > 0
