
CropForegroundd does not support the 'ddp_spawn' distributed strategy?

See original GitHub issue

Describe the bug

When I use the following transform for distributed training, I get an error:

train_transforms = Compose([
    LoadImaged(keys=["image", "label"]),
    ScaleIntensityRanged(
        keys=["image"], a_min=45, a_max=167,
        b_min=0.0, b_max=1.0, clip=True,
    ),
    AddChanneld(keys=["image", "label"]),
    CropForegroundd(
        keys=["image", "label"], source_key="label",
        select_fn=self.threshold_lager_one, margin=20,
    ),
    Resized(
        keys=["image", "label"], spatial_size=[256, 256, 256],
        mode=("trilinear", "nearest"), align_corners=(False, None),
    ),
])

The error is: "Default process group has not been initialized, please make sure to call init_process_group".

To Reproduce

Here is my pytorch_lightning trainer setting:

trainer = pytorch_lightning.Trainer(
    gpus=[0, 1],
    strategy='ddp_spawn',
    max_epochs=50,
    logger=tb_logger,
    checkpoint_callback=True,
    num_sanity_val_steps=1,
    check_val_every_n_epoch=5,
    log_every_n_steps=1,
)
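
As a hedged aside (not raised in this excerpt): on PyTorch Lightning releases of this vintage, strategy='ddp' starts each GPU worker by re-launching the script in a subprocess, so the LightningModule is never pickled the way 'ddp_spawn' requires. If the failure is pickling-related, swapping strategies is a quick A/B test; a minimal sketch with the same two-GPU setup:

import pytorch_lightning

# sketch: identical settings, but 'ddp' skips the pickling step of 'ddp_spawn'
trainer = pytorch_lightning.Trainer(
    gpus=[0, 1],
    strategy='ddp',
    max_epochs=50,
)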

When I comment out CropForegroundd, distributed training works, which is strange. (A screenshot dated 2022-03-20 13:47:33 was attached here; omitted.)
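
One plausible cause (an assumption on this page, not a maintainer-confirmed diagnosis): select_fn=self.threshold_lager_one is a bound method, so when ddp_spawn pickles the transforms for its spawned workers, it drags the entire LightningModule, including its distributed hooks, into the child process before init_process_group has run. CropForegroundd itself performs no distributed communication. A minimal sketch of the usual workaround, using a hypothetical module-level function in place of the bound method:

from monai.transforms import CropForegroundd

def threshold_above_zero(x):
    # hypothetical standalone replacement for self.threshold_lager_one;
    # a plain function holds no reference to the model and pickles cleanly
    return x > 0

crop = CropForegroundd(
    keys=["image", "label"],
    source_key="label",
    select_fn=threshold_above_zero,
    margin=20,
)

If that hypothesis holds, the transform pickles cleanly and the process-group error should disappear.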

Issue Analytics

  • State: closed
  • Created 2 years ago
  • Comments:5 (3 by maintainers)

Top GitHub Comments

1 reaction
cvlearn913 commented, Mar 21, 2022

For 2D fake data:

import monai
from monai.utils import set_determinism
from PIL import Image
from monai.transforms import (
    AsDiscrete,
    Compose,
    EnsureType,
    Activations, LoadImaged, ScaleIntensityRanged, AddChanneld, CropForegroundd, Resized
)
from monai.networks.nets import UNet
from monai.networks.layers import Norm
from monai.metrics import DiceMetric
from monai.losses import DiceLoss
from monai.inferers import SimpleInferer
from monai.data import list_data_collate, decollate_batch, create_test_image_2d, CacheDataset, DataLoader, Dataset, \
    create_test_image_3d
import torch
import pytorch_lightning

import glob
import os
import matplotlib
matplotlib.use('agg')

class Net(pytorch_lightning.LightningModule):
    def __init__(self):
        super().__init__()
        self._model = UNet(
            spatial_dims=2,
            in_channels=1,
            out_channels=1,
            channels=(16, 32, 64, 128, 256),
            strides=(2, 2, 2, 2),
            num_res_units=0,
            norm=Norm.BATCH,
        )
        self.loss_function = DiceLoss(to_onehot_y=True, sigmoid=True)
        self.post_pred = Compose([EnsureType(), Activations(sigmoid=True)])
        self.post_label = Compose([EnsureType()])
        self.dice_metric = DiceMetric(include_background=False, reduction="mean_batch", get_not_nans=False)


        self.best_val_dice = 0
        self.best_val_epoch = 0

    def forward(self, x):
        return self._model(x).type_as(x)

    def threshold_lager_one(self, x):
        # threshold at 0
        return x > 0

    def prepare_data(self):

        tempdir_train = "./tempdir/imagesTr"
        tempdir_test = "./tempdir/labelsTr"
        # create the output folders first, otherwise PIL's save() fails
        os.makedirs(tempdir_train, exist_ok=True)
        os.makedirs(tempdir_test, exist_ok=True)
        for i in range(20):
            im, seg = create_test_image_2d(512, 512, num_seg_classes=1)
            Image.fromarray((im * 255).astype("uint8")).save(os.path.join(tempdir_train, f"img{i:d}.png"))
            Image.fromarray((seg * 255).astype("uint8")).save(os.path.join(tempdir_test, f"seg{i:d}.png"))



        train_images = sorted(
            glob.glob(os.path.join(tempdir_train,  "img*.png")))
        train_labels = sorted(
            glob.glob(os.path.join(tempdir_test,  "seg*.png")))
        data_dicts = [
            {"image": image_name, "label": label_name}
            for image_name, label_name in zip(train_images, train_labels)
        ]
        # use the same file list for both training and validation

        train_files, val_files = data_dicts, data_dicts

        # set deterministic training for reproducibility
        set_determinism(seed=0)

        # define the data transforms
        train_transforms = Compose([
            LoadImaged(keys=["image", "label"]),
            ScaleIntensityRanged(
                keys=["image"], a_min=45, a_max=167,
                b_min=0.0, b_max=1.0, clip=True,
            ),
            CropForegroundd(
                keys=["image", "label"], source_key="label",
                select_fn=self.threshold_lager_one, margin=20,
            ),
            AddChanneld(keys=["image", "label"]),
            Resized(
                keys=["image", "label"], spatial_size=[256, 256],
                mode=("bilinear", "nearest"), align_corners=(False, None),
            ),
        ])
        val_transforms = Compose([
            LoadImaged(keys=["image", "label"]),
            ScaleIntensityRanged(
                keys=["image"], a_min=45, a_max=167,
                b_min=0.0, b_max=1.0, clip=True,
            ),
            CropForegroundd(
                keys=["image", "label"], source_key="label",
                select_fn=self.threshold_lager_one, margin=20,
            ),
            AddChanneld(keys=["image", "label"]),
            Resized(
                keys=["image", "label"], spatial_size=[256, 256],
                mode=("bilinear", "nearest"), align_corners=(False, None),
            ),
        ])



        self.train_ds = Dataset(data=train_files, transform=train_transforms)
        self.val_ds = Dataset(data=val_files, transform=val_transforms)



    def train_dataloader(self):
        train_loader = DataLoader(
            self.train_ds, batch_size=1, shuffle=True,
            num_workers=1, collate_fn=list_data_collate,
        )
        return train_loader

    def val_dataloader(self):
        val_loader = DataLoader(
            self.val_ds, batch_size=1, num_workers=1)
        return val_loader



    def configure_optimizers(self):

        optimizer = monai.optimizers.Novograd(self._model.parameters(), 1e-3)
        return optimizer

    def training_step(self, batch, batch_idx):
        images, labels = batch["image"], batch["label"]
        output = self.forward(images)
        loss = self.loss_function(output, labels)
        tensorboard_logs = {"train_loss": loss.item()}
        return {"loss": loss, "log": tensorboard_logs}

    def validation_step(self, batch, batch_idx):
        images, labels = batch["image"], batch["label"]

        sinf = SimpleInferer()
        outputs = sinf(inputs=images,network=self._model)
        loss = self.loss_function(outputs, labels)
        outputs = [self.post_pred(i) for i in decollate_batch(outputs)]
        labels = [self.post_label(i) for i in decollate_batch(labels)]
        self.dice_metric(y_pred=outputs, y=labels)
        file_name = os.path.basename(batch["image_meta_dict"]["filename_or_obj"][0])
        print(file_name + ":", self.dice_metric.aggregate().item())
        return {"val_loss": loss, "val_number": len(outputs)}

    def validation_epoch_end(self, outputs):
        val_loss, num_items = 0, 0
        for output in outputs:
            val_loss += output["val_loss"].sum().item()
            num_items += output["val_number"]
        current_val_dice = self.dice_metric.aggregate().item()
        self.dice_metric.reset()
        mean_val_loss = torch.tensor(val_loss / num_items)
        tensorboard_logs = {
            "val_dice": current_val_dice,
            "val_loss": mean_val_loss,
        }
        if current_val_dice > self.best_val_dice:
            self.best_val_dice = current_val_dice
            self.best_val_epoch = self.current_epoch
        print(
            f"current epoch: {self.current_epoch} "
            f"current mean dice: {current_val_dice:.4f}"
            f"\nbest mean dice: {self.best_val_dice:.4f} "
            f"at epoch: {self.best_val_epoch}"
        )
        return {"log": tensorboard_logs}

def train():

    net = Net()
    # set up loggers and checkpoints
    log_dir = os.path.join('./tempdir', "logs")
    tb_logger = pytorch_lightning.loggers.TensorBoardLogger(
        save_dir=log_dir
    )

    # initialise Lightning's trainer.
    trainer = pytorch_lightning.Trainer(
        gpus=[0, 1],
        strategy="ddp_spawn",
        max_epochs=100,
        logger=tb_logger,
        checkpoint_callback=True,
        num_sanity_val_steps=1,
        check_val_every_n_epoch=5
    )

    # train
    trainer.fit(net)

if __name__ == "__main__":
    train()
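
To isolate the transform chain from the Trainer entirely, it can be run on a single sample in a plain single-process script. A sketch, assuming the file layout that prepare_data() above creates and a hypothetical standalone select_fn:

from monai.transforms import (
    AddChanneld, Compose, CropForegroundd, LoadImaged, ScaleIntensityRanged,
)

def threshold_above_zero(x):
    # hypothetical standalone select_fn, same rule as threshold_lager_one
    return x > 0

check = Compose([
    LoadImaged(keys=["image", "label"]),
    AddChanneld(keys=["image", "label"]),  # make data channel-first before cropping
    ScaleIntensityRanged(keys=["image"], a_min=45, a_max=167,
                         b_min=0.0, b_max=1.0, clip=True),
    CropForegroundd(keys=["image", "label"], source_key="label",
                    select_fn=threshold_above_zero, margin=20),
])

sample = check({"image": "./tempdir/imagesTr/img0.png",
                "label": "./tempdir/labelsTr/seg0.png"})
print(sample["image"].shape, sample["label"].shape)

If this runs without any process group, the transform itself is fine and the failure is specific to how ddp_spawn serializes it.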


1 reaction
cvlearn913 commented, Mar 20, 2022

Of course! Thanks for your help, @Nic-Ma.

def threshold_lager_one(self, x):
    # threshold at 0
    return x > 0
