[Core] [Bug] Failed to register worker to Raylet for single node, multi-GPU
See original GitHub issue
Search before asking
- I searched the issues and found no similar issues.
Ray Component
Ray Tune
What happened + What you expected to happen
I am trying to run the official tutorial for PyTorch Lightning. It works fine on a single GPU, but fails when the requested resources per trial exceed one GPU:
# Normal behavior with a single GPU
$ python tutorial.py --limit_batches 1 --num_epochs 1 --num_samples 1 --gpus_per_trial 1
Best hyperparameters found were: {'layer_1_size': 128, 'layer_2_size': 256, 'lr': 0.0032251519139857242, 'batch_size': 32}
# Worker registration error when requesting multiple GPUs
$ python tutorial.py --limit_batches 1 --num_epochs 1 --num_samples 1 --gpus_per_trial 4
(ImplicitFunc pid=58090) [2021-12-21 21:59:48,344 E 58996 58996] core_worker.cc:451:
Failed to register worker 2ea0a7ae0dcfec5f2917ed1d37227bb2190a87c16f85aa3f859cd7ef to Raylet.
Invalid: Invalid: Unknown worker
...
The trial shows as RUNNING but never progresses.
This is on a single node/machine with 4 GPUs attached. Based on PyTorch Lightning's trainer, I would expect Ray to distribute trials across all the available GPUs when they are requested as resources.
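As a possible way to narrow this down (a sketch, not part of the tutorial; the trainable and its CUDA_VISIBLE_DEVICES check are illustrative only), a trivial Tune run that requests 4 GPUs per trial without any Lightning code should show whether resource allocation alone triggers the registration failure:

import os

from ray import tune


def check_gpu_allocation(config):
    # Ray exposes the GPUs assigned to a trial through CUDA_VISIBLE_DEVICES;
    # report how many the trial actually sees.
    visible = os.environ.get("CUDA_VISIBLE_DEVICES", "")
    tune.report(num_visible_gpus=len([d for d in visible.split(",") if d]))


if __name__ == "__main__":
    tune.run(
        check_gpu_allocation,
        resources_per_trial={"cpu": 1, "gpu": 4},
        num_samples=1)

If this also hangs with the same "Failed to register worker" error, the problem is independent of the Lightning/DDP accelerator.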
Versions / Dependencies
System
- Python 3.9.7
- Ubuntu 20.04 / AWS p3.8xlarge (with 4 NVIDIA V100 GPUs)
- CUDA 11.5
requirements.txt
pytorch-lightning<1.5
ray[tune]==1.9.0
-f https://download.pytorch.org/whl/cu113/torch_stable.html
torch==1.10.0+cu113
torchvision==0.11.1+cu113
Reproduction script
tutorial.py
from argparse import ArgumentParser, ArgumentDefaultsHelpFormatter
from filelock import FileLock
import math
import os
import pytorch_lightning as pl
from pytorch_lightning.loggers import TensorBoardLogger
from ray import tune
from ray.tune import CLIReporter
from ray.tune.integration.pytorch_lightning import TuneReportCallback, TuneReportCheckpointCallback
from ray.tune.schedulers import ASHAScheduler, PopulationBasedTraining
import torch
from torch.nn import functional as F
from torch.utils.data import DataLoader, random_split
from torchvision import transforms
from torchvision.datasets import MNIST
class LightningMNISTClassifier(pl.LightningModule):
    """
    This has been adapted from
    https://towardsdatascience.com/from-pytorch-to-pytorch-lightning-a-gentle-introduction-b371b7caaf09
    """

    def __init__(self, config, data_dir=None):
        super(LightningMNISTClassifier, self).__init__()
        self.data_dir = data_dir or os.getcwd()
        self.layer_1_size = config["layer_1_size"]
        self.layer_2_size = config["layer_2_size"]
        self.lr = config["lr"]
        self.batch_size = config["batch_size"]
        # mnist images are (1, 28, 28) (channels, width, height)
        self.layer_1 = torch.nn.Linear(28 * 28, self.layer_1_size)
        self.layer_2 = torch.nn.Linear(self.layer_1_size, self.layer_2_size)
        self.layer_3 = torch.nn.Linear(self.layer_2_size, 10)

    def forward(self, x):
        batch_size, channels, width, height = x.size()
        x = x.view(batch_size, -1)
        x = self.layer_1(x)
        x = torch.relu(x)
        x = self.layer_2(x)
        x = torch.relu(x)
        x = self.layer_3(x)
        x = torch.log_softmax(x, dim=1)
        return x

    def cross_entropy_loss(self, logits, labels):
        return F.nll_loss(logits, labels)

    def accuracy(self, logits, labels):
        _, predicted = torch.max(logits.data, 1)
        correct = (predicted == labels).sum().item()
        accuracy = correct / len(labels)
        return torch.tensor(accuracy)

    def training_step(self, train_batch, batch_idx):
        x, y = train_batch
        logits = self.forward(x)
        loss = self.cross_entropy_loss(logits, y)
        accuracy = self.accuracy(logits, y)
        self.log("ptl/train_loss", loss)
        self.log("ptl/train_accuracy", accuracy)
        return loss

    def validation_step(self, val_batch, batch_idx):
        x, y = val_batch
        logits = self.forward(x)
        loss = self.cross_entropy_loss(logits, y)
        accuracy = self.accuracy(logits, y)
        return {"val_loss": loss, "val_accuracy": accuracy}

    def validation_epoch_end(self, outputs):
        avg_loss = torch.stack([x["val_loss"] for x in outputs]).mean()
        avg_acc = torch.stack([x["val_accuracy"] for x in outputs]).mean()
        self.log("ptl/val_loss", avg_loss)
        self.log("ptl/val_accuracy", avg_acc)

    @staticmethod
    def download_data(data_dir):
        transform = transforms.Compose([
            transforms.ToTensor(),
            transforms.Normalize((0.1307, ), (0.3081, ))
        ])
        with FileLock(os.path.expanduser("~/.data.lock")):
            return MNIST(data_dir, train=True, download=True, transform=transform)

    def prepare_data(self):
        mnist_train = self.download_data(self.data_dir)
        self.mnist_train, self.mnist_val = random_split(
            mnist_train, [55000, 5000])

    def train_dataloader(self):
        return DataLoader(self.mnist_train, batch_size=int(self.batch_size))

    def val_dataloader(self):
        return DataLoader(self.mnist_val, batch_size=int(self.batch_size))

    def configure_optimizers(self):
        optimizer = torch.optim.Adam(self.parameters(), lr=self.lr)
        return optimizer

def train_mnist_tune(config, num_epochs=10, num_gpus=0, data_dir="~/data", limit_batches=None):
    data_dir = os.path.expanduser(data_dir)
    model = LightningMNISTClassifier(config, data_dir)
    trainer_kwargs = {
        'max_epochs': num_epochs,
        'gpus': math.ceil(num_gpus),
        'logger': TensorBoardLogger(
            save_dir=tune.get_trial_dir(), name="", version="."),
        'progress_bar_refresh_rate': 0,
        'callbacks': [
            TuneReportCallback(
                {"loss": "ptl/val_loss", "mean_accuracy": "ptl/val_accuracy"},
                on="validation_end")]}
    if num_gpus > 1:
        trainer_kwargs.update({'accelerator': 'ddp'})  # Default ddp_spawn doesn't serialize well
    if limit_batches is not None:
        trainer_kwargs.update({
            'limit_train_batches': limit_batches,
            'limit_val_batches': limit_batches})
    trainer = pl.Trainer(**trainer_kwargs)
    trainer.fit(model)

def tune_mnist_asha(num_samples=10, num_epochs=10, gpus_per_trial=0, data_dir="~/data", limit_batches=None):
    config = {
        "layer_1_size": tune.choice([32, 64, 128]),
        "layer_2_size": tune.choice([64, 128, 256]),
        "lr": tune.loguniform(1e-4, 1e-1),
        "batch_size": tune.choice([32, 64, 128]),
    }
    scheduler = ASHAScheduler(
        max_t=num_epochs,
        grace_period=1,
        reduction_factor=2)
    reporter = CLIReporter(
        parameter_columns=["layer_1_size", "layer_2_size", "lr", "batch_size"],
        metric_columns=["loss", "mean_accuracy", "training_iteration"])
    train_fn_with_parameters = tune.with_parameters(
        train_mnist_tune,
        num_epochs=num_epochs,
        num_gpus=gpus_per_trial,
        data_dir=data_dir,
        limit_batches=limit_batches)
    resources_per_trial = {"cpu": 1, "gpu": gpus_per_trial}
    analysis = tune.run(train_fn_with_parameters,
                        resources_per_trial=resources_per_trial,
                        metric="loss",
                        mode="min",
                        config=config,
                        num_samples=num_samples,
                        scheduler=scheduler,
                        progress_reporter=reporter,
                        name="tune_mnist_asha")
    print("Best hyperparameters found were: ", analysis.best_config)

if __name__ == '__main__':
    parser = ArgumentParser(
        description='Tune hyperparameters for MNIST with PyTorch Lightning',
        formatter_class=lambda prog: ArgumentDefaultsHelpFormatter(prog, width=120, max_help_position=60))
    parser.add_argument('-e', '--num_epochs', type=int, default=10, help='maximum number of training epochs')
    parser.add_argument('-g', '--gpus_per_trial', type=float, default=1, help='GPU allocation per Ray Tune trial')
    parser.add_argument('-l', '--limit_batches', type=int, help='for pl.Trainer, applied to both train/val')
    parser.add_argument('-s', '--num_samples', type=int, default=10, help='number of times to sample the hparam space')
    args = parser.parse_args()
    os.environ['RAY_worker_register_timeout_seconds'] = '30'
    tune_mnist_asha(**vars(args))
Anything else
Based on this discussion post, I tried setting the RAY_worker_register_timeout_seconds environment variable, but it does not fix the issue.
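For reference, the placement of that variable may matter. A sketch of a more conservative attempt follows (my assumption, not verified against Ray internals, is that the variable has to be in the environment before any Ray process starts, and the value 60 is only an arbitrary example):

import os

# Set the override before any Ray import so it is already in the environment
# when Ray starts its backend processes (assumption, see note above).
os.environ["RAY_worker_register_timeout_seconds"] = "60"

import ray  # noqa: E402  (deliberately imported after the env var is set)

ray.init()
# ... then call tune.run(...) exactly as in the repro script above.

Exporting the variable in the shell before launching the script would be the most conservative option.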
cc @ericl @rkooo567 @iycheng (from the request on #8890)
Are you willing to submit a PR?
- Yes I am willing to submit a PR!
Issue Analytics
- Created: 2 years ago
- Reactions: 3
- Comments: 8 (4 by maintainers)
Looks like a P1. I'm putting this into the Core team backlog; let's discuss how to fix it.
Btw, the default timeout is 30 seconds, so you should experiment with values like 60.
@iycheng maybe you can take a look at this? I think it could be related to our recent GCS changes, or there's a failure during worker initialization (which could also be related to recent changes).
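For anyone trying the timeout suggestion above, a sketch of passing a higher value when the driver itself starts the local cluster (an assumption on my part that ray.init() accepts this key via _system_config when starting a fresh local cluster):

import ray

# Assumption: worker_register_timeout_seconds is a valid system-config key and
# only takes effect when this driver starts a new local cluster.
ray.init(_system_config={"worker_register_timeout_seconds": 60})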