
[Bug] Train: adding train.save_checkpoint(epoch=epoch, model=model) hangs the training in CPU-only mode

See original GitHub issue

Search before asking

  • I searched the issues and found no similar issues.

Ray Component

Ray Core, Ray Train

What happened + What you expected to happen

I was closely following this tutorial here and these customizations. When I have train.save_checkpoint(epoch=epoch, model=model) in my script, the training hangs. But when I have train.save_checkpoint(epoch=epoch), it works.
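
For reference, the failing call as it appears in the reproduction script below, and the variant that does not hang:

# Hangs in CPU-only, multi-worker mode:
train.save_checkpoint(epoch=epoch, model=model)

# Does not hang (but the checkpoint no longer contains the model):
train.save_checkpoint(epoch=epoch)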

Here are the error messages I'm getting while it hangs:

(BaseWorkerMixin pid=89188) 2021-12-01 12:38:15.378819: E tensorflow/core/common_runtime/base_collective_executor.cc:249] BaseCollectiveExecutor::StartAbort INTERNAL: Collective instance 223 expected type 0 and data_type 1 but got type 0 and data_type 9
(BaseWorkerMixin pid=89188) Additional GRPC error information from remote target /job:worker/replica:0/task:0:
(BaseWorkerMixin pid=89188) :{"created":"@1638383895.378759000","description":"Error received from peer ipv4:127.0.0.1:65311","file":"external/com_github_grpc_grpc/src/core/lib/surface/call.cc","file_line":1056,"grpc_message":"Collective instance 223 expected type 0 and data_type 1 but got type 0 and data_type 9","grpc_status":13}
(BaseWorkerMixin pid=89187) 2021-12-01 12:38:15.376738: E tensorflow/core/common_runtime/base_collective_executor.cc:249] BaseCollectiveExecutor::StartAbort INTERNAL: Collective instance 223 expected type 0 and data_type 1 but got type 0 and data_type 9
(BaseWorkerMixin pid=89187) Additional GRPC error information from remote target /job:worker/replica:0/task:0:
(BaseWorkerMixin pid=89187) :{"created":"@1638383895.376645000","description":"Error received from peer ipv4:127.0.0.1:65311","file":"external/com_github_grpc_grpc/src/core/lib/surface/call.cc","file_line":1056,"grpc_message":"Collective instance 223 expected type 0 and data_type 1 but got type 0 and data_type 9","grpc_status":13}
 1/15 [=>............................] - ETA: 13s - loss: 2.2937 - accuracy: 0.1152
(BaseWorkerMixin pid=89189) 2021-12-01 12:38:15.368976: E tensorflow/core/common_runtime/base_collective_executor.cc:249] BaseCollectiveExecutor::StartAbort INTERNAL: Collective instance 223 expected type 0 and data_type 1 but got type 0 and data_type 9
(BaseWorkerMixin pid=89189) Additional GRPC error information from remote target /job:worker/replica:0/task:0:
(BaseWorkerMixin pid=89189) :{"created":"@1638383895.368312000","description":"Error received from peer ipv4:127.0.0.1:65311","file":"external/com_github_grpc_grpc/src/core/lib/surface/call.cc","file_line":1056,"grpc_message":"Collective instance 223 expected type 0 and data_type 1 but got type 0 and data_type 9","grpc_status":13}
 1/15 [=>............................] - ETA: 13s - loss: 2.2937 - accuracy: 0.1152

When I press Ctrl+C the first time, it gives me this:

^CTraceback (most recent call last):
  File "training_reproducer.py", line 188, in <module>
    "epochs": 2
  File "/Users/user/anaconda3/envs/ray/lib/python3.7/site-packages/ray/train/trainer.py", line 299, in run
    for intermediate_result in iterator:
  File "/Users/user/anaconda3/envs/ray/lib/python3.7/site-packages/ray/train/trainer.py", line 665, in __next__
    next_results = self._run_with_error_handling(self._fetch_next_result)
  File "/Users/user/anaconda3/envs/ray/lib/python3.7/site-packages/ray/train/trainer.py", line 638, in _run_with_error_handling
    return func()
  File "/Users/user/anaconda3/envs/ray/lib/python3.7/site-packages/ray/train/trainer.py", line 691, in _fetch_next_result
    self._backend_executor_actor.get_next_results.remote())
  File "/Users/user/anaconda3/envs/ray/lib/python3.7/site-packages/ray/_private/client_mode_hook.py", line 105, in wrapper
    return func(*args, **kwargs)
  File "/Users/user/anaconda3/envs/ray/lib/python3.7/site-packages/ray/worker.py", line 1721, in get
    object_refs, timeout=timeout)
  File "/Users/user/anaconda3/envs/ray/lib/python3.7/site-packages/ray/worker.py", line 354, in get_objects
    object_refs, self.current_task_id, timeout_ms)
  File "python/ray/_raylet.pyx", line 1168, in ray._raylet.CoreWorker.get_objects
  File "python/ray/_raylet.pyx", line 153, in ray._raylet.check_status
KeyboardInterrupt

And for the second Ctrl+C:

^CError in atexit._run_exitfuncs:
Traceback (most recent call last):
  File "/Users/user/anaconda3/envs/ray/lib/python3.7/site-packages/ray/node.py", line 988, in _kill_process_type
    wait=wait)
  File "/Users/user/anaconda3/envs/ray/lib/python3.7/site-packages/ray/node.py", line 1040, in _kill_process_impl
    process.wait(timeout_seconds)
  File "/Users/user/anaconda3/envs/ray/lib/python3.7/subprocess.py", line 1019, in wait
    return self._wait(timeout=timeout)
  File "/Users/user/anaconda3/envs/ray/lib/python3.7/subprocess.py", line 1647, in _wait
    time.sleep(delay)

I have a feeling that something is wrong in the synchronization between worker checkpointing and processing - they seem to end up in a deadlock.
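
As a possible workaround - just a sketch, and only assuming the hang really is triggered by serializing the strategy-scoped Keras model object inside train.save_checkpoint - checkpointing plain weights and rebuilding the model on restore should keep anything tied to the collective ops out of the checkpoint. The names strategy and build_and_compile_cnn_model come from the reproduction script below; model_weights is just an arbitrary checkpoint key:

# Workaround sketch (assumption: the hang comes from checkpointing the
# strategy-scoped Keras model object itself). Save plain numpy weights
# instead of the model:
train.save_checkpoint(epoch=epoch, model_weights=model.get_weights())

# ...and on restore, rebuild the model under the strategy scope and load
# the weights back in:
checkpoint = train.load_checkpoint() or {}
with strategy.scope():
    model = build_and_compile_cnn_model()
if "model_weights" in checkpoint:
    model.set_weights(checkpoint["model_weights"])
start_epoch = checkpoint.get("epoch", -1) + 1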

Versions / Dependencies

  • python: 3.7.10; 3.8
  • ray: 1.8; 1.9rc2; nightly build (2.0-dev)
  • tensorflow: 2.7

I tested it on CentOS 7 (kernel 3.10, Intel® Xeon® Gold 6140 CPU @ 2.30GHz, Skylake, 36 cores) and on a MacBook Pro (Big Sur 11.6, Mid 2014, 2.8 GHz Quad-Core Intel Core i7, Haswell, 4 cores).

Honestly, I hoped that it would work at least on the Mac.

Reproduction script

import numpy as np

def mnist_dataset(batch_size):
    (x_train, y_train), _ = tf.keras.datasets.mnist.load_data()
    # The `x` arrays are in uint8 and have values in the [0, 255] range.
    # You need to convert them to float32 with values in the [0, 1] range.
    x_train = x_train / np.float32(255)
    y_train = y_train.astype(np.int64)
    train_dataset = tf.data.Dataset.from_tensor_slices(
        (x_train, y_train)).shuffle(60000).repeat(1).batch(batch_size)
    return train_dataset


def build_and_compile_cnn_model():
    model = tf.keras.Sequential([
        tf.keras.layers.InputLayer(input_shape=(28, 28)),
        tf.keras.layers.Reshape(target_shape=(28, 28, 1)),
        tf.keras.layers.Conv2D(32, 3, activation='relu'),
        tf.keras.layers.Flatten(),
        tf.keras.layers.Dense(128, activation='relu'),
        tf.keras.layers.Dense(10)
    ])
    model.compile(
        loss=tf.keras.losses.SparseCategoricalCrossentropy(from_logits=True),
        optimizer=tf.keras.optimizers.SGD(learning_rate=0.001),
        metrics=['accuracy'])
    return model

import json
import os
from ray import train
os.environ["CUDA_VISIBLE_DEVICES"] = "-1"
os.environ["TRAIN_RESULT_ENABLE_DETAILED_AUTOFILLED_METRICS"] = "1"

import tensorflow as tf
from tensorflow.keras.callbacks import Callback

def train_func():
    """
    Single worker training
    """
    
    os.environ["CUDA_VISIBLE_DEVICES"] = "-1"

    batch_size = 64
    # These functions are defined above in this script (no separate mnist module).
    single_worker_dataset = mnist_dataset(batch_size)
    single_worker_model = build_and_compile_cnn_model()
    single_worker_model.fit(single_worker_dataset, epochs=3, steps_per_epoch=70)

class TrainReportCallback(Callback):
    def on_epoch_end(self, epoch, logs=None):
        train.report(**logs)

def train_func_distributed(config):
    """
    Multiworker training
    """

    os.environ["CUDA_VISIBLE_DEVICES"] = "-1"

    batch_size = config.get("batch_size", 64)
    epochs = config.get("epochs", 3)
    steps_per_epoch = config.get("steps_per_epoch", 70)

    per_worker_batch_size = batch_size
    # This environment variable will be set by Ray Train.
    tf_config = json.loads(os.environ['TF_CONFIG'])
    num_workers = len(tf_config['cluster']['worker'])

    strategy = tf.distribute.MultiWorkerMirroredStrategy()

    global_batch_size = per_worker_batch_size * num_workers
    multi_worker_dataset = mnist_dataset(global_batch_size)

    with strategy.scope():
        # Model building/compiling need to be within `strategy.scope()`.
        multi_worker_model = build_and_compile_cnn_model()

    #multi_worker_model.fit(multi_worker_dataset, epochs=epochs, steps_per_epoch=steps_per_epoch)
    results = []
    checkpoint = train.load_checkpoint() or {}
    
    model = checkpoint.get("model", multi_worker_model)
    start_epoch = checkpoint.get("epoch", -1) + 1
    for epoch in range(start_epoch, epochs):
        history = model.fit(
            multi_worker_dataset,
            callbacks=[TrainReportCallback()]
        )
        # COMMENT THIS OUT TO FIX THE HANGING
        train.save_checkpoint(epoch=epoch,model=model)
        results.append(history.history)
    return results

if __name__ == "__main__":
    
    import argparse

    try:
        import ray
    except ImportError:
        # Just train using basic method
        train_func()
        exit(0)
    from ray.train import Trainer

    parser = argparse.ArgumentParser()
    parser.add_argument(
        "--address",
        required=False,
        type=str,
        help="the address to use for Ray")
    parser.add_argument(
        "--num-workers",
        "-n",
        type=int,
        default=4,
        help="Sets number of workers for training.")
    parser.add_argument(
        "--use-gpu",
        action="store_true",
        default=False,
        help="Enables GPU training")
    parser.add_argument(
        "--smoke-test",
        action="store_true",
        default=False,
        help="Finish quickly for testing.")
    parser.add_argument(
        "--ray-log-dir",
        "-d",
        nargs='?',
        default=os.getcwd(),
        type=str,
        required=False,
        help="Log Dir to store atrifacts of training (defult `pwd`).")

    args, _ = parser.parse_known_args()

    if args.smoke_test:
        # 1 for datasets
        num_cpus = args.num_workers + 1
        num_gpus = args.num_workers if args.use_gpu else 0
        ray.init(num_cpus=num_cpus, num_gpus=num_gpus)
    else:
        ray.init(address=args.address)

    LOGDIR = args.ray_log_dir

    # train using all workers from ray
    print("Set trainer")
    trainer = Trainer(
        backend="tensorflow",
        num_workers=args.num_workers,
        logdir=LOGDIR,
        use_gpu=args.use_gpu
    )
    print("Start trainer")
    trainer.start()
    print("Run trainer")
    results = trainer.run(
        train_func_distributed,
        config={
            "lr": 1e-3,
            "batch_size": 1024,
            "epochs": 2
        }
    )
    print("Shutdown trainer")
    trainer.shutdown()
    print(f"Results: {results[0]}")
    print(trainer.latest_checkpoint)
    print(trainer.latest_checkpoint_dir)
    print(trainer.latest_checkpoint_path)

Anything else

I run it like this:

python training_reproducer.py --smoke-test

Please note that I'm doing CPU-only training here. This is intentional since I'm trying to scale it across many CPU nodes.
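
For the multi-node runs I'd launch it roughly like this (illustrative only - the head-node address and worker count are placeholders for my cluster):

# on the head node
ray start --head
# on every other node
ray start --address=<head-node-ip>:6379
# then submit the training script against the existing cluster
python training_reproducer.py --address=<head-node-ip>:6379 --num-workers=36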

Are you willing to submit a PR?

  • Yes I am willing to submit a PR!

Issue Analytics

  • State: closed
  • Created 2 years ago
  • Comments: 5 (3 by maintainers)

Top GitHub Comments

1 reaction
richardliaw commented, Dec 3, 2021

@amogkam or @matthewdeng are the right folks to take this issue.

I’ll assign it to both of them for them to triage for now.

0 reactions
matthewdeng commented, Dec 6, 2021

@thoth291 yep totally agreed, I just created an issue to track this documentation update.

Marking this issue as closed, please open another issue if you run into any problems with Tune!
