[Bug] Train: adding train.save_checkpoint(epoch=epoch, model=model) hangs the training in CPU-only mode
Search before asking
- I searched the issues and found no similar issues.
Ray Component
Ray Core, Ray Train
What happened + What you expected to happen
I was closely following this tutorial and these customizations.
When my script calls `train.save_checkpoint(epoch=epoch, model=model)`, training hangs.
But when it calls `train.save_checkpoint(epoch=epoch)` (without the model), it works.
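For reference, this is the only difference inside the per-epoch loop of the reproduction script below (both lines use the same `epoch` and `model` variables):

# Variant that hangs in CPU-only multi-worker mode:
train.save_checkpoint(epoch=epoch, model=model)
# Variant that works:
train.save_checkpoint(epoch=epoch)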
Here are the error messages I'm getting while the training hangs.
(BaseWorkerMixin pid=89188) 2021-12-01 12:38:15.378819: E tensorflow/core/common_runtime/base_collective_executor.cc:249] BaseCollectiveExecutor::StartAbort INTERNAL: Collective instance 223 expected type 0 and data_type 1 but got type 0 and data_type 9
(BaseWorkerMixin pid=89188) Additional GRPC error information from remote target /job:worker/replica:0/task:0:
(BaseWorkerMixin pid=89188) :{"created":"@1638383895.378759000","description":"Error received from peer ipv4:127.0.0.1:65311","file":"external/com_github_grpc_grpc/src/core/lib/surface/call.cc","file_line":1056,"grpc_message":"Collective instance 223 expected type 0 and data_type 1 but got type 0 and data_type 9","grpc_status":13}
(BaseWorkerMixin pid=89187) 2021-12-01 12:38:15.376738: E tensorflow/core/common_runtime/base_collective_executor.cc:249] BaseCollectiveExecutor::StartAbort INTERNAL: Collective instance 223 expected type 0 and data_type 1 but got type 0 and data_type 9
(BaseWorkerMixin pid=89187) Additional GRPC error information from remote target /job:worker/replica:0/task:0:
(BaseWorkerMixin pid=89187) :{"created":"@1638383895.376645000","description":"Error received from peer ipv4:127.0.0.1:65311","file":"external/com_github_grpc_grpc/src/core/lib/surface/call.cc","file_line":1056,"grpc_message":"Collective instance 223 expected type 0 and data_type 1 but got type 0 and data_type 9","grpc_status":13}
1/15 [=>............................] - ETA: 13s - loss: 2.2937 - accuracy: 0.1152
(BaseWorkerMixin pid=89189) 2021-12-01 12:38:15.368976: E tensorflow/core/common_runtime/base_collective_executor.cc:249] BaseCollectiveExecutor::StartAbort INTERNAL: Collective instance 223 expected type 0 and data_type 1 but got type 0 and data_type 9
(BaseWorkerMixin pid=89189) Additional GRPC error information from remote target /job:worker/replica:0/task:0:
(BaseWorkerMixin pid=89189) :{"created":"@1638383895.368312000","description":"Error received from peer ipv4:127.0.0.1:65311","file":"external/com_github_grpc_grpc/src/core/lib/surface/call.cc","file_line":1056,"grpc_message":"Collective instance 223 expected type 0 and data_type 1 but got type 0 and data_type 9","grpc_status":13}
1/15 [=>............................] - ETA: 13s - loss: 2.2937 - accuracy: 0.1152
When I press Ctrl+C the first time, I get this:
^CTraceback (most recent call last):
File "training_reproducer.py", line 188, in <module>
"epochs": 2
File "/Users/user/anaconda3/envs/ray/lib/python3.7/site-packages/ray/train/trainer.py", line 299, in run
for intermediate_result in iterator:
File "/Users/user/anaconda3/envs/ray/lib/python3.7/site-packages/ray/train/trainer.py", line 665, in __next__
next_results = self._run_with_error_handling(self._fetch_next_result)
File "/Users/user/anaconda3/envs/ray/lib/python3.7/site-packages/ray/train/trainer.py", line 638, in _run_with_error_handling
return func()
File "/Users/user/anaconda3/envs/ray/lib/python3.7/site-packages/ray/train/trainer.py", line 691, in _fetch_next_result
self._backend_executor_actor.get_next_results.remote())
File "/Users/user/anaconda3/envs/ray/lib/python3.7/site-packages/ray/_private/client_mode_hook.py", line 105, in wrapper
return func(*args, **kwargs)
File "/Users/user/anaconda3/envs/ray/lib/python3.7/site-packages/ray/worker.py", line 1721, in get
object_refs, timeout=timeout)
File "/Users/user/anaconda3/envs/ray/lib/python3.7/site-packages/ray/worker.py", line 354, in get_objects
object_refs, self.current_task_id, timeout_ms)
File "python/ray/_raylet.pyx", line 1168, in ray._raylet.CoreWorker.get_objects
File "python/ray/_raylet.pyx", line 153, in ray._raylet.check_status
KeyboardInterrupt
And after the second Ctrl+C:
^CError in atexit._run_exitfuncs:
Traceback (most recent call last):
File "/Users/user/anaconda3/envs/ray/lib/python3.7/site-packages/ray/node.py", line 988, in _kill_process_type
wait=wait)
File "/Users/user/anaconda3/envs/ray/lib/python3.7/site-packages/ray/node.py", line 1040, in _kill_process_impl
process.wait(timeout_seconds)
File "/Users/user/anaconda3/envs/ray/lib/python3.7/subprocess.py", line 1019, in wait
return self._wait(timeout=timeout)
File "/Users/user/anaconda3/envs/ray/lib/python3.7/subprocess.py", line 1647, in _wait
time.sleep(delay)
I have a feeling that something is wrong in the synchronization between the workers' checkpointing and processing, and that they end up in a deadlock.
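A possible workaround, assuming the hang comes from serializing the strategy-scoped Keras model object: checkpoint plain numpy weights via `model.get_weights()` instead of the model itself, and restore them with `set_weights()` after rebuilding the model under `strategy.scope()`. A minimal, unverified sketch (the `model_weights` key is just a name I picked; `train.save_checkpoint` accepts arbitrary keyword arguments):

# Sketch, not verified: save only numpy weights instead of the Keras model object.
for epoch in range(start_epoch, epochs):
    history = model.fit(multi_worker_dataset, callbacks=[TrainReportCallback()])
    # get_weights() returns plain numpy arrays, so the checkpoint does not
    # carry the distributed variables owned by MultiWorkerMirroredStrategy.
    train.save_checkpoint(epoch=epoch, model_weights=model.get_weights())
    results.append(history.history)

# On resume, rebuild the model under strategy.scope() and then:
# checkpoint = train.load_checkpoint() or {}
# if "model_weights" in checkpoint:
#     multi_worker_model.set_weights(checkpoint["model_weights"])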
Versions / Dependencies
- Python: 3.7.10, 3.8
- Ray: 1.8, 1.9rc2, nightly build (2.0-dev)
- TensorFlow: 2.7
I tested it on CentOS 7 (Linux kernel 3.10, Intel® Xeon® Gold 6140 CPU @ 2.30GHz, Skylake, 36 cores) and on a MacBook Pro (Big Sur 11.6, Mid 2014, 2.8 GHz Quad-Core Intel Core i7, Haswell, 4 cores).
Honestly, I hoped that it would work at least on the Mac.
Reproduction script
import numpy as np


def mnist_dataset(batch_size):
    (x_train, y_train), _ = tf.keras.datasets.mnist.load_data()
    # The `x` arrays are in uint8 and have values in the [0, 255] range.
    # You need to convert them to float32 with values in the [0, 1] range.
    x_train = x_train / np.float32(255)
    y_train = y_train.astype(np.int64)
    train_dataset = tf.data.Dataset.from_tensor_slices(
        (x_train, y_train)).shuffle(60000).repeat(1).batch(batch_size)
    return train_dataset


def build_and_compile_cnn_model():
    model = tf.keras.Sequential([
        tf.keras.layers.InputLayer(input_shape=(28, 28)),
        tf.keras.layers.Reshape(target_shape=(28, 28, 1)),
        tf.keras.layers.Conv2D(32, 3, activation='relu'),
        tf.keras.layers.Flatten(),
        tf.keras.layers.Dense(128, activation='relu'),
        tf.keras.layers.Dense(10)
    ])
    model.compile(
        loss=tf.keras.losses.SparseCategoricalCrossentropy(from_logits=True),
        optimizer=tf.keras.optimizers.SGD(learning_rate=0.001),
        metrics=['accuracy'])
    return model
import json
import os
from ray import train
os.environ["CUDA_VISIBLE_DEVICES"] = "-1"
os.environ["TRAIN_RESULT_ENABLE_DETAILED_AUTOFILLED_METRICS"] = "1"
import tensorflow as tf
from tensorflow.keras.callbacks import Callback
def train_func():
    """
    Single-worker training (used when Ray is not available).
    """
    os.environ["CUDA_VISIBLE_DEVICES"] = "-1"
    batch_size = 64
    single_worker_dataset = mnist_dataset(batch_size)
    single_worker_model = build_and_compile_cnn_model()
    single_worker_model.fit(single_worker_dataset, epochs=3, steps_per_epoch=70)


class TrainReportCallback(Callback):
    def on_epoch_end(self, epoch, logs=None):
        # Report the Keras metrics of this epoch back to Ray Train.
        train.report(**logs)
def train_func_distributed(config):
    """
    Multi-worker training
    """
    os.environ["CUDA_VISIBLE_DEVICES"] = "-1"
    batch_size = config.get("batch_size", 64)
    epochs = config.get("epochs", 3)
    steps_per_epoch = config.get("steps_per_epoch", 70)
    per_worker_batch_size = batch_size
    # This environment variable will be set by Ray Train.
    tf_config = json.loads(os.environ['TF_CONFIG'])
    num_workers = len(tf_config['cluster']['worker'])
    strategy = tf.distribute.MultiWorkerMirroredStrategy()
    global_batch_size = per_worker_batch_size * num_workers
    multi_worker_dataset = mnist_dataset(global_batch_size)
    with strategy.scope():
        # Model building/compiling need to be within `strategy.scope()`.
        multi_worker_model = build_and_compile_cnn_model()
    # multi_worker_model.fit(multi_worker_dataset, epochs=epochs, steps_per_epoch=steps_per_epoch)
    results = []
    checkpoint = train.load_checkpoint() or {}
    model = checkpoint.get("model", multi_worker_model)
    start_epoch = checkpoint.get("epoch", -1) + 1
    for epoch in range(start_epoch, epochs):
        history = model.fit(
            multi_worker_dataset,
            callbacks=[TrainReportCallback()]
        )
        # COMMENT THIS OUT TO FIX THE HANGING
        train.save_checkpoint(epoch=epoch, model=model)
        results.append(history.history)
    return results
if __name__ == "__main__":
    import argparse

    try:
        import ray
    except ImportError:
        # Ray is not available - just train with the basic single-worker method.
        train_func()
        exit(0)

    from ray.train import Trainer

    parser = argparse.ArgumentParser()
    parser.add_argument(
        "--address",
        required=False,
        type=str,
        help="The address to use for Ray.")
    parser.add_argument(
        "--num-workers",
        "-n",
        type=int,
        default=4,
        help="Sets number of workers for training.")
    parser.add_argument(
        "--use-gpu",
        action="store_true",
        default=False,
        help="Enables GPU training.")
    parser.add_argument(
        "--smoke-test",
        action="store_true",
        default=False,
        help="Finish quickly for testing.")
    parser.add_argument(
        "--ray-log-dir",
        "-d",
        nargs='?',
        default=os.getcwd(),
        type=str,
        required=False,
        help="Log dir to store artifacts of training (default: `pwd`).")
    args, _ = parser.parse_known_args()
    if args.smoke_test:
        # 1 for datasets
        num_cpus = args.num_workers + 1
        num_gpus = args.num_workers if args.use_gpu else 0
        ray.init(num_cpus=num_cpus, num_gpus=num_gpus)
    else:
        ray.init(address=args.address)

    LOGDIR = args.ray_log_dir

    # Train using all workers from Ray.
    print("Set trainer")
    trainer = Trainer(
        backend="tensorflow",
        num_workers=args.num_workers,
        logdir=LOGDIR,
        use_gpu=args.use_gpu
    )
    print("Start trainer")
    trainer.start()
    print("Run trainer")
    results = trainer.run(
        train_func_distributed,
        config={
            "lr": 1e-3,
            "batch_size": 1024,
            "epochs": 2
        }
    )
    print("Shutdown trainer")
    trainer.shutdown()
    print(f"Results: {results[0]}")
    print(trainer.latest_checkpoint)
    print(trainer.latest_checkpoint_dir)
    print(trainer.latest_checkpoint_path)
Anything else
I run it like this:
python training_reproducer.py --smoke-test
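The script also accepts an address for running against an existing Ray cluster instead of the local smoke test; for example (the `auto` address is just the usual placeholder for connecting to a cluster started on the same node):
python training_reproducer.py --address auto --num-workers 4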
Please note that I'm doing CPU-only training here. This is intentional since I'm trying to scale it across many CPU nodes.
Are you willing to submit a PR?
- Yes I am willing to submit a PR!
@amogkam or @matthewdeng are the right folks to take this issue.
I'll assign it to both of them to triage for now.
@thoth291 yep totally agreed, I just created an issue to track this documentation update.
Marking this issue as closed, please open another issue if you run into any problems with Tune!