InvalidArgumentError: Requires start <= limit when delta > 0 when trying to distribute training
Setup:
- TFRS: v0.4.0
- TF: 2.4.1
- Attempting to run my prototype on 3 machines in AWS, of type m5.4xlarge.
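MultiWorkerMirroredStrategy typically picks up the cluster from a TF_CONFIG environment variable on each machine. A minimal sketch for the three-worker cluster visible in the gRPC log below (the addresses and port are taken from that log; this is not necessarily the exact launch code used here):

```python
import json
import os

# Cluster spec for the three workers seen in the gRPC log
# (10.2.249.213 / 10.2.252.56 / 10.2.252.97, all on port 2121).
cluster = {
    "cluster": {
        "worker": [
            "10.2.249.213:2121",
            "10.2.252.56:2121",
            "10.2.252.97:2121",
        ]
    },
    # task.index differs per machine: 0, 1, or 2.
    "task": {"type": "worker", "index": 0},
}

# Must be set before MultiWorkerMirroredStrategy is constructed.
os.environ["TF_CONFIG"] = json.dumps(cluster)
```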
The TQDM progress bar never moves, and execution still appears to hang even with the progress bar code removed.
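For reference, the message in the title is the one TensorFlow's Range op emits when asked for a range whose start is already past its limit. A standalone repro of just the error message (not of the TFRS call path, which is still unclear at this point):

```python
import tensorflow as tf

# tf.range with start > limit and a positive delta raises
# InvalidArgumentError: Requires start <= limit when delta > 0: 5/0
tf.range(start=5, limit=0, delta=1)
```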
Things attempted:
- I was able to run the MNIST example with the same MultiWorkerMirroredStrategy approach.
- Tried compile() with Adagrad vs. Adam. Tried a small batch size such as 192 vs. a larger one such as 8192. No difference for my prototype (the setup is sketched right after this list).
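The training setup in question is roughly the following. This is a sketch only: build_my_tfrs_model() is a placeholder for the actual tfrs.Model subclass, train_ds stands in for the events dataset, the learning rate is illustrative, and the batch-size and epoch values are the ones mentioned in this report:

```python
import tensorflow as tf

# Collective strategy; picks up the 3-worker cluster from TF_CONFIG.
strategy = tf.distribute.MultiWorkerMirroredStrategy()

GLOBAL_BATCH_SIZE = 8192  # 192 was also tried, with the same result

with strategy.scope():
    # Placeholder for the real TFRS model used in the prototype.
    model = build_my_tfrs_model()
    model.compile(optimizer=tf.keras.optimizers.Adagrad(learning_rate=0.1))

model.fit(train_ds.batch(GLOBAL_BATCH_SIZE), epochs=3)
```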
Can I get any output out of TFRS at this point to tell what might be going wrong? Any other things to try?
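On getting more output: the knobs below are generic TensorFlow logging rather than anything TFRS-specific, but they at least show whether any train step completes. A hedged sketch:

```python
import os
import tensorflow as tf

# Show all C++-level log messages (0 = everything, 3 = errors only);
# best set in the environment before the process starts.
os.environ["TF_CPP_MIN_LOG_LEVEL"] = "0"

# Raise Python-side logger verbosity.
tf.get_logger().setLevel("DEBUG")

# Log which device every op is placed on (very chatty).
tf.debugging.set_log_device_placement(True)

# Print something at the end of every training batch.
batch_logger = tf.keras.callbacks.LambdaCallback(
    on_batch_end=lambda batch, logs: print("finished batch", batch, logs)
)
# model.fit(..., callbacks=[batch_logger], verbose=2)
```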
Output on the command line:
2021-04-12 21:44:02.417762: I tensorflow/stream_executor/platform/default/dso_loader.cc:49] Successfully opened dynamic library libcudart.so.11.0
2021-04-12 21:44:03.666561: I tensorflow/compiler/jit/xla_cpu_device.cc:41] Not creating XLA devices, tf_xla_enable_xla_devices not set
2021-04-12 21:44:03.667485: I tensorflow/stream_executor/platform/default/dso_loader.cc:49] Successfully opened dynamic library libcuda.so.1
2021-04-12 21:44:03.739716: E tensorflow/stream_executor/cuda/cuda_driver.cc:328] failed call to cuInit: CUDA_ERROR_NO_DEVICE: no CUDA-capable device is detected
2021-04-12 21:44:03.739767: I tensorflow/stream_executor/cuda/cuda_diagnostics.cc:156] kernel driver does not appear to be running on this host (ip-10-2-249-213.awsinternal.audiomack.com): /proc/driver/nvidia/version does not exist
2021-04-12 21:44:03.740798: I tensorflow/core/platform/cpu_feature_guard.cc:142] This TensorFlow binary is optimized with oneAPI Deep Neural Network Library (oneDNN) to use the following CPU instructions in performance-critical operations: AVX2 AVX512F FMA
To enable them in other operations, rebuild TensorFlow with the appropriate compiler flags.
2021-04-12 21:44:03.741015: I tensorflow/compiler/jit/xla_gpu_device.cc:99] Not creating XLA devices, tf_xla_enable_xla_devices not set
2021-04-12 21:44:03.741571: I tensorflow/compiler/jit/xla_gpu_device.cc:99] Not creating XLA devices, tf_xla_enable_xla_devices not set
2021-04-12 21:44:03.745589: I tensorflow/core/distributed_runtime/rpc/grpc_channel.cc:301] Initialize GrpcChannelCache for job worker -> {0 -> 10.2.249.213:2121, 1 -> 10.2.252.56:2121, 2 -> 10.2.252.97:2121}
2021-04-12 21:44:03.745902: I tensorflow/core/distributed_runtime/rpc/grpc_server_lib.cc:411] Started server with target: grpc://10.2.249.213:2121
>> 2021-04-12 16:44:07 : >> Running the prototype...
>> Initializing TfrsModelMaker...
>> items_path : s3://my-bucket/recsys-tf/temp-data/20210411173323/items
>> users_path : s3://my-bucket/recsys-tf/temp-data/20210411173323/users
>> events_path : s3://my-bucket/recsys-tf/temp-data/20210411173323/events
>> num_items : 100
>> num_users : 100
>> num_events : 100
2021-04-12 21:44:08.693713: W tensorflow_io/core/kernels/audio_video_mp3_kernels.cc:271] libmp3lame.so.0 or lame functions are not available
2021-04-12 21:44:08.693900: I tensorflow_io/core/kernels/cpu_check.cc:128] Your CPU supports instructions that this TensorFlow IO binary was not compiled to use: AVX2 AVX512F FMA
>> Strategy: <tensorflow.python.distribute.collective_all_reduce_strategy.CollectiveAllReduceStrategy object at 0x7f8ab9d67690>
>> 2021-04-12 16:44:09 : >> Training the model...
2021-04-12 21:44:09.449931: I tensorflow/compiler/mlir/mlir_graph_optimization_pass.cc:116] None of the MLIR optimization passes are enabled (registered 2)
2021-04-12 21:44:09.467194: I tensorflow/core/platform/profile_utils/cpu_utils.cc:112] CPU Frequency: 2499995000 Hz
0%| | 0/3 [00:00<?, ?epoch/s]
0.00batch [00:00, ?batch/s]
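One thing that seems worth checking with a dataset this small (100 items/users/events) spread across three workers is how tf.data shards the input, i.e. whether every worker actually receives data before the first step. A hedged sketch, assuming the events dataset is called train_ds:

```python
import tensorflow as tf

# Make the sharding policy explicit: DATA shards by element, so each
# worker gets a slice even when the input is a single file.
options = tf.data.Options()
options.experimental_distribute.auto_shard_policy = (
    tf.data.experimental.AutoShardPolicy.DATA
)
train_ds = train_ds.with_options(options)

# Quick sanity check on each worker: does at least one batch arrive?
for i, _ in enumerate(train_ds.take(1)):
    print("got a batch on this worker:", i)
```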

This does seem to be the same problem. I’m sorry you’re running into this, it does look like a pain.
Basically the same behavior as with the default strategy: it goes into training. Hasn't finished yet, but it doesn't look unhappy.