InvalidArgumentError: Requires start <= limit when delta > 0 when trying to distribute training
Setup:
- TFRS: v0.4.0
- TF: 2.4.1
- Attempting to run my prototype on 3 machines in AWS, of type m5.4xlarge.
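MultiWorkerMirroredStrategy typically picks up the cluster from a TF_CONFIG environment variable on each machine. A minimal sketch for the three-worker cluster visible in the gRPC log below (the addresses and port are taken from that log; this is not necessarily the exact launch code used here):

```python
import json
import os

# Cluster spec for the three workers seen in the gRPC log
# (10.2.249.213 / 10.2.252.56 / 10.2.252.97, all on port 2121).
cluster = {
    "cluster": {
        "worker": [
            "10.2.249.213:2121",
            "10.2.252.56:2121",
            "10.2.252.97:2121",
        ]
    },
    # task.index differs per machine: 0, 1, or 2.
    "task": {"type": "worker", "index": 0},
}

# Must be set before MultiWorkerMirroredStrategy is constructed.
os.environ["TF_CONFIG"] = json.dumps(cluster)
```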
The TQDM progress bar never moves, and execution still appears to hang even with the progress bar code removed.
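For reference, the message in the title is the one TensorFlow's Range op emits when asked for a range whose start is already past its limit. A standalone repro of just the error message (not of the TFRS call path, which is still unclear at this point):

```python
import tensorflow as tf

# tf.range with start > limit and a positive delta raises
# InvalidArgumentError: Requires start <= limit when delta > 0: 5/0
tf.range(start=5, limit=0, delta=1)
```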
Things attempted:
- I was able to run the MNIST example with the same MultiWorkerMirroredStrategy approach.
- Tried compile() with Adagrad vs. Adam. Tried a small batch size such as 192 vs. a larger one such as 8192. No difference for my prototype (the setup is sketched right after this list).
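The training setup in question is roughly the following. This is a sketch only: build_my_tfrs_model() is a placeholder for the actual tfrs.Model subclass, train_ds stands in for the events dataset, the learning rate is illustrative, and the batch-size and epoch values are the ones mentioned in this report:

```python
import tensorflow as tf

# Collective strategy; picks up the 3-worker cluster from TF_CONFIG.
strategy = tf.distribute.MultiWorkerMirroredStrategy()

GLOBAL_BATCH_SIZE = 8192  # 192 was also tried, with the same result

with strategy.scope():
    # Placeholder for the real TFRS model used in the prototype.
    model = build_my_tfrs_model()
    model.compile(optimizer=tf.keras.optimizers.Adagrad(learning_rate=0.1))

model.fit(train_ds.batch(GLOBAL_BATCH_SIZE), epochs=3)
```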
Can I get any output out of TFRS at this point to tell what might be going wrong? Any other things to try?
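On getting more output: the knobs below are generic TensorFlow logging rather than anything TFRS-specific, but they at least show whether any train step completes. A hedged sketch:

```python
import os
import tensorflow as tf

# Show all C++-level log messages (0 = everything, 3 = errors only);
# best set in the environment before the process starts.
os.environ["TF_CPP_MIN_LOG_LEVEL"] = "0"

# Raise Python-side logger verbosity.
tf.get_logger().setLevel("DEBUG")

# Log which device every op is placed on (very chatty).
tf.debugging.set_log_device_placement(True)

# Print something at the end of every training batch.
batch_logger = tf.keras.callbacks.LambdaCallback(
    on_batch_end=lambda batch, logs: print("finished batch", batch, logs)
)
# model.fit(..., callbacks=[batch_logger], verbose=2)
```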
Output on the command line:
2021-04-12 21:44:02.417762: I tensorflow/stream_executor/platform/default/dso_loader.cc:49] Successfully opened dynamic library libcudart.so.11.0
2021-04-12 21:44:03.666561: I tensorflow/compiler/jit/xla_cpu_device.cc:41] Not creating XLA devices, tf_xla_enable_xla_devices not set
2021-04-12 21:44:03.667485: I tensorflow/stream_executor/platform/default/dso_loader.cc:49] Successfully opened dynamic library libcuda.so.1
2021-04-12 21:44:03.739716: E tensorflow/stream_executor/cuda/cuda_driver.cc:328] failed call to cuInit: CUDA_ERROR_NO_DEVICE: no CUDA-capable device is detected
2021-04-12 21:44:03.739767: I tensorflow/stream_executor/cuda/cuda_diagnostics.cc:156] kernel driver does not appear to be running on this host (ip-10-2-249-213.awsinternal.audiomack.com): /proc/driver/nvidia/version does not exist
2021-04-12 21:44:03.740798: I tensorflow/core/platform/cpu_feature_guard.cc:142] This TensorFlow binary is optimized with oneAPI Deep Neural Network Library (oneDNN) to use the following CPU instructions in performance-critical operations: AVX2 AVX512F FMA
To enable them in other operations, rebuild TensorFlow with the appropriate compiler flags.
2021-04-12 21:44:03.741015: I tensorflow/compiler/jit/xla_gpu_device.cc:99] Not creating XLA devices, tf_xla_enable_xla_devices not set
2021-04-12 21:44:03.741571: I tensorflow/compiler/jit/xla_gpu_device.cc:99] Not creating XLA devices, tf_xla_enable_xla_devices not set
2021-04-12 21:44:03.745589: I tensorflow/core/distributed_runtime/rpc/grpc_channel.cc:301] Initialize GrpcChannelCache for job worker -> {0 -> 10.2.249.213:2121, 1 -> 10.2.252.56:2121, 2 -> 10.2.252.97:2121}
2021-04-12 21:44:03.745902: I tensorflow/core/distributed_runtime/rpc/grpc_server_lib.cc:411] Started server with target: grpc://10.2.249.213:2121
>> 2021-04-12 16:44:07 : >> Running the prototype...
>> Initializing TfrsModelMaker...
>> items_path : s3://my-bucket/recsys-tf/temp-data/20210411173323/items
>> users_path : s3://my-bucket/recsys-tf/temp-data/20210411173323/users
>> events_path : s3://my-bucket/recsys-tf/temp-data/20210411173323/events
>> num_items : 100
>> num_users : 100
>> num_events : 100
2021-04-12 21:44:08.693713: W tensorflow_io/core/kernels/audio_video_mp3_kernels.cc:271] libmp3lame.so.0 or lame functions are not available
2021-04-12 21:44:08.693900: I tensorflow_io/core/kernels/cpu_check.cc:128] Your CPU supports instructions that this TensorFlow IO binary was not compiled to use: AVX2 AVX512F FMA
>> Strategy: <tensorflow.python.distribute.collective_all_reduce_strategy.CollectiveAllReduceStrategy object at 0x7f8ab9d67690>
>> 2021-04-12 16:44:09 : >> Training the model...
2021-04-12 21:44:09.449931: I tensorflow/compiler/mlir/mlir_graph_optimization_pass.cc:116] None of the MLIR optimization passes are enabled (registered 2)
2021-04-12 21:44:09.467194: I tensorflow/core/platform/profile_utils/cpu_utils.cc:112] CPU Frequency: 2499995000 Hz
0%| | 0/3 [00:00<?, ?epoch/s]
0.00batch [00:00, ?batch/s]
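One thing that seems worth checking with a dataset this small (100 items/users/events) spread across three workers is how tf.data shards the input, i.e. whether every worker actually receives data before the first step. A hedged sketch, assuming the events dataset is called train_ds:

```python
import tensorflow as tf

# Make the sharding policy explicit: DATA shards by element, so each
# worker gets a slice even when the input is a single file.
options = tf.data.Options()
options.experimental_distribute.auto_shard_policy = (
    tf.data.experimental.AutoShardPolicy.DATA
)
train_ds = train_ds.with_options(options)

# Quick sanity check on each worker: does at least one batch arrive?
for i, _ in enumerate(train_ds.take(1)):
    print("got a batch on this worker:", i)
```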

This does seem to be the same problem. I’m sorry you’re running into this, it does look like a pain.
Basically the same behavior as with the default strategy: it goes into training. Hasn't finished yet, but it doesn't look unhappy.