Pretraining of ALBERT from scratch is stuck
I am doing pre-training from scratch. It seems that training has started, since the GPUs are being used, but nothing shows up on the terminal except this:
***** Number of cores used : 4
I0227 09:00:31.841020 140137372948224 run_pretraining.py:226] Training using customized training loop TF 2.0 with distrubutedstrategy.
INFO:tensorflow:Reduce to /job:localhost/replica:0/task:0/device:CPU:0 then broadcast to ('/job:localhost/replica:0/task:0/device:CPU:0',).
I0227 09:00:44.563593 140137372948224 cross_device_ops.py:427] Reduce to /job:localhost/replica:0/task:0/device:CPU:0 then broadcast to ('/job:localhost/replica:0/task:0/device:CPU:0',).
INFO:tensorflow:Reduce to /job:localhost/replica:0/task:0/device:CPU:0 then broadcast to ('/job:localhost/replica:0/task:0/device:CPU:0',).
I0227 09:00:44.569019 140137372948224 cross_device_ops.py:427] Reduce to /job:localhost/replica:0/task:0/device:CPU:0 then broadcast to ('/job:localhost/replica:0/task:0/device:CPU:0',).
INFO:tensorflow:Reduce to /job:localhost/replica:0/task:0/device:CPU:0 then broadcast to ('/job:localhost/replica:0/task:0/device:CPU:0',).
I0227 09:00:45.620952 140137372948224 cross_device_ops.py:427] Reduce to /job:localhost/replica:0/task:0/device:CPU:0 then broadcast to ('/job:localhost/replica:0/task:0/device:CPU:0',).
INFO:tensorflow:Reduce to /job:localhost/replica:0/task:0/device:CPU:0 then broadcast to ('/job:localhost/replica:0/task:0/device:CPU:0',).
I0227 09:00:45.625989 140137372948224 cross_device_ops.py:427] Reduce to /job:localhost/replica:0/task:0/device:CPU:0 then broadcast to ('/job:localhost/replica:0/task:0/device:CPU:0',).
INFO:tensorflow:Reduce to /job:localhost/replica:0/task:0/device:CPU:0 then broadcast to ('/job:localhost/replica:0/task:0/device:CPU:0',).
I0227 09:00:46.679141 140137372948224 cross_device_ops.py:427] Reduce to /job:localhost/replica:0/task:0/device:CPU:0 then broadcast to ('/job:localhost/replica:0/task:0/device:CPU:0',).
INFO:tensorflow:Reduce to /job:localhost/replica:0/task:0/device:CPU:0 then broadcast to ('/job:localhost/replica:0/task:0/device:CPU:0',).
I0227 09:00:46.684157 140137372948224 cross_device_ops.py:427] Reduce to /job:localhost/replica:0/task:0/device:CPU:0 then broadcast to ('/job:localhost/replica:0/task:0/device:CPU:0',).
INFO:tensorflow:Reduce to /job:localhost/replica:0/task:0/device:CPU:0 then broadcast to ('/job:localhost/replica:0/task:0/device:CPU:0',).
I0227 09:00:47.734523 140137372948224 cross_device_ops.py:427] Reduce to /job:localhost/replica:0/task:0/device:CPU:0 then broadcast to ('/job:localhost/replica:0/task:0/device:CPU:0',).
INFO:tensorflow:Reduce to /job:localhost/replica:0/task:0/device:CPU:0 then broadcast to ('/job:localhost/replica:0/task:0/device:CPU:0',).
I0227 09:00:47.739573 140137372948224 cross_device_ops.py:427] Reduce to /job:localhost/replica:0/task:0/device:CPU:0 then broadcast to ('/job:localhost/replica:0/task:0/device:CPU:0',).
INFO:tensorflow:Reduce to /job:localhost/replica:0/task:0/device:CPU:0 then broadcast to ('/job:localhost/replica:0/task:0/device:CPU:0',).
I0227 09:00:57.697876 140137372948224 cross_device_ops.py:427] Reduce to /job:localhost/replica:0/task:0/device:CPU:0 then broadcast to ('/job:localhost/replica:0/task:0/device:CPU:0',).
INFO:tensorflow:Reduce to /job:localhost/replica:0/task:0/device:CPU:0 then broadcast to ('/job:localhost/replica:0/task:0/device:CPU:0',).
I0227 09:00:57.703157 140137372948224 cross_device_ops.py:427] Reduce to /job:localhost/replica:0/task:0/device:CPU:0 then broadcast to ('/job:localhost/replica:0/task:0/device:CPU:0',).
INFO:tensorflow:batch_all_reduce: 32 all-reduces with algorithm = nccl, num_packs = 1, agg_small_grads_max_bytes = 0 and agg_small_grads_max_group = 10
I0227 09:01:07.835676 140137372948224 cross_device_ops.py:748] batch_all_reduce: 32 all-reduces with algorithm = nccl, num_packs = 1, agg_small_grads_max_bytes = 0 and agg_small_grads_max_group = 10
INFO:tensorflow:batch_all_reduce: 32 all-reduces with algorithm = nccl, num_packs = 1, agg_small_grads_max_bytes = 0 and agg_small_grads_max_group = 10
I0227 09:01:28.672055 140137372948224 cross_device_ops.py:748] batch_all_reduce: 32 all-reduces with algorithm = nccl, num_packs = 1, agg_small_grads_max_bytes = 0 and agg_small_grads_max_group = 10
2020-02-27 09:01:50.162839: I tensorflow/stream_executor/platform/default/dso_loader.cc:44] Successfully opened dynamic library libcublas.so.10.0
I also tried with smaller text data, but got the same results. @kamalkraj
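For anyone hitting the same stall: a minimal sanity check, assuming TF 2.0+ with tf.distribute.MirroredStrategy (which is what the logs above suggest), is to confirm that TensorFlow actually sees the GPUs and that a trivial distributed training step completes. The model and shapes below are made up purely for the check and are not part of the original report.

```python
import tensorflow as tf

# TF 2.0-era API; newer TF also accepts tf.config.list_physical_devices.
print("Visible GPUs:", tf.config.experimental.list_physical_devices("GPU"))

strategy = tf.distribute.MirroredStrategy()
print("Replicas in sync:", strategy.num_replicas_in_sync)

# A throwaway model, just to exercise the distributed train step.
with strategy.scope():
    model = tf.keras.Sequential([tf.keras.layers.Dense(8, input_shape=(16,))])
    model.compile(optimizer="adam", loss="mse")

x = tf.random.normal([256, 16])
y = tf.random.normal([256, 8])

# If even this one-epoch fit hangs, the problem is in the strategy/NCCL
# setup rather than in the ALBERT input pipeline.
model.fit(x, y, batch_size=32, epochs=1)
```

If the toy fit also hangs at the NCCL all-reduce, it may be worth trying a non-NCCL reduction, e.g. tf.distribute.MirroredStrategy(cross_device_ops=tf.distribute.HierarchicalCopyAllReduce()), to see whether the hang is NCCL-specific. This is a diagnostic suggestion, not a confirmed fix for this issue.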
Read more >Top Related Medium Post
No results found
Top Related StackOverflow Question
No results found
Troubleshoot Live Code
Lightrun enables developers to add logs, metrics and snapshots to live code - no restarts or redeploys required.
Start FreeTop Related Reddit Thread
No results found
Top Related Hackernoon Post
No results found
Top Related Tweet
No results found
Top Related Dev.to Post
No results found
Top Related Hashnode Post
No results found
Top GitHub Comments
It does show minor GPU utilization: 1-2% on each of 7-8 GPUs, and 25% on a single GPU for a very short moment. However, the GPU memory is fully occupied. It eventually halted with a resource exhausted error.
I have tried with a 313 MB tf_record file; it works on CPU only.
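A note on the resource exhausted error mentioned above: a common mitigation (not a confirmed fix for this issue) is to enable GPU memory growth before any op touches the GPUs, and to lower the per-GPU batch size or maximum sequence length passed to run_pretraining.py. The snippet below is a sketch using standard TF 2.0 APIs; the exact flag names for batch size and sequence length depend on the script's argument list, so check them locally.

```python
import tensorflow as tf

# Ask TF to allocate GPU memory on demand instead of reserving it all up
# front. This must run before any op initializes the GPUs, e.g. at the top
# of run_pretraining.py. It does not raise the memory limit, but it often
# makes the real cause of a ResourceExhaustedError (batch too large) clearer.
for gpu in tf.config.experimental.list_physical_devices("GPU"):
    tf.config.experimental.set_memory_growth(gpu, True)
```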