DeepSpeed's stdout is not accurate for the number of steps; it doesn't take batch size into account. Also, `steps_per_print` doesn't print out the loss.
I am plugging my dataset into `deepspeed.initialize`:
```python
deepspeed_config = {
    # "train_batch_size": 12,
    "train_micro_batch_size_per_gpu": 17,
    "gradient_accumulation_steps": 64,
    ...
    "steps_per_print": 30,
    "output_file": "/home/santosh/output_files/"
}
```
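For reference on how these values combine: DeepSpeed defines the effective train batch size as the product of the micro-batch size, the gradient accumulation steps, and the number of GPUs (this relation is documented DeepSpeed behavior; the numbers below are just my setup):

```python
# DeepSpeed's batch-size relation:
#   train_batch_size = train_micro_batch_size_per_gpu * gradient_accumulation_steps * num_gpus
num_gpus = 4
effective_train_batch_size = 17 * 64 * num_gpus  # = 4352 samples per optimizer step
```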
```python
model_engine, optimizer, ds_training_dataloader, lr_scheduler = deepspeed.initialize(
    model=model,
    model_parameters=optimizer_grouped_parameters,
    lr_scheduler=scheduler,
    # training_data=DeepSpeedTrainDataset(),
    collate_fn=collate_ds,
    config_params=deepspeed_config)
```
```python
from toolz import valmap  # applies a function to every value of a dict
from tqdm import tqdm

def gpu_map(inn):
    # Move a tensor to this process's GPU
    return inn.to(model_engine.local_rank)

num_epochs = 3
for ii in range(num_epochs):
    for iii, model_batch in enumerate(tqdm(ds_training_dataloader)):
        model_batch = valmap(gpu_map, model_batch)
        loss = model_engine(model_batch)
        model_engine.backward(loss)
        model_engine.step()
```
I am using 4 GPUs and my total dataset size is 26516300, so there should be 26516300 / 4 = 6629075 samples per GPU.
And my batch size is 17, so there should be 6629075 / 17 ≈ 389945.59 steps per epoch.
However, the output shows something like:

```
 1%|▉ | 37529/6629075 [7:00:35<1245:24:02, 1.47it/s]
[2021-06-12 05:55:39,929] [INFO] [timer.py:154:stop] 0/37530, SamplesPerSec=101.18682135866746
 1%|▉ | 37560/6629075 [7:00:57<1249:32:33, 1.47it/s][2021-06-12 05:56:00,892] [INFO] [timer.py:154:stop] 0/37560, SamplesPerSec=101.18363612937223
 1%|▉ | 37590/6629075 [7:01:18<1244:21:01, 1.47it/s]
[2021-06-12 05:56:21,353] [INFO] [timer.py:154:stop] 0/37590, SamplesPerSec=101.1824693564067
 1%|▉ | 37600/6629075 [7:01:24<1260:42:39, 1.45it/s]
```
No matter what I change the batch size to, the progress bar always shows the step count out of 6629075.
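As a quick sanity check, here is the arithmetic for what the tqdm total should be if the dataloader counted batches rather than samples (plain Python, nothing DeepSpeed-specific):

```python
samples_total = 26516300
num_gpus = 4
micro_batch = 17

samples_per_gpu = samples_total // num_gpus      # 6629075 -- the total tqdm actually shows
steps_per_epoch = samples_per_gpu / micro_batch  # ~389945.59 -- the total I expected
```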
Also, since I specified `"steps_per_print": 30`, I think there should be a loss printout every 30 steps, but all that's printed out is:

```
[7:00:57<1249:32:33, 1.47it/s][2021-06-12 05:56:00,892] [INFO] [timer.py:154:stop] 0/37560, SamplesPerSec=101.18363612937223
```
which doesn't show the loss. I did a run where I printed out the loss directly, just to make sure the loss wasn't NaN or anything like that.
A workaround is printing out the loss directly, but this creates a new line in the output. Also, whatever is printed out by `steps_per_print` creates a new line too. This is a nit, but I would prefer to keep a single line with the step and the running loss.
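A sketch of one possible single-line workaround using tqdm's `set_postfix` (standard tqdm API, not a DeepSpeed feature; the running-loss smoothing is just illustrative):

```python
running_loss = None
progress = tqdm(ds_training_dataloader)
for iii, model_batch in enumerate(progress):
    model_batch = valmap(gpu_map, model_batch)
    loss = model_engine(model_batch)
    model_engine.backward(loss)
    model_engine.step()
    # Exponential moving average of the loss, displayed in-place on the tqdm bar
    running_loss = loss.item() if running_loss is None else 0.99 * running_loss + 0.01 * loss.item()
    progress.set_postfix(step=iii, loss=f"{running_loss:.4f}")
```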
Top GitHub Comments
@Santosh-Gupta, the docs bug about `steps_per_print` is now fixed. Can this issue be closed now? As you explore the dataloader usage, please feel free to open an issue as appropriate.

It seems that `DistributedSampler` is the way to go, specifying the rank, num_replicas, and the dataset on initialization. That's how it was handled in this example:
https://github.com/microsoft/DeepSpeedExamples/blob/25d73cf73fb3dc66faefa141b7319526555be9fc/MoQ/huggingface-transformers/src/transformers/sagemaker/trainer_sm.py#L92
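A minimal sketch of that pattern (plain PyTorch; `dataset` and `collate_ds` are the names from my code above, everything else is standard `torch.utils.data` API):

```python
import torch
from torch.utils.data import DataLoader
from torch.utils.data.distributed import DistributedSampler

# Each rank draws a disjoint 1/num_replicas slice of the dataset
sampler = DistributedSampler(dataset,
                             num_replicas=torch.distributed.get_world_size(),
                             rank=torch.distributed.get_rank())

loader = DataLoader(dataset,
                    batch_size=17,
                    sampler=sampler,
                    collate_fn=collate_ds)
```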
However, in this example, only the dataset is provided
https://github.com/microsoft/DeepSpeedExamples/blob/master/BingBertSquad/nvidia_run_squad_deepspeed.py#L984
Is the data for the SQuAD example being partitioned among the GPUs, or is each GPU getting the entire dataset?
EDIT:
It looks like the DeepSpeed dataloader sets this up exactly the way I was planning to use it, so this looks like a PyTorch issue and not a DeepSpeed issue.
https://github.com/microsoft/DeepSpeed/blob/c5700bc09f6ca9ed0a59009d34af7c0d4b9f16b7/deepspeed/runtime/dataloader.py#L50