
DeepSpeed's stdout reports an inaccurate number of steps because it does not take the batch size into account. Also, `steps_per_print` does not print the loss.

See original GitHub issue

I am plugging my dataset into deepspeed.initialize

    deepspeed_config = {
        # "train_batch_size": 12,
        "train_micro_batch_size_per_gpu": 17,
        "gradient_accumulation_steps": 64,
        ...
        ...
        ...
        "steps_per_print": 30,
        "output_file": "/home/santosh/output_files/"
    }

    model_engine, optimizer, ds_training_dataloader, lr_scheduler = deepspeed.initialize(
        model=model,
        model_parameters=optimizer_grouped_parameters,
        lr_scheduler=scheduler,
        # training_data=DeepSpeedTrainDataset(),
        collate_fn=collate_ds,
        config_params=deepspeed_config)

    def gpu_map(inn):
        # move a tensor to this process's GPU
        return inn.to(model_engine.local_rank)

    num_epochs = 3
    for ii in range(num_epochs):
        for iii, model_batch in enumerate(tqdm(ds_training_dataloader)):
            # valmap (from toolz) applies gpu_map to every value in the batch dict
            model_batch = valmap(gpu_map, model_batch)
            loss = model_engine(model_batch)  # the model returns the loss directly
            model_engine.backward(loss)
            model_engine.step()

I am using 4 GPUs and my total dataset size is 26516300, so there are 26516300 / 4 = 6629075 samples per GPU.

And my batch size is 17, so there should be 6629075 / 17 ≈ 389946 steps.
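
For reference, here is the arithmetic as a quick sanity check in plain Python, including the effective batch size, which, as I understand from the DeepSpeed config docs, is micro-batch × gradient accumulation × number of GPUs (the constants come from my config above):

    import math

    dataset_size = 26516300
    num_gpus = 4
    micro_batch = 17       # train_micro_batch_size_per_gpu
    grad_accum = 64        # gradient_accumulation_steps

    samples_per_gpu = dataset_size // num_gpus                 # 6629075
    steps_per_gpu = math.ceil(samples_per_gpu / micro_batch)   # 389946, what I expect tqdm to count
    effective_batch = micro_batch * grad_accum * num_gpus      # 4352 samples per optimizer update

    print(samples_per_gpu, steps_per_gpu, effective_batch)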

However, the output shows something like

  1%|▉ | 37529/6629075 [7:00:35<1245:24:02,  1.47it/s]
[2021-06-12 05:55:39,929] [INFO] [timer.py:154:stop] 0/37530, SamplesPerSec=101.18682135866746
  1%|▉ | 37560/6629075 [7:00:57<1249:32:33,  1.47it/s]
[2021-06-12 05:56:00,892] [INFO] [timer.py:154:stop] 0/37560, SamplesPerSec=101.18363612937223
  1%|▉ | 37590/6629075 [7:01:18<1244:21:01,  1.47it/s]
[2021-06-12 05:56:21,353] [INFO] [timer.py:154:stop] 0/37590, SamplesPerSec=101.1824693564067
  1%|▉ | 37600/6629075 [7:01:24<1260:42:39,  1.45it/s]

No matter what I change the batch size to, it always shows the steps out of 6629075.

Also, since I specified "steps_per_print": 30, I think there should be a loss printout every 30 steps, but all that’s printed out is

[7:00:57<1249:32:33, 1.47it/s][2021-06-12 05:56:00,892] [INFO] [timer.py:154:stop] 0/37560, SamplesPerSec=101.18363612937223

which doesn’t show the loss. I did a run where I printed the loss directly, just to make sure it wasn’t NaN or anything like that.

A workaround is printing the loss directly, but this creates a new line in the output. Whatever steps_per_print prints also creates a new line. This is a nit, but I would prefer to keep a single line with the step and running loss, something like the sketch below.
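
For example, this is roughly what I mean, using tqdm's set_postfix (just a sketch; it assumes loss is a scalar tensor and reuses the names from my loop above):

    progress_bar = tqdm(ds_training_dataloader)
    for iii, model_batch in enumerate(progress_bar):
        model_batch = valmap(gpu_map, model_batch)
        loss = model_engine(model_batch)
        model_engine.backward(loss)
        model_engine.step()
        if iii % 30 == 0:
            # keeps the running loss on the same line as the progress bar
            progress_bar.set_postfix(loss=loss.item())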

Issue Analytics

  • State: closed
  • Created: 2 years ago
  • Comments: 11 (7 by maintainers)

Top GitHub Comments

1 reaction
tjruwase commented, Jun 16, 2021

@Santosh-Gupta, the docs bug about steps_per_print is now fixed. Can this issue be closed now? As you explore the dataloader usage, please feel free to open an issue as appropriate.

0 reactions
Santosh-Gupta commented, Jun 16, 2021

Great, I am thinking through all the requirements needed for this. I think the data needs to be partitioned across the GPUs, and then model_engine.rank can be used to decide which partition each process receives. Finally, the data would need to be sent with .to('cuda:' + str(model_engine.rank)). Does this sound right, or am I missing something?
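
Roughly what I was picturing, just as a sketch (shard_for_rank and train_dataset are made-up names, not an existing API, and I'm using model_engine.local_rank as in my loop above):

    rank = model_engine.local_rank
    world_size = 4  # number of GPUs

    def shard_for_rank(dataset, rank, world_size):
        # made-up helper: each process keeps every world_size-th sample
        return [dataset[i] for i in range(rank, len(dataset), world_size)]

    my_shard = shard_for_rank(train_dataset, rank, world_size)
    # each batch built from my_shard would then be moved with .to('cuda:' + str(rank))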

The torch dataloader already handles many of these issues, especially the multi-GPU partitioning. You might find this link useful.

It seems that DistributedSampler is the way to go, specifying rank, num_replicas, and the dataset on initialization.
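
A minimal sketch of what I mean (train_dataset is a placeholder for my dataset; the batch size and collate_fn are from my config above):

    import torch
    from torch.utils.data import DataLoader
    from torch.utils.data.distributed import DistributedSampler

    # assumes torch.distributed is already initialized (deepspeed.initialize does this)
    sampler = DistributedSampler(train_dataset,
                                 num_replicas=torch.distributed.get_world_size(),
                                 rank=torch.distributed.get_rank())
    ds_training_dataloader = DataLoader(train_dataset,
                                        batch_size=17,  # train_micro_batch_size_per_gpu
                                        sampler=sampler,
                                        collate_fn=collate_ds)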

That’s how it was handled in this example

https://github.com/microsoft/DeepSpeedExamples/blob/25d73cf73fb3dc66faefa141b7319526555be9fc/MoQ/huggingface-transformers/src/transformers/sagemaker/trainer_sm.py#L92

However, in this example, only the dataset is provided

https://github.com/microsoft/DeepSpeedExamples/blob/master/BingBertSquad/nvidia_run_squad_deepspeed.py#L984

Is the data for the squad example being partitioned among the GPUs, or is each GPU getting the entire dataset?

EDIT:

It looks like the DeepSpeed dataloader sets it up exactly the way I was planning to use it, so this looks like a PyTorch issue and not a DeepSpeed issue.

https://github.com/microsoft/DeepSpeed/blob/c5700bc09f6ca9ed0a59009d34af7c0d4b9f16b7/deepspeed/runtime/dataloader.py#L50
