
DeepSpeed's stdout reports an inaccurate number of steps because it does not take the batch size into account. Also, `steps_per_print` does not print the loss.

See original GitHub issue

I am plugging my dataset into deepspeed.initialize

    deepspeed_config = {
        # "train_batch_size": 12,
        "train_micro_batch_size_per_gpu": 17,
        "gradient_accumulation_steps": 64,
        ...
        ...
        ...
        "steps_per_print": 30,
        "output_file": "/home/santosh/output_files/"
    }

    model_engine, optimizer, ds_training_dataloader, lr_scheduler = deepspeed.initialize(
        model=model,
        model_parameters=optimizer_grouped_parameters,
        lr_scheduler=scheduler,
        # training_data=DeepSpeedTrainDataset(),
        collate_fn=collate_ds,
        config_params=deepspeed_config)

    def gpu_map(inn):
        # move a tensor to this process's GPU
        return inn.to(model_engine.local_rank)

    num_epochs = 3
    for ii in range(num_epochs):
        for iii, model_batch in enumerate(tqdm(ds_training_dataloader)):
            # valmap (from toolz) applies gpu_map to every value in the batch dict
            model_batch = valmap(gpu_map, model_batch)
            loss = model_engine(model_batch)  # the model returns the loss directly
            model_engine.backward(loss)
            model_engine.step()

I am using 4 GPUs and my total dataset size is 26516300, so there are 26516300 / 4 = 6629075 samples per GPU.

And my batch size is 17, so there should be 6629075 / 17 ≈ 389946 steps.
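
For reference, here is the arithmetic as a quick sanity check in plain Python, including the effective batch size, which, as I understand from the DeepSpeed config docs, is micro-batch × gradient accumulation × number of GPUs (the constants come from my config above):

    import math

    dataset_size = 26516300
    num_gpus = 4
    micro_batch = 17       # train_micro_batch_size_per_gpu
    grad_accum = 64        # gradient_accumulation_steps

    samples_per_gpu = dataset_size // num_gpus                 # 6629075
    steps_per_gpu = math.ceil(samples_per_gpu / micro_batch)   # 389946, what I expect tqdm to count
    effective_batch = micro_batch * grad_accum * num_gpus      # 4352 samples per optimizer update

    print(samples_per_gpu, steps_per_gpu, effective_batch)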

However, the output shows something like

  1%|▉ | 37529/6629075 [7:00:35<1245:24:02,  1.47it/s]
[2021-06-12 05:55:39,929] [INFO] [timer.py:154:stop] 0/37530, SamplesPerSec=101.18682135866746
  1%|▉ | 37560/6629075 [7:00:57<1249:32:33,  1.47it/s]
[2021-06-12 05:56:00,892] [INFO] [timer.py:154:stop] 0/37560, SamplesPerSec=101.18363612937223
  1%|▉ | 37590/6629075 [7:01:18<1244:21:01,  1.47it/s]
[2021-06-12 05:56:21,353] [INFO] [timer.py:154:stop] 0/37590, SamplesPerSec=101.1824693564067
  1%|▉ | 37600/6629075 [7:01:24<1260:42:39,  1.45it/s]

No matter what I change the batch size to, it always shows the steps out of 6629075.

Also, since I specified "steps_per_print": 30, I think there should be a loss printout every 30 steps, but all that’s printed out is

[7:00:57<1249:32:33, 1.47it/s][2021-06-12 05:56:00,892] [INFO] [timer.py:154:stop] 0/37560, SamplesPerSec=101.18363612937223

which doesn’t show the loss. I did a run where I printed the loss directly, just to make sure it wasn’t NaN or anything like that.

A workaround is printing the loss directly, but this creates a new line in the output. Whatever steps_per_print prints also creates a new line. This is a nit, but I would prefer to keep a single line with the step and running loss, something like the sketch below.
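
For example, this is roughly what I mean, using tqdm's set_postfix (just a sketch; it assumes loss is a scalar tensor and reuses the names from my loop above):

    progress_bar = tqdm(ds_training_dataloader)
    for iii, model_batch in enumerate(progress_bar):
        model_batch = valmap(gpu_map, model_batch)
        loss = model_engine(model_batch)
        model_engine.backward(loss)
        model_engine.step()
        if iii % 30 == 0:
            # keeps the running loss on the same line as the progress bar
            progress_bar.set_postfix(loss=loss.item())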

Issue Analytics

  • State: closed
  • Created: 2 years ago
  • Comments: 11 (7 by maintainers)

Top GitHub Comments

1 reaction
tjruwase commented, Jun 16, 2021

@Santosh-Gupta, the docs bug about steps_per_print is now fixed. Can this issue be closed now? As you explore the dataloader usage, please feel free to open an issue as appropriate.

0 reactions
Santosh-Gupta commented, Jun 16, 2021

Great, I am thinking through all the requirements needed for this. I think the data needs to be partitioned across the GPUs, and then model_engine.rank can be used to decide which partition each process receives. Finally, the data would need to be sent with .to('cuda:' + str(model_engine.rank)). Does this sound right, or am I missing something?
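
Roughly what I was picturing, just as a sketch (shard_for_rank and train_dataset are made-up names, not an existing API, and I'm using model_engine.local_rank as in my loop above):

    rank = model_engine.local_rank
    world_size = 4  # number of GPUs

    def shard_for_rank(dataset, rank, world_size):
        # made-up helper: each process keeps every world_size-th sample
        return [dataset[i] for i in range(rank, len(dataset), world_size)]

    my_shard = shard_for_rank(train_dataset, rank, world_size)
    # each batch built from my_shard would then be moved with .to('cuda:' + str(rank))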

The torch dataloader already handles many of these issues, especially the multi-GPU partitioning. You might find this link useful.

It seems that DistributedSampler is the way to go, specifying rank, num_replicas, and the dataset on initialization.
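
A minimal sketch of what I mean (train_dataset is a placeholder for my dataset; the batch size and collate_fn are from my config above):

    import torch
    from torch.utils.data import DataLoader
    from torch.utils.data.distributed import DistributedSampler

    # assumes torch.distributed is already initialized (deepspeed.initialize does this)
    sampler = DistributedSampler(train_dataset,
                                 num_replicas=torch.distributed.get_world_size(),
                                 rank=torch.distributed.get_rank())
    ds_training_dataloader = DataLoader(train_dataset,
                                        batch_size=17,  # train_micro_batch_size_per_gpu
                                        sampler=sampler,
                                        collate_fn=collate_ds)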

That’s how it was handled in this example

https://github.com/microsoft/DeepSpeedExamples/blob/25d73cf73fb3dc66faefa141b7319526555be9fc/MoQ/huggingface-transformers/src/transformers/sagemaker/trainer_sm.py#L92

However, in this example, only the dataset is provided

https://github.com/microsoft/DeepSpeedExamples/blob/master/BingBertSquad/nvidia_run_squad_deepspeed.py#L984

Is the data for the squad example being partitioned among the GPUs, or is each GPU getting the entire dataset?

EDIT:

It looks like the DeepSpeed dataloader sets it up exactly the way I was planning to use it, so this looks like a PyTorch issue and not a DeepSpeed issue.

https://github.com/microsoft/DeepSpeed/blob/c5700bc09f6ca9ed0a59009d34af7c0d4b9f16b7/deepspeed/runtime/dataloader.py#L50
