
[BigScience176B] Model conversion from Megatron-LM to transformers

See original GitHub issue

Feature request

Creating here a thread for the conversion of the BigScience-176B model from Megatron-LM to the transformers library. I will summarize what I have done so far, the current status of the conversion procedure, the ‘discoveries’ I have made along the way, and the small details we have to take care of during the conversion. I did my work by forking @thomwolf's fork. The tests have been done on the DGX machine (4 NVIDIA A100).

🌸 Big picture

  • Generate some samples with a recent checkpoint
  • Testing the exactness of the logits / hidden state values obtained for the same input from the Megatron-LM model and the converted model. We use a small GPT2 trained on a dummy dataset (2 sentences); this model has been pushed to the Hub and is used for integration tests (a minimal comparison sketch follows this list).
  • Apply these tests to a recent checkpoint of the 176B model to confirm the robustness of the tests
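For illustration, here is a minimal sketch of that kind of logits comparison, assuming the converted checkpoint lives under a hypothetical local path and that the Megatron-LM side has already dumped its logits for the same input to a file (both names are placeholders); the 1e-06 tolerance echoes the CPU results reported under the current status below.

```python
# Sketch of a logits comparison between the converted transformers model and a
# reference dump produced on the Megatron-LM side for the same input.
# The paths and the dump format are placeholders, not the actual test fixtures.
import torch
from transformers import AutoModelForCausalLM, AutoTokenizer

CONVERTED_DIR = "./converted-checkpoint"   # hypothetical local path of the converted model
MEGATRON_DUMP = "./megatron_logits.pt"     # hypothetical logits dump from the Megatron-LM env

tokenizer = AutoTokenizer.from_pretrained(CONVERTED_DIR)
model = AutoModelForCausalLM.from_pretrained(CONVERTED_DIR).eval()

inputs = tokenizer("Hello my name is", return_tensors="pt")
with torch.no_grad():
    logits = model(**inputs).logits

reference = torch.load(MEGATRON_DUMP)  # tensor saved from the Megatron-LM forward pass

# Exact equality is only expected on the same GPU setup; on CPU a small
# tolerance is needed (1e-06 on the final logits, per the status notes below).
torch.testing.assert_close(logits, reference, rtol=0.0, atol=1e-6)
```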

📎 Main links:

🔨 Current status

  • For now, all tests pass on the DGX’s GPU (using different conda environments for the Megatron-LM model & the transformers model) with assertEqual.
  • The tests do not pass with assertEqual when running them on the CPU, but they pass with assertAlmostEqual, using a tolerance of 0.05 for the logits after the LayerNorm on the embedding layer and a tolerance of 1e-06 on the final logits. Check the tests here. This non-exactness seems to be expected and we cannot do much about it, according to pytorch/pytorch#76052.
  • Added simple reconstruction and encoding tests for the BigScience tokenizer (sketched below)
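A sketch of what those reconstruction and encoding checks can look like, assuming the converted tokenizer is available under a hypothetical local path; the actual tests live in the linked PR.

```python
# Sketch of simple tokenizer reconstruction (round-trip) and encoding checks.
# The checkpoint path and the sample sentences are placeholders.
from transformers import AutoTokenizer

tokenizer = AutoTokenizer.from_pretrained("./converted-checkpoint")  # hypothetical path

samples = [
    "Hello my name is",
    "The BigScience workshop is training a 176B parameter model.",
]

for text in samples:
    ids = tokenizer.encode(text)
    # Reconstruction: decoding should give the original string back
    # (assuming the tokenizer does not normalize the text away).
    assert tokenizer.decode(ids, skip_special_tokens=True) == text
    # Encoding: __call__ and encode should agree on the produced ids.
    assert tokenizer(text)["input_ids"] == ids
```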

📌 Tips for conversion

  • Explicitly specifying the dtype of your modules when initializing them seems to help ensure exact reproducibility - a dtype argument was added to the config file (see the sketch after this list)
  • Concatenating row-parallelized weight shards seems to return different results; I made a reproducible script and raised pytorch/pytorch#76232. The workaround for now is to manually aggregate the results across each TP rank (also sketched after this list). This needs further investigation for a possible improvement of the conversion.
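The toy sketch below illustrates both tips: the dtype is set explicitly at tensor creation, and the row-parallel merge is done once by concatenating the shards and once by summing per-rank partial outputs, mimicking Megatron-LM's matmul-then-all-reduce. Rank count, shapes and dtype are illustrative assumptions, not the real model dimensions.

```python
# Toy illustration of merging a row-parallel linear layer from TP-sharded
# Megatron-LM weights. Rank count, shapes and dtype are illustrative assumptions.
import torch

torch.manual_seed(0)
tp_ranks, in_features, out_features = 4, 1024, 1024
dtype = torch.bfloat16  # being explicit about the dtype helps reproducibility

# One weight shard per TP rank; row-parallel layers split the *input* dimension.
shards = [torch.randn(out_features, in_features // tp_ranks, dtype=dtype) for _ in range(tp_ranks)]
x = torch.randn(1, in_features, dtype=dtype)
x_chunks = x.chunk(tp_ranks, dim=-1)

# Option 1: concatenate the shards and do a single matmul (naive conversion).
merged_weight = torch.cat(shards, dim=1)
out_concat = x @ merged_weight.t()

# Option 2: compute each rank's partial output and sum them, mimicking
# Megatron-LM's per-rank matmul followed by an all-reduce.
out_reduce = sum(chunk @ shard.t() for chunk, shard in zip(x_chunks, shards))

# Mathematically equal, but the floating-point summation order differs,
# which is why the two can disagree (see pytorch/pytorch#76232).
print((out_concat - out_reduce).abs().max())
```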

✅ Next steps

  • Fix integration tests on the PR thomwolf/transformers#2
  • Define which checkpoint to use for the next tests
  • Convert the model with the selected checkpoints and compare the hidden state values between the two models -> fixed some issues in this new commit a4fa70c1a5042fdca7d0fbf26b0aad6ca99fdadc
  • MixedFusedLayerNorm and FusedScaledSoftmax seem to be replaceable by LayerNorm and Softmax from torch.nn, respectively. Verify this assumption on the new checkpoints (a verification sketch follows this list).
  • Convert a sharded version of the large model and try the tests on that
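A sketch of how the LayerNorm part of that assumption could be verified numerically, assuming a CUDA device and an apex build that exposes MixedFusedLayerNorm under the import path shown (the import location is an assumption); a similar comparison applies to FusedScaledSoftmax versus torch.nn.Softmax.

```python
# Sketch of a numerical check that torch.nn.LayerNorm can replace Megatron's
# MixedFusedLayerNorm. Assumes a CUDA device and an apex build exposing
# MixedFusedLayerNorm; sizes and tolerances are illustrative.
import torch
from apex.normalization import MixedFusedLayerNorm  # assumed import location

hidden_size = 1024
x = torch.randn(2, 8, hidden_size, device="cuda")

fused_ln = MixedFusedLayerNorm(hidden_size).cuda()
plain_ln = torch.nn.LayerNorm(hidden_size).cuda()
# Copy the affine parameters (trained ones in practice) so both modules
# compute the same transformation.
plain_ln.load_state_dict(fused_ln.state_dict())

with torch.no_grad():
    torch.testing.assert_close(plain_ln(x), fused_ln(x), rtol=1e-5, atol=1e-5)
```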

cc @thomwolf @suzana-ilic @thomasw21 @stas00

Motivation

The feature request is related to the BigScience workshop, where a large Language Model is currently being trained using Megatron-LM.

Your contribution

Ultimately, submitting a PR to add the BigScience-176B model to the transformers library, ensuring the exactness of the operations between the converted model and the original model trained with Megatron-LM.

Issue Analytics

  • State: closed
  • Created: a year ago
  • Reactions: 3
  • Comments: 6 (5 by maintainers)

Top GitHub Comments

3 reactions
sgugger commented, Apr 25, 2022

As a side note, I’m working on a solution to do model parallelism/offload while maximizing the GPU(s) memory/RAM available which should be useful to run this model on all kinds of setups (albeit more slowly). Should land in Accelerate in the coming weeks 😃

1 reaction
thomasw21 commented, Apr 25, 2022

Also I think the big picture is we want to generate ASAP. (perhaps even before checking exactitude of conversion :S)
