[BigScience176B] Model conversion from Megatron-LM to transformers
Feature request
Creating a thread here for the conversion of the BigScience-176B model from Megatron-LM to the transformers library. I will summarize what I have done so far, the current status of the conversion procedure, and the ‘discoveries’ I have made along the way, including the small details we need to watch out for during conversion. My work is based on a fork of @thomwolf's fork, and the tests have been run on the DGX machine (4 NVIDIA A100 GPUs).
🌸 Big picture
- Generate some samples with a recent checkpoint
- Test that the logits / hidden states produced for the same input match exactly between the Megatron-LM model and the converted model (see the comparison sketch after this list). We use a small GPT-2 trained on a dummy dataset (2 sentences); this model has been pushed to the Hub and is used for integration tests.
- Apply these tests to a recent checkpoint of the 176B model to verify their robustness
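As an illustration, here is a minimal sketch of the kind of comparison we run on the transformers side, assuming the Megatron-LM logits and hidden states for the same input have been dumped to disk beforehand (the paths, file names, and dictionary layout are placeholders, not the real test code):

```python
import torch
from transformers import AutoModelForCausalLM, AutoTokenizer

# Placeholder paths: the converted checkpoint and tensors dumped from the Megatron-LM run.
converted = AutoModelForCausalLM.from_pretrained("path/to/converted-checkpoint").eval()
tokenizer = AutoTokenizer.from_pretrained("path/to/converted-checkpoint")
reference = torch.load("path/to/megatron_outputs.pt")  # {"logits": ..., "hidden_states": [...]}

inputs = tokenizer("Hello my name is", return_tensors="pt")

with torch.no_grad():
    outputs = converted(**inputs, output_hidden_states=True)

# Exact match is what we target on the GPU setup; see the CPU tolerances below.
assert torch.equal(outputs.logits, reference["logits"])
for ours, theirs in zip(outputs.hidden_states, reference["hidden_states"]):
    assert torch.equal(ours, theirs)
```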
📎 Main links:
- First PR: thomwolf/transformers#1
- WIP PR: thomwolf/transformers#2
- Final PR: #16514
- The small debug GPT-2 model
🔨 Current status
- For now, all tests pass with `assertEqual` on the DGX's GPUs (using separate conda environments for the Megatron-LM model and the transformers model).
- On the CPU, the tests do not pass with `assertEqual`, but they do pass with `assertAlmostEqual`, using a tolerance of 0.05 for the logits after the `LayerNorm` on the embedding layer and a tolerance of 1e-06 on the final logits (see the tolerance sketch below). Check the tests here. This non-exactness seems to be expected and there is not much we can do about it, according to pytorch/pytorch#76052.
- Added simple reconstruction and encoding tests for the BigScience tokenizer.
📌 Tips for conversion
- Explicitly specifying the dtype of your modules when initializing them seems to help ensure exact reproducibility - added a `dtype` argument to the config file.
- Concatenating the weights from row-parallelized layers seems to return different results from the original model; I made a reproducible script and raised an issue, pytorch/pytorch#76232. The solution for now is to manually aggregate the results across each TP rank (see the sketch after this list). Needs further investigation for a possible improvement of the conversion.
✅ Next steps
- Fix integration tests on the PR thomwolf/transformers#2
- Define which checkpoint to use for the next tests
- Convert the model with the selected checkpoints and compare the hidden-state values between the two models -> fixed some issues in this new commit a4fa70c1a5042fdca7d0fbf26b0aad6ca99fdadc
- `MixedFusedLayerNorm` and `FusedScaledSoftmax` seem to be replaceable by `LayerNorm` and `Softmax` from `torch.nn`, respectively. Verify this assumption on the new checkpoints (see the sketch after this list).
- Convert a sharded version of the large model and try the tests on that
cc @thomwolf @suzana-ilic @thomasw21 @stas00
Motivation
The feature request is related to the BigScience workshop, where a large Language Model is currently being trained using Megatron-LM.
Your contribution
Ultimately, submitting a PR to add the BigScience-176B model to the transformers library, while ensuring the exactness of the operations between the converted model and the original model trained with Megatron-LM.
As a side note, I’m working on a solution to do model parallelism/offload while maximizing the GPU(s) memory/RAM available which should be useful to run this model on all kinds of setups (albeit more slowly). Should land in Accelerate in the coming weeks 😃
Also I think the big picture is we want to generate ASAP. (perhaps even before checking exactitude of conversion :S)