[BigScience176B] Model conversion from Megatron-LM to transformers
Feature request
Creating a thread here for the conversion of the BigScience-176B model from Megatron-LM to the transformers library. I will summarize what I have done so far, the current status of the conversion procedure, and the ‘discoveries’ I have made along the way, including the small details we need to watch out for during conversion. My work is based on a fork of @thomwolf's fork, and the tests have been run on the DGX machine (4 NVIDIA A100 GPUs).
🌸 Big picture
- Generate some samples with a recent checkpoint
- Test that the logits / hidden states produced for the same input match exactly between the Megatron-LM model and the converted model (see the comparison sketch after this list). We use a small GPT-2 trained on a dummy dataset (2 sentences); this model has been pushed to the Hub and is used for integration tests.
- Apply these tests to a recent checkpoint of the 176B model to verify their robustness
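As an illustration, here is a minimal sketch of the kind of comparison we run on the transformers side, assuming the Megatron-LM logits and hidden states for the same input have been dumped to disk beforehand (the paths, file names, and dictionary layout are placeholders, not the real test code):

```python
import torch
from transformers import AutoModelForCausalLM, AutoTokenizer

# Placeholder paths: the converted checkpoint and tensors dumped from the Megatron-LM run.
converted = AutoModelForCausalLM.from_pretrained("path/to/converted-checkpoint").eval()
tokenizer = AutoTokenizer.from_pretrained("path/to/converted-checkpoint")
reference = torch.load("path/to/megatron_outputs.pt")  # {"logits": ..., "hidden_states": [...]}

inputs = tokenizer("Hello my name is", return_tensors="pt")

with torch.no_grad():
    outputs = converted(**inputs, output_hidden_states=True)

# Exact match is what we target on the GPU setup; see the CPU tolerances below.
assert torch.equal(outputs.logits, reference["logits"])
for ours, theirs in zip(outputs.hidden_states, reference["hidden_states"]):
    assert torch.equal(ours, theirs)
```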
📎 Main links:
- First PR: thomwolf/transformers#1
- WIP PR: thomwolf/transformers#2
- Final PR: #16514
- The small debug GPT-2 model
🔨 Current status
- For now, all tests pass with `assertEqual` on the DGX's GPUs (using separate conda environments for the Megatron-LM model and the transformers model).
- On the CPU, the tests do not pass with `assertEqual`, but they do pass with `assertAlmostEqual`, using a tolerance of 0.05 for the logits after the `LayerNorm` on the embedding layer and a tolerance of 1e-06 on the final logits (see the tolerance sketch below). Check the tests here. This non-exactness seems to be expected and there is not much we can do about it, according to pytorch/pytorch#76052.
- Added simple reconstruction and encoding tests for the BigScience tokenizer.
📌 Tips for conversion
- Explicitly specifying the dtype of your modules when initializing them seems to help ensure exact reproducibility - added a `dtype` argument to the config file.
- Concatenating the weights from row-parallelized layers seems to return different results from the original model; I made a reproducible script and raised an issue, pytorch/pytorch#76232. The solution for now is to manually aggregate the results across each TP rank (see the sketch after this list). Needs further investigation for a possible improvement of the conversion.
✅ Next steps
- Fix integration tests on the PR thomwolf/transformers#2
- Define which checkpoint to use for the next tests
- Convert the model with the selected checkpoints and compare the hidden-state values between the two models -> fixed some issues in this new commit a4fa70c1a5042fdca7d0fbf26b0aad6ca99fdadc
- `MixedFusedLayerNorm` and `FusedScaledSoftmax` seem to be replaceable by `LayerNorm` and `Softmax` from `torch.nn`, respectively. Verify this assumption on the new checkpoints (see the sketch after this list).
- Convert a sharded version of the large model and try the tests on that
cc @thomwolf @suzana-ilic @thomasw21 @stas00
Motivation
The feature request is related to the BigScience workshop, where a large Language Model is currently being trained using Megatron-LM.
Your contribution
Ultimately, submitting a PR to add the BigScience-176B model to the transformers library, while ensuring the exactness of the operations between the converted model and the original model trained with Megatron-LM.
As a side note, I’m working on a solution to do model parallelism/offload while maximizing the GPU(s) memory/RAM available which should be useful to run this model on all kinds of setups (albeit more slowly). Should land in Accelerate in the coming weeks 😃
Also I think the big picture is we want to generate ASAP. (perhaps even before checking exactitude of conversion :S)