Logging issue with 0.9.0 and current dev branch

** Environment **

OS: Ubuntu 20.04
Hardware (GPU, or instance type): 8xA100
cuda: 11.3
cudnn: 8
pytorch: 1.12.1
composer: dev branch (installed from source) and 0.9.0 (installed from pip)
transformers: 4.21.2

** To reproduce **

I have the following definition of a BLOOM model, mostly copied from the GPT-2 definition in composer.

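from typing import Optional

import transformers
from composer.metrics.nlp import HFCrossEntropy, Perplexity
from composer.models import ComposerModel
from composer.models.huggingface import HuggingFaceModel


def create_bloom(
    model_name: str,
    tokenizer_name: str,
    use_pretrained: Optional[bool] = False,
    model_config: Optional[dict] = None,
    gradient_checkpointing: Optional[bool] = False,
) -> ComposerModel:

    if not model_config:
        model_config = {}

    if use_pretrained:
        # Load the pretrained weights along with the architecture.
        model = transformers.AutoModelForCausalLM.from_pretrained(model_name, **model_config)
    else:
        # Build a randomly initialized model from the config only.
        config = transformers.AutoConfig.from_pretrained(model_name, **model_config)
        model = transformers.AutoModelForCausalLM.from_config(config)

    tokenizer = transformers.AutoTokenizer.from_pretrained(tokenizer_name)

    if gradient_checkpointing:
        model.gradient_checkpointing_enable()

    # Wrap the HF model so composer tracks cross entropy and perplexity.
    return HuggingFaceModel(model=model, tokenizer=tokenizer, metrics=[HFCrossEntropy(), Perplexity()])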

There are two issues: one with the 0.9.0 release and the other with the dev branch.

Steps to reproduce the behavior:

Running LM training with gradient accumulation on 0.9.0 doesn’t plot the HF metrics in wandb, although the step counts are correct for the metrics that do get logged.

You can see that the logs don’t show Perplexity and CrossEntropy metrics.

Running LM training with gradient accumulation on the dev branch does plot the HF metrics, but the step count used when plotting them is completely wrong.

You can see metrics being plotted for 266 steps even though only 38 batches have been trained.
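
One observation about those two numbers: 266 is exactly 38 × 7. The sketch below only illustrates that arithmetic. It is not composer code, and the issue does not state the gradient accumulation value, so `grad_accum = 7` and the helper `count_logged_steps` are assumptions made for illustration. It shows how a step counter advanced once per microbatch, rather than once per batch, would produce this kind of inflated count.

```python
def count_logged_steps(num_batches: int, grad_accum: int, log_per_microbatch: bool) -> int:
    """Return the final value of a hypothetical logging step counter.

    This toy loop is NOT composer's training loop. It only models two ways
    a logger could advance its step while gradients are being accumulated.
    """
    logged_step = 0
    for _ in range(num_batches):
        for _ in range(grad_accum):
            # One forward/backward pass per microbatch.
            if log_per_microbatch:
                logged_step += 1
        # The optimizer step happens once per batch.
        if not log_per_microbatch:
            logged_step += 1
    return logged_step


batches_trained = 38   # from the report
observed_steps = 266   # from the report

# How many logged steps per trained batch do the reported numbers imply?
ratio, remainder = divmod(observed_steps, batches_trained)

grad_accum = 7  # assumed value; not stated in the issue
per_batch = count_logged_steps(batches_trained, grad_accum, log_per_microbatch=False)
per_microbatch = count_logged_steps(batches_trained, grad_accum, log_per_microbatch=True)

print(f"ratio={ratio}, remainder={remainder}")
print(f"per-batch counter: {per_batch}, per-microbatch counter: {per_microbatch}")
```

If the run really did use 7 microbatches per batch, the per-microbatch counter would match the 266 steps seen in the plot. That would be consistent with the metrics being logged on every microbatch, though this is a guess and not a confirmed diagnosis.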