Trainer not logging to WandB in SageMaker

transformers version: 4.3.0
wandb version: 0.10.20
Platform: SageMaker hosted training with PyTorch estimator.
Using GPU in script?: Yes
Using distributed or parallel set-up in script?: No

@stas00 @sgugger

I am using a SageMaker training environment to train BertForSequenceClassification. To do this, I’m passing the model into a Trainer instance and calling trainer.train().

To train in SageMaker, I am using a PyTorch estimator:

from sagemaker.pytorch import PyTorch

# role, sagemaker_session, hp, subnets, sec_groups and instance_type
# are defined earlier in the launching notebook or script.
estimator = PyTorch(
    entry_point='train_classifier.py',
    source_dir='./',
    role=role,
    sagemaker_session=sagemaker_session,
    hyperparameters=hp,
    subnets=subnets,
    security_group_ids=sec_groups,
    framework_version='1.6.0',
    py_version='py3',
    instance_count=1,
    instance_type=instance_type,
    dependencies=['../lib', '../db_conn'],
    use_spot_instances=False,
    volume_size=100,
    # max_wait=max_wait_time_secs
)
estimator.fit()

I have tried this with different p2 and p3 instances.

On EC2 or in a SageMaker notebook, the Trainer automatically logs training loss, evaluation loss, and metrics to WandB. With the estimator, I get no training logs.

Anything that I manually log to WandB appears in my dashboard. The only info that doesn’t show up is whatever used to get logged by the Trainer.
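
One way to narrow this down is to print what the training process can actually see once it runs inside the container. The sketch below uses only the standard library. The helper name wandb_visibility is hypothetical, and the choice of signals (package importable, WANDB_API_KEY set, WANDB_DISABLED set) is an assumption about what typically matters for W&B setup, not something confirmed in this thread.

```python
import importlib.util
import os


def wandb_visibility(env=None):
    """Report what a training process can see about its W&B setup.

    Hypothetical diagnostic helper; it does not import wandb or transformers.
    """
    env = os.environ if env is None else env
    return {
        # True if the wandb package is importable in this interpreter.
        "wandb_installed": importlib.util.find_spec("wandb") is not None,
        # True if an API key is supplied through the environment.
        "api_key_in_env": bool(env.get("WANDB_API_KEY")),
        # Raw value of WANDB_DISABLED, or None if it is unset.
        "wandb_disabled": env.get("WANDB_DISABLED"),
    }


if __name__ == "__main__":
    # Run at the very top of the entry-point script, before other imports.
    print(wandb_visibility())
```

Comparing this output between the notebook and the estimator job may show whether the two environments differ in a way that affects the Trainer's automatic logging.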

I tried setting os.environ["WANDB_DISABLED"] = "false" in my training script, with no luck.
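
For comparison, here is a sketch of the opposite approach: unset the variable instead of assigning it a value, and do so before transformers is imported. It rests on two assumptions that this thread does not confirm: that the integration in this version may treat any non-empty WANDB_DISABLED value as "disabled", and that it may decide W&B availability at import time. prepare_wandb_env and build_training_args are hypothetical helpers. report_to is a TrainingArguments parameter in recent transformers releases, but check that the installed version supports it.

```python
import os


def prepare_wandb_env(env=None, api_key=None):
    """Remove WANDB_DISABLED and optionally set WANDB_API_KEY.

    Hypothetical helper; call it before importing transformers.
    """
    env = os.environ if env is None else env
    # Unset rather than set to "false": some versions may only check
    # whether the variable is non-empty (assumption, not confirmed here).
    env.pop("WANDB_DISABLED", None)
    if api_key:
        env["WANDB_API_KEY"] = api_key
    return env


def build_training_args(output_dir):
    """Build TrainingArguments that explicitly request W&B reporting."""
    # Imported lazily so the environment is prepared first.
    from transformers import TrainingArguments

    return TrainingArguments(output_dir=output_dir, report_to=["wandb"])
```

In the entry-point script, this would mean calling prepare_wandb_env() as the first statement and build_training_args(...) only afterwards, then passing the result to the Trainer as usual.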

Issue Analytics

Created 3 years ago
Comments: 27 (14 by maintainers)

Top GitHub Comments

1 reaction

alexf-a commented, Mar 9, 2021

Yup

0 reactions

github-actions[bot] commented, May 2, 2021

This issue has been automatically marked as stale because it has not had recent activity. If you think this still needs to be addressed please comment on this thread.

Please note that issues that do not follow the contributing guidelines are likely to be ignored.