Trainer not logging to WandB in SageMaker

  • transformers version: 4.3.0
  • wandb version: 0.10.20
  • Platform: SageMaker hosted training with PyTorch estimator.
  • Using GPU in script?: Yes
  • Using distributed or parallel set-up in script?: No

@stas00 @sgugger

I am using a SageMaker training environment to train BertForSequenceClassification. To do this, I’m passing the model into a Trainer instance and calling trainer.train().

To train in SageMaker, I am using a PyTorch estimator:

from sagemaker.pytorch import PyTorch

# role, sagemaker_session, hp, subnets, sec_groups and instance_type
# are defined elsewhere (not shown).
estimator = PyTorch(
    entry_point='train_classifier.py',
    source_dir='./',
    role=role,
    sagemaker_session=sagemaker_session,
    hyperparameters=hp,
    subnets=subnets,
    security_group_ids=sec_groups,
    framework_version='1.6.0',
    py_version='py3',
    instance_count=1,
    instance_type=instance_type,
    dependencies=['../lib', '../db_conn'],
    use_spot_instances=False,
    volume_size=100,
    # max_wait=max_wait_time_secs
)
estimator.fit()

I have tried this with different p2 and p3 instances.

In EC2 or in a SageMaker notebook, the Trainer automatically logs training loss, evaluation loss, and metrics to WandB. With the estimator, I get no training logs.
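
One way to narrow down a difference between the two environments is to print the versions of the relevant packages from inside the training job and compare them with the notebook. The following is a minimal sketch using only the standard library; the helper name `report_versions` is made up for illustration, and it assumes each package exposes a `__version__` attribute.

```python
import importlib


def report_versions(module_names):
    """Return {module name: version string}, or None for modules that fail to import."""
    versions = {}
    for name in module_names:
        try:
            module = importlib.import_module(name)
        except ImportError:
            # Not installed (or not importable) in this environment
            versions[name] = None
            continue
        # Fall back to "unknown" when the module has no __version__ attribute
        versions[name] = getattr(module, "__version__", "unknown")
    return versions


# "json" is only a stand-in that is always importable; in the training script
# the list would be something like ["transformers", "wandb", "torch"].
print(report_versions(["json", "no_such_module_xyz"]))
```

Running the same call in the notebook and at the top of train_classifier.py would show whether the job sees the same transformers and wandb versions.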

Anything that I manually log to WandB appears in my dashboard. The only info that doesn’t show up is whatever used to get logged by the Trainer.
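
Since manual logging works, one thing worth checking is whether a W&B callback is registered on the Trainer at all inside the training job. Below is a hedged sketch: `find_callbacks` is a hypothetical helper, and the two stand-in classes exist only so the snippet runs without transformers installed. It assumes the Trainer keeps its callbacks in `trainer.callback_handler.callbacks`, as transformers 4.x does.

```python
def find_callbacks(callbacks, name_fragment):
    """Return class names of callbacks whose class name contains name_fragment."""
    fragment = name_fragment.lower()
    names = []
    for cb in callbacks:
        # Entries may be callback classes or callback instances
        cls = cb if isinstance(cb, type) else type(cb)
        if fragment in cls.__name__.lower():
            names.append(cls.__name__)
    return names


# Stand-ins for real callback classes, for illustration only.
class DefaultFlowCallback:
    pass


class WandbCallback:
    pass


registered = [DefaultFlowCallback(), WandbCallback()]
print(find_callbacks(registered, "wandb"))
```

In the training script, the call would be `find_callbacks(trainer.callback_handler.callbacks, "wandb")`. If it returns an empty list, registering the callback explicitly with `trainer.add_callback(WandbCallback)` (imported from `transformers.integrations`) may be worth trying; whether that resolves this particular issue is not confirmed here.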

I tried setting os.environ["WANDB_DISALBED"] = "false" in my training script, but it made no difference.
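
Note that the name as written above, `WANDB_DISALBED`, differs from `WANDB_DISABLED`, the variable that transformers reads; a misspelled environment variable is silently ignored. As a sketch, a script could flag `WANDB_`-prefixed names that look like typos. The helper name and the list of known names below are assumptions for illustration, and the list is not exhaustive.

```python
import difflib

# Illustrative, non-exhaustive list of W&B-related environment variable names.
KNOWN_WANDB_VARS = [
    "WANDB_DISABLED",
    "WANDB_API_KEY",
    "WANDB_PROJECT",
    "WANDB_ENTITY",
    "WANDB_MODE",
]


def suspicious_wandb_vars(environ):
    """Map each unrecognized WANDB_* name to its closest known name (or None)."""
    findings = {}
    for name in environ:
        if not name.startswith("WANDB_") or name in KNOWN_WANDB_VARS:
            continue
        # Suggest the most similar known name, if any is close enough
        close = difflib.get_close_matches(name, KNOWN_WANDB_VARS, n=1, cutoff=0.8)
        findings[name] = close[0] if close else None
    return findings


# A plain dict is used here; in a real script the argument would be os.environ.
print(suspicious_wandb_vars({"WANDB_DISALBED": "false", "WANDB_PROJECT": "demo"}))
```

A name that maps to `None` is simply not in the illustrative list, which does not by itself mean it is wrong.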

Issue Analytics

  • State: closed
  • Created: 3 years ago
  • Comments: 27 (14 by maintainers)

Top GitHub Comments

alexf-a commented, Mar 9, 2021 (1 reaction)

Yup

github-actions[bot] commented, May 2, 2021 (0 reactions)

This issue has been automatically marked as stale because it has not had recent activity. If you think this still needs to be addressed please comment on this thread.

Please note that issues that do not follow the contributing guidelines are likely to be ignored.


Top Results From Across the Web

SageMaker - Documentation - Weights & Biases - Wandb
W&B looks for a file named secrets.env relative to the training script and loads them into the environment when wandb.init() is called.

jambran/wandb_sagemaker_bug_report: Minimal code to ... - GitHub
I'm trying to train on sagemaker, but I can't get a successful training job to complete. I can remove the wandb logging code,...

Technical FAQ · GitBook
When wandb.init() is called from your training script an API call is made to ... Calling wandb.log writes a line to a local...

"No space left on device" when using HuggingFace + ...
I'm not sure what is triggering this problem because the volume size ... using a HuggingFace estimator in SageMaker pipelines training job.

AWS SageMaker Experiments with Weights and Biases
Problem Statement; Dataset; Set up the experiment; Track experiment; Accessing Training Metrics using Experiments UI from SageMaker Studio ...
