[App]: Unsuccessful runs not being run in wandb sweep interface


Current Behavior

I have only used grid search in wandb sweeps, so the issue below is confirmed for grid search only. It may also affect other hyperparameter search strategies, but I have not explored that yet.

  1. In the main sweep interface of the web app (https://wandb.ai/meghbhalerao/ddist/sweeps?workspace=user-meghbhalerao; the link is specific to my username, but the same applies to other users), the columns showing the run counts are buggy. The estimated run count is fine, since it is fixed throughout the sweep: for a grid search, it is the product of the numbers of discrete values of each hyperparameter. I describe the current behavior in the following points.
  2. Assume that, for whatever reason, a few runs in a sweep fail, crash, or are otherwise unsuccessful. In that case, I would like to rerun those runs. To do so, I go inside the wandb sweep, select the runs using the checkboxes, and delete them by clicking the trashcan icon in the UI. The runs are deleted and I can no longer see them inside the sweep. I would expect the wandb agents to take care of the rest, since a wandb agent does the backend job of communicating with the central sweep server to check which runs to run next and which runs are yet to be run.
  3. The issue arises when I go back out to the main sweep interface (https://wandb.ai/meghbhalerao/ddist/sweeps?workspace=user-meghbhalerao, the same link as in point 1): the run count does not decrease. It should decrease, since I deleted some runs in point 2.
  4. This would not be very critical if it were simply a display problem, but the issue is more involved. Since the run count does not decrease even after deleting the crashed or failed (that is, unsuccessful) runs, the sweep simply finishes once the run count reaches the number of estimated runs, without having run all the hyperparameter combinations. This means that some hyperparameter combinations remain untested even though the sweep has finished. I have verified that the sweep has finished: when I try to run the agent (from my local machine using wandb agent path/agent/), it says that the sweep has been completed.
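
To make the arithmetic in points 1 and 4 concrete, here is a minimal sketch in plain Python (it does not call the wandb API). The parameter names and values in `grid` are hypothetical, chosen only for illustration; the helper functions are likewise not part of wandb.

```python
from itertools import product
from math import prod

# Hypothetical grid-search parameters; names and values are illustrative only.
grid = {
    "learning_rate": [0.001, 0.01, 0.1],
    "batch_size": [16, 32],
    "optimizer": ["adam", "sgd"],
}


def estimated_runs(grid):
    """Number of grid combinations: the product of the value counts."""
    return prod(len(values) for values in grid.values())


def all_combinations(grid):
    """Enumerate every hyperparameter combination in the grid as a dict."""
    keys = list(grid)
    return [dict(zip(keys, combo)) for combo in product(*(grid[k] for k in keys))]


print(estimated_runs(grid))         # 3 * 2 * 2 = 12
print(len(all_combinations(grid)))  # 12
```

With this grid, if 3 of the 12 runs crash and are deleted but the run count still reads 12, the behavior described above would leave 3 combinations without a successful run.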

Expected Behavior

When I delete runs from inside the sweep (runs in any state: crashed, failed, or even successful), the run count should decrease in the sweep UI (https://wandb.ai/meghbhalerao/ddist/sweeps?workspace=user-meghbhalerao, the same link as above), and the deletion should also be reflected in the central sweep server that manages the runs for the wandb agents. This way the deleted runs are actually ‘forgotten’ by the central sweep server, so that they can be rerun by the wandb agents.

Steps To Reproduce

  1. Create a minimal sweep from any yaml config file using the following command: wandb sweep sweep.yaml.
  2. Run one or more agents on the command line using wandb agent <USERNAME/PROJECTNAME/SWEEPID>.
  3. Before the runs have completed, kill one or more of the processes launched by these agents.
  4. Go to the sweep UI of this sweep (as described above), open the sweep table section on the left side, and delete runs manually using the checkboxes. Ideally these could be any runs, but for now delete the failed ones; there should be some runs that are neither running nor completed, since you killed them before completion.
  5. Go to the main sweep UI, which shows all of your sweeps. You should see that the run count has not decreased even though you deleted runs from inside the sweep.
  6. You can also observe that the sweep finishes when the run count equals the estimated runs.
  7. As a result of point 6, the sweep does not cover all of the hyperparameter combinations.
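
One way to check which combinations were left untested after the sweep reports completion is sketched below. It is plain Python and makes no wandb API calls: it assumes you have already collected, by whatever means, the configs of the runs that finished successfully into a list of dicts. The function name `missing_combinations` and the example values are hypothetical.

```python
from itertools import product


def missing_combinations(grid, finished_configs):
    """Return the grid combinations that have no finished run."""
    keys = sorted(grid)
    # Represent each finished run as a tuple of its values, in key order.
    done = {tuple(cfg[k] for k in keys) for cfg in finished_configs}
    return [
        dict(zip(keys, combo))
        for combo in product(*(grid[k] for k in keys))
        if combo not in done
    ]


# Hypothetical grid and finished runs, for illustration only.
grid = {"learning_rate": [0.01, 0.1], "batch_size": [16, 32]}
finished = [
    {"learning_rate": 0.01, "batch_size": 16},
    {"learning_rate": 0.1, "batch_size": 16},
    {"learning_rate": 0.1, "batch_size": 32},
]

missing = missing_combinations(grid, finished)
print(missing)  # [{'batch_size': 32, 'learning_rate': 0.01}]
```

The combinations returned here are the ones that would need to be rerun, for example in a new sweep restricted to those values.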

Screenshots

No response

Environment

OS: Linux CentOS 7 and Ubuntu 18.04.5

Browsers: Safari

Version: -

Additional Context

No response

Issue Analytics

  • State: open
  • Created: a year ago
  • Reactions: 1
  • Comments: 11

Top GitHub Comments

MBakirWB commented, Nov 15, 2022 (1 reaction)

Hi @pdeubel, thank you for providing a link to your sweep. I can access the link. I’ll be running some additional tests today and attempt to reproduce this behavior. It may be related to how we process nested configs. I will update you soon.

pdeubel commented, Dec 6, 2022 (0 reactions)

Thanks for your time and in-depth tests. Indeed, I refactored some code between starting the sweep and restarting it, but the changes did not (or at least should not have) altered the logic of the code. However, that may be why some runs were duplicated, given that you could not reproduce it. Unfortunately, I do not recall how I restarted the sweep; it was probably through the Python API. I may also have deleted failed runs while the sweep was still running. I’ll try to avoid that in the future; it was not clear to me that this could cause problems.
