scan_history does not efficiently retrieve sparse metrics
Describe the bug
Calling `scan_history` returns the very first data point but doesn't seem to make any subsequent progress. The script appears to be stuck in wandb networking code; sample stack trace:
Thread 1494 (idle): "MainThread"
read (ssl.py:911)
recv_into (ssl.py:1052)
readinto (socket.py:589)
_read_status (http/client.py:257)
begin (http/client.py:296)
getresponse (http/client.py:1321)
_make_request (urllib3/connectionpool.py:383)
urlopen (urllib3/connectionpool.py:603)
send (requests/adapters.py:449)
send (requests/sessions.py:646)
request (requests/sessions.py:533)
request (requests/api.py:60)
post (requests/api.py:116)
execute (gql/transport/requests.py:38)
_get_result (gql/client.py:60)
execute (gql/client.py:52)
execute (wandb/apis/public.py:178)
__call__ (wandb/old/retry.py:96)
wrapped_fn (wandb/old/retry.py:132)
_load_next (wandb/apis/public.py:2002)
__call__ (wandb/old/retry.py:96)
wrapped_fn (wandb/old/retry.py:132)
wrapper (wandb/apis/normalize.py:24)
__next__ (wandb/apis/public.py:1975)
fetch_run_data (plot_results.py:15)
<module> (plot_results.py:23)
The run I'm trying to download has a very large number of data points, but the requested metric is present in only 26 rows, and the run page now has trouble loading it.
To Reproduce
My code:
import wandb
import numpy as np
from typing import Tuple

def fetch_run_data(descriptor: str, metric: str) -> Tuple[np.ndarray, np.ndarray]:
    api = wandb.Api()
    runs = api.runs("cswinter/deep-codecraft-vs", {"config.descriptor": descriptor})
    run = runs[0]

    step = []
    value = []
    vals = run.scan_history(keys=[metric, '_step'], page_size=1000, min_step=None, max_step=None)
    for entry in vals:
        if metric in entry:
            print('yay')
            print(entry)
            step.append(entry['_step'])
            value.append(entry[metric])
    return np.array(step), np.array(value)

frames, step = fetch_run_data("f2034f-hpsetstandard", "eval_mean_score")
# plot data...
It just outputs this and then hangs for at least several minutes:
yay
{'eval_mean_score': -0.9901801347732544, '_step': 0}
Expected behavior
I expect the full set of data points to be retrieved within seconds.
Operating System: WSL 1
wandb version: 0.10.12
Issue Analytics
- State:
- Created: 3 years ago
- Comments: 7 (3 by maintainers)
Top GitHub Comments
Hmm ok, `history` does seem to do what I want when setting the right `keys`. Is it guaranteed to return the full underlying dataset without any sampling/interpolation when setting the `samples` param higher than the number of data points?

Gotcha, the pagination logic for scanning the entire history needs to assume there could be a data point at every step, so it needs to page across the entire 125 million steps, which is what makes it so slow. We're working on a big overhaul of our metrics backend, so hopefully we can address this case better in the future.
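Following the workaround discussed above, one can fetch the sparse metric with `run.history` (setting `samples` above the number of logged steps) and then drop the rows where the metric is absent. A minimal sketch, assuming the project path and metric name from this issue; the `extract_sparse_metric` helper is hypothetical, and the live `wandb` calls are left in comments since they require network access and credentials:

```python
import math
import numpy as np

def extract_sparse_metric(rows, metric):
    """Keep only the rows that actually contain the sparse metric.

    `rows` is a list of dicts such as wandb's history APIs return;
    rows missing the metric (or holding NaN) are skipped.
    """
    steps, values = [], []
    for row in rows:
        v = row.get(metric)
        if v is None or (isinstance(v, float) and math.isnan(v)):
            continue
        steps.append(row["_step"])
        values.append(v)
    return np.array(steps), np.array(values)

# Hypothetical usage against the run from this issue:
# api = wandb.Api()
# run = api.runs("cswinter/deep-codecraft-vs",
#                {"config.descriptor": "f2034f-hpsetstandard"})[0]
# rows = run.history(keys=["eval_mean_score"], samples=100_000, pandas=False)
# step, value = extract_sparse_metric(rows, "eval_mean_score")
```

Whether `history` with a large `samples` value truly bypasses sampling is exactly the open question in the comment above, so the result should still be sanity-checked against the expected row count (26 in this case).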