Custom data folder caching is relative instead of absolute

Most of the datasets rely on HF, and in particular on its cache handling. However, there are a few tasks where the dataset download/caching is done manually. https://github.com/EleutherAI/lm-evaluation-harness/blob/master/lm_eval/tasks/lambada.py#L11-L23

This is usually not a problem as long as evaluation is always run from the main script at the root of the repo. But as soon as we run it from another path, the caching no longer works and a new data folder is generated. For example, we tried using this repo and creating our own custom bindings to run the harness on a Megatron-DeepSpeed model. https://github.com/bigscience-workshop/Megatron-DeepSpeed/pull/212#discussion_r766754427 . The reason is that the caching mechanism currently uses relative paths instead of an absolute path.
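To illustrate the behaviour described above, here is a minimal sketch (not the harness's actual code) showing that the same relative path resolves to a different location depending on the working directory. The helper names and the `data/lambada` path are only illustrative assumptions.

```python
import os
import tempfile
from pathlib import Path


def resolve_relative_cache(relative="data/lambada"):
    # A relative path is resolved against the current working directory,
    # so the result changes whenever the script is launched from elsewhere.
    # "data/lambada" is an illustrative path, not necessarily the real one.
    return Path(relative).resolve()


def demo():
    # Resolve the same relative path from two different working directories.
    original_cwd = Path.cwd()
    results = []
    with tempfile.TemporaryDirectory() as dir_a, tempfile.TemporaryDirectory() as dir_b:
        try:
            for directory in (dir_a, dir_b):
                os.chdir(directory)
                results.append(resolve_relative_cache())
        finally:
            # Always restore the original working directory.
            os.chdir(original_cwd)
    return results


if __name__ == "__main__":
    first, second = demo()
    print(first)
    print(second)
    print("same location:", first == second)
```

Each launch directory ends up with its own `data` folder, which is why the dataset gets downloaded again.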

I think we should replace data/* with an absolute path, in order to have a single cache. HF datasets defaults to ~/.cache, and it exposes an environment variable, HF_DATASETS_CACHE, that can be overridden to point to a specific path https://github.com/bigscience-workshop/Megatron-DeepSpeed/pull/212#discussion_r766754427 . Is that something you’d be willing to add to the repo? @leogao2
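A rough sketch of what such a change could look like, assuming a single absolute cache root with an environment-variable override. The variable name `LM_EVAL_DATA_DIR`, the default `~/.cache/lm_eval` location, and the `get_cache_dir` helper are all hypothetical and are not part of the harness.

```python
import os
from pathlib import Path

# Hypothetical environment variable name, chosen for illustration only.
CACHE_ENV_VAR = "LM_EVAL_DATA_DIR"


def get_cache_dir(task_name: str) -> Path:
    """Return an absolute, per-task cache directory, creating it if needed."""
    override = os.environ.get(CACHE_ENV_VAR)
    if override:
        # User-supplied location takes precedence.
        base = Path(override).expanduser()
    else:
        # Assumed default, mirroring the ~/.cache convention mentioned above.
        base = Path.home() / ".cache" / "lm_eval"
    # resolve() makes the path absolute, so it no longer depends on the
    # directory the script was launched from once it has been computed.
    path = base.resolve() / task_name
    path.mkdir(parents=True, exist_ok=True)
    return path
```

A task would then call something like `get_cache_dir("lambada")` in place of a hard-coded `data/...` path, so every entry point shares one cache.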

Issue Analytics

State:
Created 2 years ago
Reactions: 1
Comments: 8 (6 by maintainers)

Top GitHub Comments

1 reaction

thomasw21 commented, Apr 4, 2022

Maybe @lhoestq is the best one to answer such questions.

1 reaction

jon-tow commented, Apr 4, 2022

Oh yeah! That was definitely in the plans - I just haven’t set aside the time to do it 😅. The main candidates to upstream over there are:

GSM8K
MuTual
SAT_Analogies
TruthfulQA

I’ll create separate PRs for each by the end of the week.
@thomasw21 is it preferred to add datasets via the Hub following https://huggingface.co/docs/datasets/share or as PRs to the repo? Thanks!