Custom data folder caching is relative instead of absolute
Most of the datasets rely on HF, and in particular on its cache handling. However, there are a few tasks where the dataset download/caching is done manually: https://github.com/EleutherAI/lm-evaluation-harness/blob/master/lm_eval/tasks/lambada.py#L11-L23
This is usually not a problem if one always uses the main script at the root of the repo to run evaluation. But as soon as we run it from another path, the caching no longer works and a new data folder is generated in the current working directory. For example, we tried using this repo and creating our own custom bindings to run the harness on a Megatron-DeepSpeed model: https://github.com/bigscience-workshop/Megatron-DeepSpeed/pull/212#discussion_r766754427 . The reason is that the caching mechanism currently uses relative paths instead of an absolute path.
I think we should replace data/* with an absolute path, in order to have a single cache. HF datasets defaults to a location under ~/.cache, and it exposes an overridable flag, HF_DATASETS_CACHE, to point to a specific path: https://github.com/bigscience-workshop/Megatron-DeepSpeed/pull/212#discussion_r766754427 . Is that something you’d be willing to add to the repo? @leogao2
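To make the proposal concrete, here is a minimal sketch of how a single absolute cache location with an override could be resolved. It is not the harness's actual code: the environment variable name LM_EVAL_CACHE, the default folder ~/.cache/lm_eval, and the helper get_data_dir are all hypothetical names picked for illustration.

```python
import os
from pathlib import Path

# Hypothetical variable name, chosen for illustration only.
CACHE_ENV_VAR = "LM_EVAL_CACHE"


def get_data_dir(task_name: str) -> Path:
    """Return an absolute, per-task cache directory, creating it if needed."""
    override = os.environ.get(CACHE_ENV_VAR)
    if override:
        # The user pointed the cache somewhere specific.
        base = Path(override).expanduser()
    else:
        # Assumed default, mirroring the ~/.cache convention mentioned above.
        base = Path.home() / ".cache" / "lm_eval"
    # resolve() makes the path absolute, so it no longer depends on the
    # directory the evaluation script was launched from.
    data_dir = (base / task_name).resolve()
    data_dir.mkdir(parents=True, exist_ok=True)
    return data_dir
```

A task that currently writes to a relative data/lambada folder could call get_data_dir("lambada") instead, and would then get the same folder regardless of the working directory.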
Issue Analytics
- Created 2 years ago
- Reactions: 1
- Comments: 8 (6 by maintainers)
Top GitHub Comments
Maybe @lhoestq is the best one to answer such questions.
Oh yeah! That was definitely in the plans - I just haven’t set aside the time to do it 😅. The main candidates to upstream over there are:
GSM8K
MuTual
SAT_Analogies
TruthfulQA
I’ll create separate PRs for each by the end of the week.
@thomasw21 is it preferred to add datasets via the Hub, following https://huggingface.co/docs/datasets/share , or as PRs to the repo? Thanks!