Custom data folder caching is relative instead of absolute

Most of the datasets rely on HF, and in particular its cache handling. However, there are a few tasks where the dataset download/caching is done manually. https://github.com/EleutherAI/lm-evaluation-harness/blob/master/lm_eval/tasks/lambada.py#L11-L23

This is usually not a problem if one always runs evaluation through the main script at the root of the repo. But as soon as we run it from another path, the caching no longer works and a new data folder is generated. For example, we tried using this repo and creating our own custom bindings to run the harness on a Megatron-DeepSpeed model. https://github.com/bigscience-workshop/Megatron-DeepSpeed/pull/212#discussion_r766754427 . The reason is that the caching mechanism currently uses relative paths, which are resolved against the current working directory, instead of an absolute path.
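To illustrate the behaviour described above, here is a minimal sketch (not the harness's actual code) showing that the same relative path resolves to different locations depending on the working directory. The `RELATIVE_CACHE` path and the helper names are hypothetical stand-ins for the `data/...` paths mentioned in this issue.

```python
import os
import tempfile
from pathlib import Path
from typing import Tuple

# Hypothetical relative cache location, standing in for the "data/..." paths in the issue.
RELATIVE_CACHE = Path("data") / "lambada"


def resolved_cache_dir() -> Path:
    """Return the absolute location the relative path points to right now."""
    # resolve() anchors a relative path at the current working directory.
    return RELATIVE_CACHE.resolve()


def demo() -> Tuple[Path, Path]:
    """Resolve the same relative path from two different working directories."""
    original_cwd = os.getcwd()
    try:
        with tempfile.TemporaryDirectory() as dir_a, tempfile.TemporaryDirectory() as dir_b:
            os.chdir(dir_a)
            first = resolved_cache_dir()
            os.chdir(dir_b)
            second = resolved_cache_dir()
            # Leave the temporary directories before they are cleaned up.
            os.chdir(original_cwd)
            return first, second
    finally:
        os.chdir(original_cwd)


if __name__ == "__main__":
    first, second = demo()
    print(first)
    print(second)
```

The two printed paths differ, which is why launching the evaluation from another directory ends up with a fresh data folder instead of reusing the existing one.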

I think we should replace data/* with an absolute path, in order to have a single cache. HF datasets defaults to a cache under ~/.cache, and it exposes an overridable environment variable, HF_DATASETS_CACHE, to point to a specific path https://github.com/bigscience-workshop/Megatron-DeepSpeed/pull/212#discussion_r766754427 . Is that something you’d be willing to add to the repo? @leogao2
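A rough sketch of what this proposal could look like, assuming a default location under `~/.cache` and an environment-variable override in the spirit of `HF_DATASETS_CACHE`. The variable name `LM_EVAL_DATA_CACHE`, the default directory, and the function `get_cache_dir` are all hypothetical and are not existing harness settings.

```python
import os
from pathlib import Path

# Hypothetical override variable, analogous to HF_DATASETS_CACHE; not a real harness setting.
CACHE_ENV_VAR = "LM_EVAL_DATA_CACHE"

# Hypothetical default cache root, chosen here only for illustration.
DEFAULT_CACHE_ROOT = Path.home() / ".cache" / "lm_eval" / "data"


def get_cache_dir(task_name: str, create: bool = False) -> Path:
    """Return an absolute cache directory for a task, independent of the working directory."""
    override = os.environ.get(CACHE_ENV_VAR)
    if override:
        # Resolve once so a relative override still yields an absolute path.
        root = Path(override).expanduser().resolve()
    else:
        root = DEFAULT_CACHE_ROOT
    cache_dir = root / task_name
    if create:
        cache_dir.mkdir(parents=True, exist_ok=True)
    return cache_dir


if __name__ == "__main__":
    print(get_cache_dir("lambada"))
```

With something along these lines, each manually cached task would call `get_cache_dir("lambada")` instead of hard-coding `data/lambada`, so every entry point shares one cache regardless of where it is launched from.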

Issue Analytics

  • State: closed
  • Created: 2 years ago
  • Reactions: 1
  • Comments: 8 (6 by maintainers)

Top GitHub Comments

1 reaction
thomasw21 commented, Apr 4, 2022

Maybe @lhoestq is the best one to answer such questions.

1 reaction
jon-tow commented, Apr 4, 2022

Oh yeah! That was definitely in the plans - I just haven’t set aside the time to do it 😅. The main candidates to upstream over there are:

  • GSM8K
  • MuTual
  • SAT_Analogies
  • TruthfulQA

I’ll create separate PRs for each by the end of the week.
@thomasw21 is it preferred to add datasets via the Hub following https://huggingface.co/docs/datasets/share or as PRs to the repo? Thanks!
