the_pile datasets URL broken.


https://github.com/huggingface/datasets/pull/3627 changed the Eleuther AI Pile dataset URL from https://the-eye.eu/ to https://mystic.the-eye.eu/ but the latter is now broken and the former works again.

Note that when I git clone the repo, run pip install -e ., and then edit the URL back, the codebase doesn’t seem to use this edit, so the mystic URL must also be cached somewhere else that I can’t find?
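One way to hunt for such a stale copy is to search the library's on-disk caches for the old host name. The sketch below is only a minimal illustration: the helper name `find_files_containing` is hypothetical, and the directory `~/.cache/huggingface/modules` is an assumed default location for cached loading scripts that may differ by library version or environment settings.

```python
from pathlib import Path


def find_files_containing(root, needle):
    """Return the sorted paths of .py files under `root` whose text contains `needle`."""
    root = Path(root).expanduser()
    if not root.is_dir():
        # Nothing to search if the directory does not exist.
        return []
    hits = []
    for path in root.rglob("*.py"):
        try:
            text = path.read_text(encoding="utf-8", errors="ignore")
        except OSError:
            # Skip files that cannot be read.
            continue
        if needle in text:
            hits.append(path)
    return sorted(hits)


if __name__ == "__main__":
    # Assumed cache location; adjust it to match your own setup.
    for hit in find_files_containing("~/.cache/huggingface/modules", "mystic.the-eye.eu"):
        print(hit)
```

If the search prints nothing, the script may be fetched from a remote source at load time rather than read from a local cache, in which case editing the cloned repository alone would not change the URL that is used.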

Issue Analytics

  • State: closed
  • Created a year ago
  • Comments: 5 (3 by maintainers)

Top GitHub Comments

1 reaction
albertvillanova commented, Jul 22, 2022

Thanks @TrentBrick for the suggestion about improving our docs: we should definitely do this if you find they are not clear enough.

Currently, our docs explain how to load a dataset from a local loading script here: Load > Local loading script

I’ve opened an issue here:

Feel free to add any further explanation, suggestion, or requirement related to this problem as a comment there.
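The local-loading-script route mentioned in this comment can be sketched as follows: copy the dataset's loading script, swap the broken host for the working one, and point the library at the patched copy. The helper names `replace_host` and `patch_script` and the file names in the comments are hypothetical. The final `load_dataset` call is left as a comment because it assumes a `datasets` version that still accepts a path to a local loading script.

```python
from pathlib import Path


def replace_host(script_text, old_host, new_host):
    """Swap every https://old_host/ prefix in the script text for https://new_host/."""
    return script_text.replace(f"https://{old_host}/", f"https://{new_host}/")


def patch_script(src, dst, old_host="mystic.the-eye.eu", new_host="the-eye.eu"):
    """Write a patched copy of `src` to `dst` and return how many URLs were rewritten."""
    text = Path(src).read_text(encoding="utf-8")
    count = text.count(f"https://{old_host}/")
    Path(dst).write_text(replace_host(text, old_host, new_host), encoding="utf-8")
    return count


# Hypothetical usage, assuming a local copy of the loading script exists:
#   patch_script("the_pile_original.py", "the_pile.py")
#   from datasets import load_dataset
#   dataset = load_dataset("./the_pile.py")  # path to the patched local script
```

Patching a copy rather than the installed package keeps the change explicit and avoids depending on which cached or remote version of the script the library would otherwise pick up.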

0 reactions
TrentBrick commented, Jul 21, 2022

Thanks for the quick reply and help too


