the_pile dataset URL broken.
The pull request https://github.com/huggingface/datasets/pull/3627 changed the EleutherAI Pile dataset URL from https://the-eye.eu/ to https://mystic.the-eye.eu/, but the latter is now broken and the former works again.
Note that when I git clone the repo, run pip install -e ., and then edit the URL back, the codebase doesn't seem to pick up this edit, so the mystic URL must also be cached somewhere else that I can't find?
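One plausible explanation is that the `datasets` library copies dataset loading scripts into a local modules cache, so an edited script in the cloned repo is shadowed by the cached copy. A minimal sketch of clearing that cache, assuming the default location `~/.cache/huggingface/modules/datasets_modules` (the path, and the `the_pile` directory name, are assumptions, not confirmed in this issue):

```python
# Sketch: remove cached copies of a dataset loading script so that an
# edited local copy is used on the next load. The cache location below
# is an assumed default of the `datasets` library, not verified here.
import shutil
from pathlib import Path

MODULES_CACHE = Path.home() / ".cache" / "huggingface" / "modules" / "datasets_modules"

def clear_cached_script(dataset_name: str) -> list:
    """Delete cached directories matching `dataset_name`; return what was removed."""
    removed = []
    if MODULES_CACHE.exists():
        for entry in MODULES_CACHE.rglob(dataset_name):
            if entry.is_dir():
                shutil.rmtree(entry)
                removed.append(str(entry))
    return removed

# e.g. clear_cached_script("the_pile") before re-running load_dataset
```

If the cached copy is the culprit, deleting it and re-running the load should make the library re-use the edited script.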
Issue Analytics
- State:
- Created: a year ago
- Comments: 5 (3 by maintainers)
Top Results From Across the Web
- An 800GB Dataset of Diverse Text for Language Modeling: "To address this need, we introduce the Pile: an 825.18 GiB English text dataset designed for training large-scale language models. The..."
- [R] The Pile: An 800GB Dataset of Diverse Text for Language ...: "EleutherAI is proud to announce the release of the Pile, a free and publicly available 800GB dataset of diverse English text for language ..."
- The Pile: An 800GB Dataset of Diverse Text for ... (arXiv Vanity): "Detailed information about the construction of each dataset is available in Appendix C. 2.1 Pile-CC. Common Crawl is a collection of website crawls..."
- arXiv:2101.00027v1 [cs.CL] 31 Dec 2020: "The Pile: An 800GB Dataset of Diverse Text for Language Modeling ... Common Crawl is a collection of website crawls ... We..."
- EleutherAI/gpt-j-6B (Hugging Face): "Training data. GPT-J 6B was trained on the Pile, a large-scale curated dataset created by EleutherAI."
Read more >Top Related Medium Post
No results found
Top Related StackOverflow Question
No results found
Troubleshoot Live Code
Lightrun enables developers to add logs, metrics and snapshots to live code - no restarts or redeploys required.
Start FreeTop Related Reddit Thread
No results found
Top Related Hackernoon Post
No results found
Top Related Tweet
No results found
Top Related Dev.to Post
No results found
Top Related Hashnode Post
No results found
Top GitHub Comments
Thanks @TrentBrick for the suggestion about improving our docs: we should definitely do this if you find they are not clear enough.
Currently, our docs explain how to load a dataset from a local loading script here: Load > Local loading script
I’ve opened an issue here:
Feel free to comment on it any additional explanation/suggestion/requirement related to this problem.
Thanks for the quick reply and help too.
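Related to the caching question above: even after editing the script in a cloned repo, a stale cached copy can still reference the old mirror. A quick stdlib check for cached loading scripts that still contain a given string, such as the broken mystic URL (the cache location is an assumed default, not confirmed in this thread):

```python
# Sketch: list cached dataset scripts whose source still contains `needle`,
# e.g. the stale mirror URL. Cache path is an assumed library default.
from pathlib import Path

MODULES_CACHE = Path.home() / ".cache" / "huggingface" / "modules" / "datasets_modules"

def cached_scripts_containing(needle: str) -> list:
    """Return paths of cached loading scripts whose source mentions `needle`."""
    hits = []
    if MODULES_CACHE.exists():
        for script in MODULES_CACHE.rglob("*.py"):
            if needle in script.read_text(errors="ignore"):
                hits.append(str(script))
    return hits

# e.g. cached_scripts_containing("mystic.the-eye.eu")
```

Any paths this returns are cached copies that would shadow the edit until they are removed.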