
the_pile datasets URL broken.


https://github.com/huggingface/datasets/pull/3627 changed the Eleuther AI Pile dataset URL from https://the-eye.eu/ to https://mystic.the-eye.eu/ but the latter is now broken and the former works again.

Note that when I git clone the repo, run pip install -e . and then edit the URL back, the codebase doesn’t seem to use this edit. Is the mystic URL also cached somewhere else that I can’t find?

Issue Analytics

Created a year ago
Comments: 5 (3 by maintainers)

Top GitHub Comments

1 reaction

albertvillanova commented, Jul 22, 2022

Thanks @TrentBrick for the suggestion about improving our docs: we should definitely do this if you find they are not clear enough.

Currently, our docs explain how to load a dataset from a local loading script here: Load > Local loading script

I’ve opened an issue here:

#4732

Feel free to add any further explanation, suggestion, or requirement related to this problem as a comment there.
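As a rough illustration of the local loading script route mentioned above, the sketch below copies a loading script, swaps the broken host for the working one, and then points load_dataset at the patched copy. The helper names (patch_loading_script, load_from_local_script) and the file paths are hypothetical and not part of the datasets library. It also assumes an installed datasets version that still supports script-based loading and streaming for this dataset.

```python
from pathlib import Path

# Hosts taken from the issue text: the mystic one is reported broken,
# the plain one is reported working again.
BROKEN_HOST = "https://mystic.the-eye.eu/"
WORKING_HOST = "https://the-eye.eu/"


def patch_loading_script(src: Path, dst: Path,
                         old: str = BROKEN_HOST,
                         new: str = WORKING_HOST) -> int:
    """Write a copy of the script at `src` to `dst` with `old` replaced by `new`.

    Returns how many occurrences of `old` were found, so the caller can
    tell whether the script actually contained the broken host.
    """
    text = src.read_text(encoding="utf-8")
    count = text.count(old)
    dst.parent.mkdir(parents=True, exist_ok=True)
    dst.write_text(text.replace(old, new), encoding="utf-8")
    return count


def load_from_local_script(script_path: Path):
    """Load a dataset from a local loading script instead of by name.

    The import is deferred so the patching helper works without `datasets`.
    Streaming is an assumption made here to avoid a full download.
    """
    from datasets import load_dataset

    return load_dataset(str(script_path), split="train", streaming=True)


# Example usage (hypothetical paths, adjust to your checkout):
# n = patch_loading_script(Path("datasets/the_pile/the_pile.py"),
#                          Path("local_scripts/the_pile/the_pile.py"))
# ds = load_from_local_script(Path("local_scripts/the_pile/the_pile.py"))
```

Passing a file path rather than the dataset name makes it explicit which script is executed, which may sidestep the situation described in the issue where edits to a cloned checkout did not seem to be picked up.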

0 reactions

TrentBrick commented, Jul 21, 2022

Thanks for the quick reply and help too
