Issue with downloading Wikipedia data for low-resource languages
Hi, I tried to download Sundanese and Javanese Wikipedia data with the following snippet:
jv_wiki = datasets.load_dataset('wikipedia', '20200501.jv', beam_runner='DirectRunner')
su_wiki = datasets.load_dataset('wikipedia', '20200501.su', beam_runner='DirectRunner')
And I get the following errors for these two languages:
Javanese:
FileNotFoundError: Couldn't find file at https://dumps.wikimedia.org/jvwiki/20200501/dumpstatus.json
Sundanese:
FileNotFoundError: Couldn't find file at https://dumps.wikimedia.org/suwiki/20200501/dumpstatus.json
I found from https://github.com/huggingface/datasets/issues/577#issuecomment-688435085 that small languages are downloaded and parsed directly from the Wikipedia dump site, but both https://dumps.wikimedia.org/jvwiki/20200501/dumpstatus.json and https://dumps.wikimedia.org/suwiki/20200501/dumpstatus.json are no longer valid.
Any suggestions on how to handle this issue? Thanks!
Thanks for reporting. I created a PR to make the custom config work (language="zh", date="20201120").
For posterity, here’s how I got the data I needed: I needed Bengali, so I checked which dumps are available at https://dumps.wikimedia.org/bnwiki/ and then ran:
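The exact command is not included in the comment above. As a rough sketch of the approach it describes (not the commenter's own code), one could first list the dump dates still hosted for a language and then pass one of them to the custom config. The helper names below are hypothetical, the regex assumes the index page links each dump as `href="YYYYMMDD/"`, and the `language=`/`date=` keywords assume a `datasets` version that includes the PR mentioned above.

```python
import re
import urllib.request

DUMPS_ROOT = "https://dumps.wikimedia.org"


def dumpstatus_url(language, date):
    """Build the dumpstatus.json URL that appears in the error messages."""
    return f"{DUMPS_ROOT}/{language}wiki/{date}/dumpstatus.json"


def parse_dump_dates(index_html):
    """Extract YYYYMMDD directory names from a dump index page.

    Assumption: each dump is linked as href="YYYYMMDD/".
    """
    return sorted(set(re.findall(r'href="(\d{8})/"', index_html)))


def list_dump_dates(language):
    """Fetch the index page for a language and return the dates found on it."""
    with urllib.request.urlopen(f"{DUMPS_ROOT}/{language}wiki/") as response:
        return parse_dump_dates(response.read().decode("utf-8"))


def load_wikipedia(language, date):
    """Load a dump through the custom config (language and date keywords)."""
    import datasets  # imported lazily; needs the Beam extra for DirectRunner

    return datasets.load_dataset(
        "wikipedia", language=language, date=date, beam_runner="DirectRunner"
    )
```

For example, `list_dump_dates("bn")` would return whatever dates are currently hosted for Bengali, and any of them could then be passed to `load_wikipedia("bn", date)`. A date that is missing from that list is expected to fail with the same FileNotFoundError reported in the question.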