Docs for Accessing Corpora
Either I’m missing something obvious (which is likely), or CLTK offers no documentation on how to use the various corpora the project provides.
After importing the greek_text_perseus corpus, for example, its README.md tells me:
This repository holds the Greek files made available by the Perseus Project. See the CLTK’s docs for instructions on how to use these files.
The docs, however, only cover how to download corpora and how to process raw text already stored in a Python variable, omitting the intermediate steps. There is no mention of how one might import a corpus after downloading it (which, I see from this external blog, seems to be a thing?), or how one might otherwise get hold of a CorpusReader object (assuming such a thing exists, which is not clear from the docs).
From all this I infer that we are intended to:
- Use CLTK to conveniently download corpora, but not to load them.
- Use NLTK or some other 3rd-party tool to load the corpora directly from the resulting text or XML files.
- Proceed as usual with NLP analysis, turning to CLTK only when we need language-specific processing capabilities at a low level.
Am I correct in piecing together this puzzle? If so, I haven’t seen such a scheme spelled out anywhere in the docs. Perhaps I am blind to something?
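To make the three-step scheme above concrete, here is a minimal sketch of the "load the files yourself" step, using only the standard library. The class name PlainTextCorpus and its methods are hypothetical (loosely modelled on the fileids/raw/words interface of NLTK's PlaintextCorpusReader), and it assumes the downloaded corpus is a directory tree of UTF-8 plain-text files. It is not CLTK's own API.

```python
import re
from pathlib import Path


class PlainTextCorpus:
    """Hypothetical stand-in for a corpus reader over a directory of text files."""

    def __init__(self, root, pattern="*.txt"):
        # Assumption: the corpus was already downloaded to `root`.
        self.root = Path(root).expanduser()
        self.pattern = pattern

    def fileids(self):
        # Paths relative to the corpus root, sorted for stable ordering.
        return sorted(
            p.relative_to(self.root).as_posix()
            for p in self.root.rglob(self.pattern)
        )

    def raw(self, fileid):
        # Full text of one file as a single string.
        return (self.root / fileid).read_text(encoding="utf-8")

    def words(self, fileid):
        # Crude tokenizer: runs of word characters. Text with decomposed
        # (combining) diacritics would need normalising first.
        return re.findall(r"\w+", self.raw(fileid))
```

Usage would look something like corpus = PlainTextCorpus("~/cltk_data/latin/text/latin_text_latin_library"), where the path is an assumption about where the download step put the files. The strings returned by raw() could then be handed to whatever language-specific processing is needed.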
Issue Analytics
- Created 6 years ago
- Reactions: 4
- Comments: 19 (8 by maintainers)
Top GitHub Comments
@SigmaX, thanks for starting this discussion. There is a lot of work that could (and should) be done to make corpora easier to work with. This is why I wrote a PlaintextCorpusReader wrapper for the CLTK Latin Library corpus (basically #2 from your list); cf. https://disiectamembra.wordpress.com/2016/08/11/working-with-the-latin-library-corpus-in-cltk/. I’ll be sure to add this functionality to the docs.
There has been some discussion here of adding more wrappers like this, esp. XMLCorpusReader wrappers for the Perseus texts (cf. https://github.com/cltk/cltk/issues/554). If there is interest, I can revisit this. I’d be happy to hear which other corpora you would like better access to as well.
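As a rough illustration of what an XML-oriented wrapper would have to do at minimum, the sketch below pulls the running text out of a TEI-style document with the standard library's ElementTree. The function names are hypothetical, and it assumes the file is well-formed XML with a body element. Real Perseus files may declare DTD entities or use other structures that need a more tolerant parser.

```python
import xml.etree.ElementTree as ET


def local_name(tag):
    """Strip an optional '{namespace}' prefix from an element tag."""
    return tag.rsplit("}", 1)[-1]


def body_text(xml_string):
    """Return the whitespace-normalised text of the first <body> element.

    Returns an empty string when no <body> element is present.
    """
    root = ET.fromstring(xml_string)
    for elem in root.iter():
        if local_name(elem.tag) == "body":
            # itertext() walks all nested text and tails in document order.
            return " ".join("".join(elem.itertext()).split())
    return ""
```

Matching on the local name rather than a fixed namespace keeps the sketch usable for both namespaced and non-namespaced TEI, at the cost of ignoring the markup (line numbers, speakers, notes) that a proper reader would want to expose.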
Also, my guess is that #3 from your list is how CLTK is used for the most part. But in the interest of a self-contained NLP workflow, I think a better-defined pipeline from corpus/data to analysis would be worth pursuing.
The helper code that Eldarion is developing on top of MyCapytain for the new Perseus will likely help with this. It will hopefully be open source in the next month or so.