
Docs for Accessing Corpora


Either I’m missing something obvious (which is likely), or CLTK offers no documentation on how to use the various corpora the project provides.

After importing the greek_text_perseus corpus, for example, its README.md tells me:

This repository holds the Greek files made available by the Perseus Project. See the CLTK’s docs for instructions on how to use these files.

The docs, however, cover only how to download corpora and how to process raw text already stored in a Python variable, omitting the intermediate steps. There is no mention of how one might import a corpus after downloading it (which, judging from this external blog, seems to be possible), or how one might otherwise get hold of a CorpusReader object (assuming such a thing exists, which is not clear from the docs).

From all this I infer that we are intended to:

1. Use CLTK to conveniently download corpora, but not to load them.
2. Use NLTK or some other 3rd-party tool to load the corpora directly from the resulting text or XML files.
3. Proceed as usual with NLP analysis, turning to CLTK only when we need language-specific processing capabilities at a low level.

Am I correct in piecing together this puzzle? If so, I haven’t seen such a scheme spelled out anywhere in the docs. Perhaps I am blind to something?
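
For concreteness, the second step in the list above might look roughly like the following sketch, which points NLTK's PlaintextCorpusReader at a directory of plain-text files. This is an illustration under assumptions, not CLTK's own API: the helper name load_plaintext_corpus is hypothetical, and a temporary directory with two tiny sample files stands in for wherever a downloaded corpus actually lives, since the thread does not say where CLTK stores it.

```python
import tempfile
from pathlib import Path

from nltk.corpus.reader.plaintext import PlaintextCorpusReader


def load_plaintext_corpus(root):
    """Hypothetical helper: wrap every .txt file under `root` in an NLTK reader."""
    # The second argument is a regular expression matched against file paths
    # relative to `root`; here it selects all .txt files.
    return PlaintextCorpusReader(str(root), r".*\.txt", encoding="utf8")


# Assumption: this temporary directory is only a stand-in for a real
# downloaded corpus directory, so that the sketch is self-contained.
with tempfile.TemporaryDirectory() as tmp:
    corpus_dir = Path(tmp)
    (corpus_dir / "caesar.txt").write_text(
        "Gallia est omnis divisa in partes tres.", encoding="utf8"
    )
    (corpus_dir / "vergil.txt").write_text("Arma virumque cano.", encoding="utf8")

    reader = load_plaintext_corpus(corpus_dir)

    # Materialize results while the files still exist on disk.
    file_ids = reader.fileids()                       # names of the matched files
    caesar_words = list(reader.words("caesar.txt"))   # tokens from one document
    vergil_raw = reader.raw("vergil.txt")             # untokenized text

print(file_ids)
print(caesar_words[:3])
```

Only fileids(), words(), and raw() are used here because they work without extra NLTK data downloads; sentence-level access through sents() relies on a sentence tokenizer model that has to be installed separately.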

Issue Analytics

Created 6 years ago
Reactions: 4
Comments: 19 (8 by maintainers)

Top GitHub Comments

2 reactions

diyclassics commented, Dec 12, 2017

@SigmaX, thanks for starting this discussion. There is a lot of work that could (and should) be done to make corpora easier to work with. This is why I wrote a PlaintextCorpusReader wrapper for the CLTK Latin Library corpus (basically, #2 from your list); cf. https://disiectamembra.wordpress.com/2016/08/11/working-with-the-latin-library-corpus-in-cltk/. I’ll be sure to add this functionality to the docs.

There has been some discussion here of adding more wrappers like this, esp. XMLCorpusReader wrappers for the Perseus texts (cf. https://github.com/cltk/cltk/issues/554). If there is interest, I can revisit this. I’d be happy to hear which other corpora you would like better access to as well.

Also, my guess is that #3 from your list is the way CLTK is used for the most part. But in the interest of a self-contained NLP workflow, I think a better-defined pipeline from corpus/data to analysis would be worth pursuing.

1 reaction

jtauber commented, Dec 14, 2017

The helper code that Eldarion is developing on top of MyCaptains for the new Perseus will likely help with this. It will hopefully be open source in the next month or so.
