Docs for Accessing Corpora

Either I’m missing something obvious (which is likely), or CLTK offers no documentation on how to use the various corpora the project provides.

After importing the greek_text_perseus corpus, for example, its README.md tells me:

"This repository holds the Greek files made available by the Perseus Project. See the CLTK’s docs for instructions on how to use these files."

The docs, however, cover only how to download corpora and how to process raw text already stored in a Python variable, omitting the intermediate steps. There is no mention of how one might import a corpus after downloading it (which, judging from this external blog, seems to be possible), or how one might otherwise get hold of a CorpusReader object (assuming such a thing exists, which is not clear from the docs).

From all this I infer that we are intended to:

  1. Use CLTK to conveniently download corpora, but not to load them.
  2. Use NLTK or some other 3rd-party tool to load the corpora directly from the resulting text or XML files.
  3. Proceed as usual with NLP analysis, turning to CLTK only when we need language-specific processing capabilities at a low level.

Am I correct in piecing together this puzzle? If so, I haven’t seen such a scheme spelled out anywhere in the docs. Perhaps I am blind to something?

Issue Analytics

  • State: closed
  • Created 6 years ago
  • Reactions: 4
  • Comments:19 (8 by maintainers)

Top GitHub Comments

2 reactions
diyclassics commented, Dec 12, 2017

@SigmaX, thanks for starting this discussion. There is a lot of work that could be (and should be) done to make corpora easier to work with. This is why I wrote a PlaintextCorpusReader wrapper for the CLTK Latin Library corpus (basically, #2 from your list); cf. https://disiectamembra.wordpress.com/2016/08/11/working-with-the-latin-library-corpus-in-cltk/. I’ll be sure to add this functionality to the docs.
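For readers who want to try approach #2 by hand, here is a minimal sketch of pointing NLTK's PlaintextCorpusReader at a directory of plain-text files. It is not the wrapper from the linked post. It assumes nltk is installed, and the helper name `load_plaintext_corpus`, the temporary directory, and the two sample files are made up for illustration; a real run would pass the directory where the downloaded corpus actually lives.

```python
import tempfile
from pathlib import Path

from nltk.corpus.reader import PlaintextCorpusReader


def load_plaintext_corpus(root):
    """Wrap every .txt file under `root` in an NLTK corpus reader.

    `load_plaintext_corpus` is a hypothetical helper, not a CLTK function.
    """
    return PlaintextCorpusReader(str(root), r".*\.txt", encoding="utf8")


# Stand-in for a downloaded corpus directory (assumption: a real corpus
# would already be on disk, so this setup step would be unnecessary).
demo_root = Path(tempfile.mkdtemp())
(demo_root / "aeneid.txt").write_text("arma virumque cano", encoding="utf8")
(demo_root / "catullus.txt").write_text("odi et amo", encoding="utf8")

reader = load_plaintext_corpus(demo_root)

# fileids() lists the files matched by the pattern; words() tokenizes one
# file with the reader's default word tokenizer.
file_ids = reader.fileids()
aeneid_words = list(reader.words("aeneid.txt"))

print(file_ids)
print(aeneid_words)
```

Only `fileids()` and `words()` are used here because they need no extra NLTK data; sentence-level access through `sents()` depends on a sentence tokenizer model that may have to be downloaded separately.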

There has been some discussion here of adding more wrappers like this, esp. XMLCorpusReader wrappers for the Perseus texts (cf. https://github.com/cltk/cltk/issues/554). If there is interest, I can revisit this. I’d be happy to hear which other corpora you would like better access to as well.
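To give a rough idea of what an XML wrapper would have to do, the sketch below pulls the text of line elements out of a small XML string using only the standard library. The sample document, its tag names, and the function `extract_lines` are all assumptions made for illustration: real Perseus files are much larger TEI documents whose structure varies from text to text, so the element path would need adjusting.

```python
import xml.etree.ElementTree as ET

# Hypothetical, heavily simplified TEI-like sample (not a real Perseus file).
SAMPLE_XML = """<TEI>
  <teiHeader><title>Sample</title></teiHeader>
  <text>
    <body>
      <l n="1">μῆνιν ἄειδε θεὰ</l>
      <l n="2">οὐλομένην</l>
    </body>
  </text>
</TEI>"""


def extract_lines(xml_string, tag="l"):
    """Return the text of every `tag` element inside the document body.

    Falls back to searching the whole document if no <body> is found.
    """
    root = ET.fromstring(xml_string)
    body = root.find(".//body")
    scope = body if body is not None else root
    # itertext() also collects text nested inside child elements.
    return ["".join(el.itertext()).strip() for el in scope.iter(tag)]


lines = extract_lines(SAMPLE_XML)
for line in lines:
    print(line)
```

Restricting the search to the body keeps header metadata such as the title out of the extracted text, which is usually what one wants before tokenizing.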

Also, my guess is that #3 from your list is how CLTK is used for the most part. But in the interest of a self-contained NLP workflow, I think a better-defined pipeline from corpus/data to analysis would be worth pursuing.

1 reaction
jtauber commented, Dec 14, 2017

The helper code that Eldarion is developing on top of MyCapytain for the new Perseus will likely help with this. It will hopefully be open source in the next month or so.
