Implement NLTK CorpusReader(s) for existing corpora
NLTK’s PlaintextCorpusReader may work for e.g. Lacus Curtius and other plaintext corpora (as is now done for The Latin Library). XMLCorpusReader may work for some XML corpora.
As an example, I’d like to be able to call .words() on all the available CorpusReader instances for a language, so I can programmatically build a comprehensive dictionary of unique words in each language. I can also imagine people wanting to do the same for sentences, and so on. See here for common NLTK corpus reader functions: http://www.nltk.org/api/nltk.corpus.html#module-nltk.corpus
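A minimal sketch of the vocabulary idea, assuming only that each reader exposes an NLTK-style words() method. The function name build_vocabulary and the choice to lowercase and drop non-alphabetic tokens are illustrative assumptions, not part of NLTK or CLTK.

```python
from typing import Iterable, Protocol, Set


class HasWords(Protocol):
    """Anything that exposes an NLTK-style ``words()`` method."""

    def words(self) -> Iterable[str]: ...


def build_vocabulary(readers: Iterable[HasWords], lowercase: bool = True) -> Set[str]:
    """Collect the unique alphabetic tokens across several corpus readers.

    Hypothetical helper: it treats every reader uniformly, so plaintext
    and XML readers can be mixed freely.
    """
    vocabulary: Set[str] = set()
    for reader in readers:
        for token in reader.words():
            if not token.isalpha():
                # Skip punctuation and numerals.
                continue
            vocabulary.add(token.lower() if lowercase else token)
    return vocabulary
```

With NLTK installed, something like build_vocabulary([PlaintextCorpusReader(root, r".*\.txt")]) should work, where root is the directory holding a downloaded corpus. A set keeps the result independent of the order in which readers are visited.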
We might want to use the existing defined lists of corpora and their attributes to do this in some programmatic way as well, i.e. if type is text and markup is plaintext, use PlaintextCorpusReader and name to construct the path. We could also load the reader instances into a Python dictionary keyed by name.
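The attribute-based selection could be sketched roughly as follows. The corpus entries, the markup key, the fileid patterns, and the helper names reader_spec and load_readers are all assumptions made for illustration; only the reader class names come from NLTK.

```python
import os
from typing import Dict, List, Optional, Tuple

# Illustrative corpus definitions. The "markup" attribute is the one
# proposed above and is assumed here, not taken from an existing list.
CORPORA: List[Dict[str, str]] = [
    {"name": "latin_text_latin_library", "type": "text", "markup": "plaintext"},
    {"name": "latin_text_perseus", "type": "text", "markup": "xml"},
    {"name": "latin_treebank_perseus", "type": "treebank", "markup": "xml"},
]

# (type, markup) -> (NLTK reader class name, fileid regex). Patterns are guesses.
READER_RULES: Dict[Tuple[str, str], Tuple[str, str]] = {
    ("text", "plaintext"): ("PlaintextCorpusReader", r".*\.txt"),
    ("text", "xml"): ("XMLCorpusReader", r".*\.xml"),
}


def reader_spec(corpus: Dict[str, str], base_dir: str) -> Optional[Tuple[str, str, str]]:
    """Return (reader class name, corpus root, fileid pattern), or None if no rule fits."""
    rule = READER_RULES.get((corpus.get("type", ""), corpus.get("markup", "")))
    if rule is None:
        return None
    class_name, pattern = rule
    return class_name, os.path.join(base_dir, corpus["name"]), pattern


def load_readers(corpora: List[Dict[str, str]], base_dir: str) -> Dict[str, object]:
    """Build a {name: reader} dictionary for every corpus present on disk."""
    # Imported lazily so the rules above can be inspected without NLTK installed.
    import nltk.corpus.reader as nltk_readers

    loaded: Dict[str, object] = {}
    for corpus in corpora:
        spec = reader_spec(corpus, base_dir)
        if spec is None:
            continue
        class_name, root, pattern = spec
        if not os.path.isdir(root):
            # Corpus not downloaded; NLTK would raise on a missing root.
            continue
        loaded[corpus["name"]] = getattr(nltk_readers, class_name)(root, pattern)
    return loaded
```

Keeping the rules in a plain dictionary means a new reader type only needs one extra entry, and corpora with no matching rule are skipped rather than raising.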
Issue Analytics
- State:
- Created 7 years ago
- Reactions:1
- Comments:14 (12 by maintainers)
Top GitHub Comments
FWIW, I experimented the other day with adding an XMLCorpusReader for the Perseus Latin texts and ran into a couple of issues. One was that XMLCorpusReader seems to work with only a single corpus fileid at a time (it is probably easy enough to write a small wrapper around it so that it can be used for all fileids in a corpus). The other was that there appears to be some weirdness in the Perseus XML which prevented it from parsing (which might be an upstream issue we need to discuss with Perseus):
See here for instances of this in the corpus: https://github.com/cltk/latin_text_perseus/search?utf8=✓&q=%26responsibility%3B
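The small wrapper mentioned in this comment might look something like the sketch below. It assumes the reader follows NLTK's XMLCorpusReader interface (fileids() and words(fileid)) and that an unparseable file surfaces as xml.etree.ElementTree.ParseError, which is what the standard-library parser raises for an undefined entity. The function name words_across_fileids and the failed parameter are hypothetical.

```python
from typing import Iterator, List, Optional
from xml.etree.ElementTree import ParseError


def words_across_fileids(reader, failed: Optional[List[str]] = None) -> Iterator[str]:
    """Yield tokens from every file in an XML corpus, one fileid at a time.

    Files that fail to parse are skipped; if a ``failed`` list is supplied,
    their fileids are appended to it so they can be reported afterwards.
    """
    for fileid in reader.fileids():
        try:
            # Materialise the tokens so a parse error cannot surface mid-file.
            tokens = list(reader.words(fileid))
        except ParseError:
            if failed is not None:
                failed.append(fileid)
            continue
        yield from tokens
```

Collecting the failing fileids, rather than stopping at the first one, gives a list of affected files that could accompany an upstream report.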
@ryanfb, I think that these are great ideas and plan to address them in time (though feel free to contribute, if you’re so inclined!). I started with the Latin Library for these practical reasons: 1. to demonstrate the usefulness of having access to the corpus reader methods, 2. to make it as easy as possible for people curious about CLTK, esp. beginners, to get up and running with something familiar, and 3. to have a common set of texts to base a series of blog posts on. I think this has worked out so far. So, yes, I think extending this functionality is a good idea; testing out XMLCorpusReader on the Perseus corpus might be a good next step.
Also, I like this attributes-based approach: if we experiment with XMLCorpusReader and Perseus, we can test a simple plaintext/XML detection setup with those two corpora.