Implement NLTK CorpusReader(s) for existing corpora

See: #32 #296

NLTK’s PlaintextCorpusReader may work for Lacus Curtius and other plaintext corpora (as is now done for The Latin Library), and XMLCorpusReader may work for some XML corpora.

As an example, I’d like to be able to call .words() on all the available CorpusReader instances for a language, so I can programmatically build a comprehensive dictionary of unique words in each language. I can also imagine people wanting to be able to do the same for sentences, and so on. See here for common NLTK corpus reader functions: http://www.nltk.org/api/nltk.corpus.html#module-nltk.corpus
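A minimal sketch of the dictionary-building idea above, assuming the readers have already been collected in a dict keyed by corpus name. `build_vocabulary` is a hypothetical helper, not part of NLTK or CLTK; it relies only on each reader exposing `.words()`, as NLTK corpus readers do.

```python
def build_vocabulary(readers, lowercase=True):
    """Collect the unique words from every reader in `readers`.

    `readers` maps a corpus name to any object with a .words() method
    (for example an NLTK PlaintextCorpusReader). Hypothetical helper.
    """
    vocabulary = set()
    for name, reader in readers.items():
        for word in reader.words():
            # Normalise case so "Arma" and "arma" count as one entry.
            vocabulary.add(word.lower() if lowercase else word)
    return vocabulary


# Possible usage with NLTK (the path below is a placeholder):
# from nltk.corpus.reader import PlaintextCorpusReader
# readers = {"latin_library": PlaintextCorpusReader("/path/to/latin_library", r".*\.txt")}
# latin_words = build_vocabulary(readers)
```

Lowercasing is a design choice for the sketch; pass `lowercase=False` to keep case distinctions. The same loop could be repeated for `.sents()` on readers that support it.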

We might want to use the existing lists of corpora and their attributes to do this programmatically as well: e.g., if type is text and markup is plaintext, use PlaintextCorpusReader and build the path from name. We could also load the reader instances into a Python dictionary keyed by name.
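The attribute-based selection could be sketched roughly as follows. The `type`, `markup`, and `name` keys come from the description above; the `READER_FOR_MARKUP` table, the file patterns, the `base_dir` layout, and both function names are assumptions made for illustration, not existing CLTK code.

```python
import os

# Assumed mapping from a corpus's markup attribute to an NLTK reader class
# name and a fileid pattern. Both columns are illustrative guesses.
READER_FOR_MARKUP = {
    "plaintext": ("PlaintextCorpusReader", r".*\.txt"),
    "xml": ("XMLCorpusReader", r".*\.xml"),
}


def reader_class_name(corpus):
    """Return the NLTK reader class name for a corpus dict, or None."""
    if corpus.get("type") != "text":
        return None
    entry = READER_FOR_MARKUP.get(corpus.get("markup"))
    return entry[0] if entry else None


def load_readers(corpora, base_dir):
    """Build a {name: reader} dict for every corpus with a known markup."""
    import nltk.corpus.reader as nltk_readers  # imported lazily; requires NLTK

    readers = {}
    for corpus in corpora:
        class_name = reader_class_name(corpus)
        if class_name is None:
            continue  # unsupported type or markup: skip rather than guess
        pattern = READER_FOR_MARKUP[corpus["markup"]][1]
        root = os.path.join(base_dir, corpus["name"])  # assumed directory layout
        readers[corpus["name"]] = getattr(nltk_readers, class_name)(root, pattern)
    return readers
```

Skipping unknown markup keeps the dictionary limited to corpora a reader can actually open; a stricter version might raise instead.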

Issue Analytics

State:
Created 7 years ago
Reactions:1
Comments:14 (12 by maintainers)

Top GitHub Comments

ryanfb commented, Aug 17, 2016

FWIW, I experimented the other day with adding an XMLCorpusReader for Perseus Latin text and ran into a couple of issues. One was that XMLCorpusReader seems to work with only a single corpus fileid at a time (it is probably easy enough to write a small wrapper so that it can be used for all fileids in a corpus). The other was that some weirdness in the Perseus XML prevented it from parsing (which might be an upstream issue we need to discuss with Perseus):

Traceback (most recent call last):
  File "<stdin>", line 1, in <module>
  File "/usr/local/lib/python3.5/site-packages/nltk/corpus/reader/xmldocs.py", line 65, in words
    elt = self.xml(fileid)
  File "/usr/local/lib/python3.5/site-packages/nltk/corpus/reader/xmldocs.py", line 48, in xml
    elt = ElementTree.parse(self.abspath(fileid).open()).getroot()
  File "/usr/local/Cellar/python3/3.5.2_1/Frameworks/Python.framework/Versions/3.5/lib/python3.5/xml/etree/ElementTree.py", line 1184, in parse
    tree.parse(source, parser)
  File "/usr/local/Cellar/python3/3.5.2_1/Frameworks/Python.framework/Versions/3.5/lib/python3.5/xml/etree/ElementTree.py", line 596, in parse
    self._root = parser._parse_whole(source)
xml.etree.ElementTree.ParseError: undefined entity &responsibility;: line 14, column 0

See here for instances of this in the corpus: https://github.com/cltk/latin_text_perseus/search?utf8=✓&q=%26responsibility%3B
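One possible workaround for the ParseError above, sketched with the standard library only: `xml.etree.ElementTree.XMLParser` exposes an `entity` dict that is consulted for entities the parser does not know. This is an assumption-laden sketch, not a confirmed fix for the Perseus files. It should only help when the document declares a DOCTYPE with an external DTD (otherwise expat rejects the entity before the lookup happens), and the replacement text is a placeholder, since the intended expansion of `&responsibility;` is not stated here. `parse_with_entities` is a hypothetical helper name.

```python
import xml.etree.ElementTree as ET


def parse_with_entities(xml_text, entities):
    """Parse xml_text, expanding otherwise-undefined entities from `entities`.

    `entities` maps entity names (without & and ;) to replacement text.
    """
    parser = ET.XMLParser()
    # Register replacements for entities the document uses but never defines.
    parser.entity.update(entities)
    return ET.fromstring(xml_text, parser=parser)


# Illustrative document; the DOCTYPE and element names are made up.
sample = (
    '<!DOCTYPE TEI.2 SYSTEM "tei2.dtd">\n'
    "<TEI.2><resp>&responsibility;</resp></TEI.2>"
)
root = parse_with_entities(sample, {"responsibility": "[responsibility]"})
```

Plugging this into XMLCorpusReader would still need a subclass or wrapper that overrides how each file is parsed, which is not shown here.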

diyclassics commented, Aug 17, 2016

@ryanfb, I think that these are great ideas and plan to address them in time (though feel free to contribute, if you’re so inclined!). I started with the Latin Library for these practical reasons: 1. to demonstrate the usefulness of having access to the corpus reader methods, 2. to make it as easy as possible for people curious about CLTK, esp. beginners, to get up and running with something familiar, and 3. to have a common set of texts to base a series of blog posts on. I think this has worked out so far. So, yes, I think extending this functionality is a good idea, and testing out XMLCorpusReader on the Perseus corpus might be a good next step.

Also, I like this attributes-based approach—if we experiment with XMLCorpusReader and Perseus, we can test a simple plaintext/xml detection setup with those two corpora.
