Implement NLTK CorpusReader(s) for existing corpora

See: #32 #296

NLTK’s PlaintextCorpusReader may work for plaintext corpora such as Lacus Curtius (as is now done for The Latin Library). XMLCorpusReader may work for some XML corpora.

As an example, I’d like to be able to call .words() on all the available CorpusReader instances for a language, so I can programmatically build a comprehensive dictionary of unique words in each language. I can also imagine people wanting to be able to do the same for sentences, and so on. See here for common NLTK corpus reader functions: http://www.nltk.org/api/nltk.corpus.html#module-nltk.corpus
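A minimal sketch of the vocabulary-building idea above, assuming each reader exposes an NLTK-style `.words()` method that yields string tokens. `unique_words` and `ListReader` are hypothetical names used only for illustration; `ListReader` is an in-memory stand-in so the example runs without NLTK or any corpus installed.

```python
from typing import Iterable, Protocol, Set


class HasWords(Protocol):
    """Anything with an NLTK-style .words() method, e.g. a PlaintextCorpusReader."""

    def words(self) -> Iterable[str]: ...


def unique_words(readers: Iterable[HasWords], lowercase: bool = True) -> Set[str]:
    """Collect the set of distinct tokens across several corpus readers."""
    vocab: Set[str] = set()
    for reader in readers:
        for token in reader.words():
            vocab.add(token.lower() if lowercase else token)
    return vocab


class ListReader:
    """Hypothetical stand-in for a corpus reader, backed by a list of tokens."""

    def __init__(self, tokens):
        self._tokens = list(tokens)

    def words(self):
        return iter(self._tokens)


latin_readers = [
    ListReader(["Arma", "virumque", "cano"]),
    ListReader(["cano", "Troiae"]),
]
print(sorted(unique_words(latin_readers)))
# ['arma', 'cano', 'troiae', 'virumque']
```

With real NLTK readers the token stream would also include punctuation, so a filtering step would probably be needed before treating the result as a dictionary of words.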

We might want to use the existing lists of corpora and their attributes to do this programmatically: for example, if type is text and markup is plaintext, use PlaintextCorpusReader and build the path from name. We could also load the reader instances into a Python dictionary keyed by name.
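The attribute-based dispatch could be sketched roughly as follows. Everything here is an assumption made for illustration: the corpus entries are plain dicts with `name`, `type` and `markup` keys, `base_dir` and the file patterns are made up, and `RecordingReader` is a hypothetical stand-in class so the example runs without NLTK.

```python
import os
from typing import Any, Callable, Dict, Iterable, Mapping, Tuple

# (type, markup) -> (reader class or factory, fileids regex)
ReaderSpec = Tuple[Callable[[str, str], Any], str]


def build_readers(corpora: Iterable[Mapping[str, str]],
                  specs: Mapping[Tuple[str, str], ReaderSpec],
                  base_dir: str) -> Dict[str, Any]:
    """Return {corpus name: reader} for every corpus with a known reader spec."""
    readers: Dict[str, Any] = {}
    for corpus in corpora:
        spec = specs.get((corpus.get("type"), corpus.get("markup")))
        if spec is None:
            continue  # no reader known for this type/markup combination
        reader_cls, fileids = spec
        root = os.path.join(base_dir, corpus["name"])  # path built from the name
        readers[corpus["name"]] = reader_cls(root, fileids)
    return readers


class RecordingReader:
    """Hypothetical stand-in that only records its constructor arguments."""

    def __init__(self, root, fileids):
        self.root, self.fileids = root, fileids


# Illustrative corpus entries; the third one is invented to show skipping.
corpora = [
    {"name": "latin_text_latin_library", "type": "text", "markup": "plaintext"},
    {"name": "latin_text_perseus", "type": "text", "markup": "xml"},
    {"name": "hypothetical_treebank", "type": "treebank", "markup": "xml"},
]
specs = {
    ("text", "plaintext"): (RecordingReader, r".*\.txt"),
    ("text", "xml"): (RecordingReader, r".*\.xml"),
}
readers = build_readers(corpora, specs, base_dir="corpora")
print(sorted(readers))
# ['latin_text_latin_library', 'latin_text_perseus']
```

With NLTK installed, `RecordingReader` could be swapped for `nltk.corpus.reader.PlaintextCorpusReader` and `nltk.corpus.reader.XMLCorpusReader`, both of which take a root directory and a fileids list or regex as their first two arguments.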

Issue Analytics

  • State: closed
  • Created: 7 years ago
  • Reactions: 1
  • Comments: 14 (12 by maintainers)

Top GitHub Comments

1 reaction
ryanfb commented, Aug 17, 2016

FWIW, I experimented the other day with adding an XMLCorpusReader for the Perseus Latin texts and ran into a couple of issues. One was that XMLCorpusReader seems to work with only a single fileid at a time (it is probably easy enough to write a small wrapper so that it can be used for all fileids in a corpus). The other was that there appears to be some weirdness in the Perseus XML which prevented it from parsing (which might be an upstream issue we need to discuss with Perseus):

Traceback (most recent call last):
  File "<stdin>", line 1, in <module>
  File "/usr/local/lib/python3.5/site-packages/nltk/corpus/reader/xmldocs.py", line 65, in words
    elt = self.xml(fileid)
  File "/usr/local/lib/python3.5/site-packages/nltk/corpus/reader/xmldocs.py", line 48, in xml
    elt = ElementTree.parse(self.abspath(fileid).open()).getroot()
  File "/usr/local/Cellar/python3/3.5.2_1/Frameworks/Python.framework/Versions/3.5/lib/python3.5/xml/etree/ElementTree.py", line 1184, in parse
    tree.parse(source, parser)
  File "/usr/local/Cellar/python3/3.5.2_1/Frameworks/Python.framework/Versions/3.5/lib/python3.5/xml/etree/ElementTree.py", line 596, in parse
    self._root = parser._parse_whole(source)
xml.etree.ElementTree.ParseError: undefined entity &responsibility;: line 14, column 0

See here for instances of this in the corpus: https://github.com/cltk/latin_text_perseus/search?utf8=✓&q=%26responsibility%3B
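Both issues could be worked around along these lines; this is a sketch under stated assumptions, not a tested fix for the Perseus corpus. `words_for_all_fileids` and `parse_with_entities` are hypothetical helper names. The first assumes the reader has `fileids()` and a `words(fileid)` that accepts one file identifier. The second relies on the `entity` dictionary of the standard library's `XMLParser`, which is only consulted when the document carries a DOCTYPE declaration; the replacement text `[responsibility]` is a placeholder, since the real expansion lives in a DTD that is not shown here.

```python
import io
import xml.etree.ElementTree as ET
from typing import Iterator, Mapping


def words_for_all_fileids(reader) -> Iterator[str]:
    """Call reader.words(fileid) once per file and chain the results."""
    for fileid in reader.fileids():
        yield from reader.words(fileid)


def parse_with_entities(source, entities: Mapping[str, str]) -> ET.Element:
    """Parse XML, substituting the given text for otherwise undefined entities."""
    parser = ET.XMLParser()
    parser.entity.update(entities)  # entity name -> replacement text
    return ET.parse(source, parser=parser).getroot()


# Invented sample document; the DOCTYPE line matters for this workaround.
SAMPLE = (
    b'<?xml version="1.0"?>\n'
    b'<!DOCTYPE doc SYSTEM "doc.dtd">\n'
    b'<doc><resp>&responsibility;</resp><p>arma virumque</p></doc>'
)

# The default parser fails in the same way as the traceback above.
try:
    ET.parse(io.BytesIO(SAMPLE))
except ET.ParseError as err:
    print("default parser failed:", err)

root = parse_with_entities(io.BytesIO(SAMPLE), {"responsibility": "[responsibility]"})
print(root.findtext("resp"))
# [responsibility]
```

Wiring `parse_with_entities` into XMLCorpusReader itself would probably mean subclassing it and overriding its `xml` method, which is not attempted here.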

1 reaction
diyclassics commented, Aug 17, 2016

@ryanfb, I think these are great ideas and I plan to address them in time (though feel free to contribute, if you’re so inclined!). I started with the Latin Library for practical reasons: 1. to demonstrate the usefulness of having access to the corpus reader methods, 2. to make it as easy as possible for people curious about CLTK, especially beginners, to get up and running with something familiar, and 3. to have a common set of texts to base a series of blog posts on. I think this has worked out so far. So, yes, I think extending this functionality is a good idea; testing out XMLCorpusReader on the Perseus corpus might be a good next step.

Also, I like this attributes-based approach—if we experiment with XMLCorpusReader and Perseus, we can test a simple plaintext/xml detection setup with those two corpora.

