Hints for using cltk with own latin corpus

This is not really an issue but a question (I cannot find a mailing list on the webpage, sorry).

Since I’m a little lost with the way cltk works, I would appreciate some help with the following workflow: I want to work with a ‘private’ corpus in Latin (resolutions of a Catholic religious order from 1500 to 1800). My questions are:

Do I need to create a git repository for that, or is it possible to work with local files?
What exactly is the relationship between import_corpus (CorpusImporter) and the corpus objects created by nltk?
Since I only want to do some exploratory analysis of the corpus, what is the best way to use a corpus imported with cltk together with the methods provided by nltk (frequencies and so on)?

Many thanks in advance (and many thanks of course for the wonderful library).

Issue Analytics

State:
Created 6 years ago
Comments: 5 (3 by maintainers)

Top GitHub Comments

1 reaction

diyclassics commented, Jan 25, 2018

Let me add quickly to “If your data is only available on your local filesystem”: if it is a plaintext corpus, you can use NLTK’s PlaintextCorpusReader to manage the files and get out-of-the-box tokenization (paragraphs, sentences, words). Cf. https://pynlp.wordpress.com/2013/12/10/unit-5-part-ii-working-with-files-ii-the-plain-text-corpus-reader-of-nltk/. This is the basis of the Latin Library reader in cltk.corpus.latin. If it is not plaintext, you may be able to use a different NLTK reader; see http://www.nltk.org/howto/corpus.html.
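
For the exploratory use case in the question (word frequencies over local files), a minimal sketch of this approach might look like the following. The file names, the sample sentences, and the helper functions load_local_corpus and word_frequencies are made up for illustration; only PlaintextCorpusReader and FreqDist come from NLTK. The sketch sticks to words(), which uses a regex-based tokenizer; sents() and paras() additionally need NLTK's Punkt sentence tokenizer data, whose default model is trained on English, so they may need a different sentence tokenizer for Latin.

```python
import tempfile
from pathlib import Path

from nltk import FreqDist
from nltk.corpus.reader.plaintext import PlaintextCorpusReader


def load_local_corpus(root):
    """Wrap every .txt file directly under `root` in an NLTK plaintext reader."""
    return PlaintextCorpusReader(str(root), r".*\.txt", encoding="utf-8")


def word_frequencies(reader):
    """Count lower-cased alphabetic tokens across all files in the corpus."""
    return FreqDist(w.lower() for w in reader.words() if w.isalpha())


# Throwaway directory standing in for a private local corpus (hypothetical files).
demo_root = Path(tempfile.mkdtemp())
(demo_root / "resolutio_1.txt").write_text(
    "Gallia est omnis divisa in partes tres.\n", encoding="utf-8"
)
(demo_root / "resolutio_2.txt").write_text(
    "Omnis ars est imitatio naturae.\n", encoding="utf-8"
)

corpus = load_local_corpus(demo_root)
freqs = word_frequencies(corpus)

print(corpus.fileids())       # names of the files the reader picked up
print(freqs.most_common(5))   # most frequent word forms with their counts
```

No git repository is involved here: the reader only needs a directory path and a pattern for the file names. The counts are over surface word forms, so inflected forms of the same Latin word are counted separately unless the tokens are lemmatized first.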

0 reactions

todd-cook commented, May 22, 2018

Looks like the issue has been resolved. We may want to update the documentation though.
