Hints for using CLTK with my own Latin corpus
This is not really an issue, but a question (I could not find a mailing list on the website, sorry).
Since I’m a little lost with the way cltk works, I would appreciate some help with the following workflow: I want to work with a ‘private’ corpus in Latin (resolutions of a Catholic religious order from 1500 to 1800). My questions are:
- do I need to create a git repository for that? Or is there a possibility to work with local files?
- what exactly is the relationship between import_corpus (CorpusImporter) and the corpus objects created by nltk?
- since I only want to do some exploratory analysis of the corpus, what is the best way to use a corpus imported with cltk together with the methods provided by nltk (frequencies, and so on)?
Many thanks in advance (and many thanks of course for the wonderful library).
Issue Analytics
- State:
- Created 6 years ago
- Comments: 5 (3 by maintainers)
Top GitHub Comments
Let me add quickly to “If your data is only available on your local filesystem”: if it is a plaintext corpus, you can use NLTK’s PlaintextCorpusReader to manage the files and get out-of-the-box tokenization (paragraphs, sentences, words). Cf. https://pynlp.wordpress.com/2013/12/10/unit-5-part-ii-working-with-files-ii-the-plain-text-corpus-reader-of-nltk/. This is the basis of the Latin Library reader in cltk.corpus.latin. If it is not plaintext, you may be able to use a different NLTK reader; see http://www.nltk.org/howto/corpus.html.
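As a rough illustration of the PlaintextCorpusReader suggestion, here is a minimal sketch that reads local .txt files and counts word frequencies with NLTK's FreqDist. The file names, the sample Latin text, and the helper functions load_local_corpus and word_frequencies are made up for the example. It relies on NLTK's default word tokenizer, which is not Latin-specific, so treat the counts as a first exploratory pass.

```python
import tempfile
from pathlib import Path

from nltk import FreqDist
from nltk.corpus.reader.plaintext import PlaintextCorpusReader


def load_local_corpus(root):
    """Wrap every .txt file under `root` in an NLTK corpus reader (hypothetical helper)."""
    return PlaintextCorpusReader(str(root), r".*\.txt", encoding="utf-8")


def word_frequencies(corpus):
    """Count lowercased alphabetic tokens across the whole corpus (hypothetical helper)."""
    return FreqDist(w.lower() for w in corpus.words() if w.isalpha())


# Hypothetical sample files standing in for the private corpus;
# in practice, point load_local_corpus at your own directory instead.
corpus_dir = Path(tempfile.mkdtemp())
(corpus_dir / "resolutio_1500.txt").write_text(
    "In nomine Domini. Capitulum generale decrevit.", encoding="utf-8"
)
(corpus_dir / "resolutio_1501.txt").write_text(
    "Capitulum provinciale in nomine Domini convenit.", encoding="utf-8"
)

corpus = load_local_corpus(corpus_dir)
freqs = word_frequencies(corpus)

print(corpus.fileids())       # the files the reader picked up
print(freqs.most_common(3))   # most frequent word forms
```

No git repository is involved here: the reader only needs a directory path. Note that this sketch uses only words(); as far as I know, sents() and paras() additionally need NLTK's Punkt sentence tokenizer data, whose default model is not trained on Latin.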
Looks like the issue has been resolved. We may want to update the documentation though.