Use XMLReader for Open Greek and Latin
I forked the First1KGreek repo and added it to the Greek corpora; its texts are in TEI XML.
At the suggestion of @diyclassics I have looked into using NLTK’s XML corpus reader. Here’s what I see:
In [1]: from nltk.corpus.reader import XMLCorpusReader
In [2]: f = '/Users/kyle.p.johnson/cltk_data/greek/text/greek_text_first1kgreek/
...: data/tlg0015/tlg001/tlg0015.tlg001.opp-grc1.xml'
In [3]: import os
In [4]: path, name = os.path.split(f)
In [5]: path
Out[5]: '/Users/kyle.p.johnson/cltk_data/greek/text/greek_text_first1kgreek/data/tlg0015/tlg001'
In [6]: name
Out[6]: 'tlg0015.tlg001.opp-grc1.xml'
In [7]: reader = XMLCorpusReader(path, name)
In [8]: reader.raw()[:100]
Out[8]: '<?xml version="1.0" encoding="UTF-8"?>\n<?xml-model href="http://www.stoa.org/epidoc/schema/latest/te'
In [9]: reader.raw()[:200]
Out[9]: '<?xml version="1.0" encoding="UTF-8"?>\n<?xml-model href="http://www.stoa.org/epidoc/schema/latest/tei-epidoc.rng" schematypens="http://relaxng.org/ns/structure/1.0"?>\n<TEI xmlns="http://www.tei-c.org/'
In [10]: reader.words()[:10]
Out[10]:
['Ab',
'excessu',
'divi',
'Marci',
'Herodian',
'Immanuel',
'Bekker',
'European',
'Social',
'Fund']
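Since .words() returns the header metadata along with the text (the first ten tokens above are the title, author, editor, and funder), we will need something else.
What do others think about parsing TEI XML? Do we need to parse it ourselves using Beautiful Soup or the core xml library?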
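For the "core xml library" option, here is a minimal sketch of what I have in mind, not a finished reader. It assumes the files declare the standard TEI P5 namespace (http://www.tei-c.org/ns/1.0) and keep the work itself under text/body, so the teiHeader metadata is never visited. The function name and the miniature sample document are made up for illustration.

```python
import xml.etree.ElementTree as ET

# Standard TEI P5 namespace; assumed to match the one declared in the corpus files.
TEI_NS = {"tei": "http://www.tei-c.org/ns/1.0"}

# Hypothetical miniature TEI document, for illustration only.
SAMPLE = """<?xml version="1.0" encoding="UTF-8"?>
<TEI xmlns="http://www.tei-c.org/ns/1.0">
  <teiHeader>
    <fileDesc><titleStmt><title>Ab excessu divi Marci</title></titleStmt></fileDesc>
  </teiHeader>
  <text>
    <body>
      <div type="edition"><p>μῆνιν ἄειδε θεά</p></div>
    </body>
  </text>
</TEI>
"""


def tei_body_words(xml_source):
    """Return whitespace-separated tokens found under <text>/<body> only."""
    root = ET.fromstring(xml_source)
    # Select the body explicitly so nothing from <teiHeader> is included.
    body = root.find("tei:text/tei:body", TEI_NS)
    if body is None:
        return []
    # itertext() yields every text node below <body>, in document order.
    return " ".join(body.itertext()).split()


words = tei_body_words(SAMPLE.encode("utf-8"))
print(words[:10])
```

For a file on disk, `ET.parse(path).getroot()` would take the place of `ET.fromstring`. Splitting on whitespace is only a placeholder for a real tokenizer.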
Issue Analytics
- State:
- Created 6 years ago
- Comments:11 (7 by maintainers)
Another update: I have been forwarded this code, which I’ll use to process the XML: https://github.com/Capitains/HookTest/blob/master/HookTest/build.py#L91
Update: This repo actually already has plain text files in it: https://github.com/cltk/First1KGreek/tree/1.1.1603/text
However, an XML reader would still be desirable to users, I imagine.