Use XMLReader for Open Greek and Latin

I forked the First1KGreek repo, whose texts are encoded in TEI XML, and added it to the Greek corpora.

At the suggestion of @diyclassics, I have looked into using NLTK’s XML corpus reader. Here’s what I see:

In [1]: from nltk.corpus.reader import XMLCorpusReader

In [2]: f = '/Users/kyle.p.johnson/cltk_data/greek/text/greek_text_first1kgreek/
   ...: data/tlg0015/tlg001/tlg0015.tlg001.opp-grc1.xml'

In [3]: import os

In [4]: path, name = os.path.split(f)

In [5]: path
Out[5]: '/Users/kyle.p.johnson/cltk_data/greek/text/greek_text_first1kgreek/data/tlg0015/tlg001'

In [6]: name
Out[6]: 'tlg0015.tlg001.opp-grc1.xml'

In [7]: reader = XMLCorpusReader(path, name)

In [8]: reader.raw()[:100]
Out[8]: '<?xml version="1.0" encoding="UTF-8"?>\n<?xml-model href="http://www.stoa.org/epidoc/schema/latest/te'

In [9]: reader.raw()[:200]
Out[9]: '<?xml version="1.0" encoding="UTF-8"?>\n<?xml-model href="http://www.stoa.org/epidoc/schema/latest/tei-epidoc.rng" schematypens="http://relaxng.org/ns/structure/1.0"?>\n<TEI xmlns="http://www.tei-c.org/'

In [10]: reader.words()[:10]
Out[10]: 
['Ab',
 'excessu',
 'divi',
 'Marci',
 'Herodian',
 'Immanuel',
 'Bekker',
 'European',
 'Social',
 'Fund']

Since .words() returns tokens from the whole document, including the TEI header metadata, we will need something else.

What do others think about parsing TEI XML? Do we need to parse it ourselves using Beautiful Soup or the standard-library xml module?
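To make the standard-library option concrete, here is a minimal sketch, not a finished reader. It assumes the texts use the standard TEI namespace `http://www.tei-c.org/ns/1.0` (the namespace in the output above is cut off) and that the text we want sits under a `<body>` element. The function name `tei_body_words` and the sample document are made up for illustration, and the tokenization is a plain whitespace split rather than NLTK's tokenizer.

```python
import xml.etree.ElementTree as ET

# Assumption: the standard TEI namespace; the output above is truncated.
TEI_NS = {"tei": "http://www.tei-c.org/ns/1.0"}


def tei_body_words(xml_string):
    """Return whitespace-separated tokens found under the TEI <body> only."""
    root = ET.fromstring(xml_string)
    # Skip <teiHeader> entirely by starting from the first <body> element.
    body = root.find(".//tei:body", TEI_NS)
    if body is None:
        return []
    # itertext() yields every text node below <body>, in document order.
    return " ".join(body.itertext()).split()


# Hypothetical miniature document, used only to illustrate the idea.
SAMPLE = (
    '<TEI xmlns="http://www.tei-c.org/ns/1.0">'
    "<teiHeader><title>Ab excessu divi Marci</title></teiHeader>"
    "<text><body><p>alpha beta</p><p>gamma</p></body></text>"
    "</TEI>"
)

words = tei_body_words(SAMPLE)
```

With this approach the header title never reaches `words`, because only the subtree under `<body>` is read. For a file on disk, `ET.parse(path).getroot()` could replace the `ET.fromstring` call.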

Top GitHub Comments

kylepjohnson commented, Jul 12, 2017

Another update: I have been forwarded this code, which I’ll use to process the XML: https://github.com/Capitains/HookTest/blob/master/HookTest/build.py#L91

kylepjohnson commented, Jul 12, 2017

Update: This repo already has plain text files in it: https://github.com/cltk/First1KGreek/tree/1.1.1603/text

However, I imagine an XML reader would still be desirable to users.
