Use XMLReader for Open Greek and Latin

I forked the First1KGreek repo, whose texts are in TEI XML, and added it to the Greek corpora.

At the suggestion of @diyclassics, I have looked into using NLTK’s XML corpus reader. Here’s what I see:

In [1]: from nltk.corpus.reader import XMLCorpusReader

In [2]: f = '/Users/kyle.p.johnson/cltk_data/greek/text/greek_text_first1kgreek/
   ...: data/tlg0015/tlg001/tlg0015.tlg001.opp-grc1.xml'

In [3]: import os

In [4]: path, name = os.path.split(f)

In [5]: path
Out[5]: '/Users/kyle.p.johnson/cltk_data/greek/text/greek_text_first1kgreek/data/tlg0015/tlg001'

In [6]: name
Out[6]: 'tlg0015.tlg001.opp-grc1.xml'

In [7]: reader = XMLCorpusReader(path, name)

In [8]: reader.raw()[:100]
Out[8]: '<?xml version="1.0" encoding="UTF-8"?>\n<?xml-model href="http://www.stoa.org/epidoc/schema/latest/te'

In [9]: reader.raw()[:200]
Out[9]: '<?xml version="1.0" encoding="UTF-8"?>\n<?xml-model href="http://www.stoa.org/epidoc/schema/latest/tei-epidoc.rng" schematypens="http://relaxng.org/ns/structure/1.0"?>\n<TEI xmlns="http://www.tei-c.org/'

In [10]: reader.words()[:10]
Out[10]: 
['Ab',
 'excessu',
 'divi',
 'Marci',
 'Herodian',
 'Immanuel',
 'Bekker',
 'European',
 'Social',
 'Fund']

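Since .words() tokenizes every text node in the file, including header metadata such as the title, author, and editor, we will need something else to get at the text itself.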

What do others think about parsing TEI XML? Do we need to parse it ourselves using Beautiful Soup or the core xml library?
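For the core xml library route, here is a minimal sketch of what that could look like, not a tested CLTK implementation. It assumes the files use the standard TEI namespace http://www.tei-c.org/ns/1.0 and keep the work itself under a single <body> element. The function name tei_body_text and the sample document (including its placeholder Greek) are hypothetical and only there for illustration.

```python
import xml.etree.ElementTree as ET

# Assumption: the standard TEI P5 namespace.
TEI_NS = "{http://www.tei-c.org/ns/1.0}"


def tei_body_text(root):
    """Return the text under the TEI <body>, ignoring <teiHeader> metadata.

    `root` is the parsed <TEI> element. Hypothetical helper, not part of
    NLTK or CLTK.
    """
    body = root.find(f".//{TEI_NS}body")
    if body is None:
        return ""
    # Join text nodes with spaces, then collapse runs of whitespace.
    # Caveat: this also puts a space wherever inline markup splits a word.
    return " ".join(" ".join(body.itertext()).split())


# Hypothetical miniature document; the body text is placeholder Greek.
SAMPLE = """<TEI xmlns="http://www.tei-c.org/ns/1.0">
  <teiHeader>
    <fileDesc>
      <titleStmt>
        <title>Ab excessu divi Marci</title>
        <author>Herodian</author>
      </titleStmt>
    </fileDesc>
  </teiHeader>
  <text>
    <body>
      <div>
        <p>λόγος πρῶτος</p>
        <p>λόγος δεύτερος</p>
      </div>
    </body>
  </text>
</TEI>"""

text = tei_body_text(ET.fromstring(SAMPLE))
words = text.split()
```

For a file on disk, the same function should work with ET.parse(f).getroot(), where f is the path from the session above. Splitting on whitespace is only a stand-in for a real tokenizer.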

Issue Analytics

  • State: closed
  • Created 6 years ago
  • Comments: 11 (7 by maintainers)

Top GitHub Comments

1 reaction
kylepjohnson commented, Jul 12, 2017

Another update: I have been forwarded this code, which I’ll use to process the XML: https://github.com/Capitains/HookTest/blob/master/HookTest/build.py#L91

1 reaction
kylepjohnson commented, Jul 12, 2017

Update: This repo actually already has plain text files in it: https://github.com/cltk/First1KGreek/tree/1.1.1603/text

However, I imagine an XML reader would still be desirable to users.
