Updating conllu library

Hi Dan, I see in the code and in #5 that updating the conllu library is on the agenda.

I have made a few modifications in my fork of UDify. From what I understand, parser.py contains source code copied from the conllu library with a few modifications, mainly to handle multiword tokens (MWTs). The desired output (example from fr_gsd-ud-train.conllu) looks like:

multiword_ids ['3-4', '72-73', '87-88', '105-106', '110-111', '121-122']
multiword_forms ['du', 'des', 'des', 'des', 'du', 'du']

In my fork, I still use the conllu library to parse the annotations, but handle the MWTs in a subsequent step, in a process_MWTs function. I confirmed that this version produces the same output:

multiword_ids ['3-4', '72-73', '87-88', '105-106', '110-111', '121-122']
multiword_forms ['du', 'des', 'des', 'des', 'du', 'du']
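
A minimal sketch of what such a post-processing step could look like, not the code from the fork. It assumes each token is a dict with "id" and "form" keys and that a multiword token carries a range id as a tuple such as (3, '-', 4), which is how conllu reports them according to the comments further down. The name process_mwts and the sample sentence are hypothetical.

```python
from typing import Any, Dict, List, Tuple

Token = Dict[str, Any]


def process_mwts(tokens: List[Token]) -> Tuple[List[str], List[str]]:
    """Collect the ids and forms of the multiword tokens in one sentence."""
    multiword_ids: List[str] = []
    multiword_forms: List[str] = []
    for token in tokens:
        token_id = token["id"]
        # A range id such as (3, "-", 4) marks a multiword token.
        if isinstance(token_id, tuple) and token_id[1] == "-":
            multiword_ids.append(f"{token_id[0]}-{token_id[2]}")
            multiword_forms.append(token["form"])
    return multiword_ids, multiword_forms


# Hypothetical sentence in which "du" spans the words "de" and "le".
sentence = [
    {"id": 1, "form": "Il"},
    {"id": 2, "form": "parle"},
    {"id": (3, "-", 4), "form": "du"},
    {"id": 3, "form": "de"},
    {"id": 4, "form": "le"},
    {"id": 5, "form": "film"},
]
ids, forms = process_mwts(sentence)
print("multiword_ids", ids)
print("multiword_forms", forms)
```

Keeping this step separate from parsing is what lets the parser itself come straight from the library.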

I have run a few more checks to make sure the data is the same. Here, updated is the forked version and original is the current version, e.g.:

cat fr_gsd_original/vocabulary/tokens.txt | md5sum
e80f1f1e341fc5734c8f3a3d1c779c55 
cat fr_gsd_updated/vocabulary/tokens.txt | md5sum
e80f1f1e341fc5734c8f3a3d1c779c55

There are a few benefits I can see from this:

  1. Supports the most recent conllu library.
  2. Reduces the amount of code needed in parser.py.

There are probably more elegant ways of handling MWTs, but I thought I’d post it here in case you find it helpful. If you do, I can run more tests and, once I have confirmed that the behaviour is exactly the same, submit a PR.

Issue Analytics

  • State: closed
  • Created 4 years ago
  • Reactions: 1
  • Comments: 5 (5 by maintainers)

Top GitHub Comments

1 reaction
Hyperparticle commented, May 22, 2020

Finally got around to reviewing your changes. Looks great to me.

1 reaction
jbrry commented, Mar 17, 2020

Thanks. I made a few other small changes: conllu returns the ids of elided tokens and multiword tokens as tuples, e.g. (8, '.', 1) and (105, '-', 106) respectively. I had to add another check, which also sets the token id to None when the token is an elided token.
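
A rough sketch of that kind of check, under the assumption that tokens are dicts with an "id" key and that tuple ids have the three-part shape shown above. The helper names id_kind and clear_non_word_ids are hypothetical and are not taken from the fork.

```python
from typing import Any, Dict, List

Token = Dict[str, Any]


def id_kind(token_id: Any) -> str:
    """Classify an id as 'word', 'multiword' or 'elided'."""
    if isinstance(token_id, tuple) and len(token_id) == 3:
        if token_id[1] == "-":
            return "multiword"  # e.g. (105, "-", 106)
        if token_id[1] == ".":
            return "elided"  # e.g. (8, ".", 1)
    return "word"


def clear_non_word_ids(tokens: List[Token]) -> List[Token]:
    """Return copies of the tokens with tuple ids replaced by None."""
    cleaned = []
    for token in tokens:
        if id_kind(token["id"]) == "word":
            cleaned.append(token)
        else:
            # Copy the token so the parsed input is left untouched.
            cleaned.append({**token, "id": None})
    return cleaned


# Hypothetical tokens reusing the two example ids from the comment.
tokens = [
    {"id": 8, "form": "left"},
    {"id": (8, ".", 1), "form": "went"},
    {"id": (105, "-", 106), "form": "des"},
]
cleaned = clear_non_word_ids(tokens)
print([token["id"] for token in cleaned])
```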

I have confirmed that the outputs are the same for both en_ewt and fr_gsd (fr_gsd only shown here):

jbarry@jbarry-desktop:~/udify/logs/fr_gsd/fr_gsd_updated/vocabulary$ cat feats.txt head_tags.txt lemmas.txt token_characters.txt tokens.txt upos.txt xpos.txt | wc -l
44020
jbarry@jbarry-desktop:~/udify/logs/fr_gsd/fr_gsd_updated/vocabulary$ cat feats.txt head_tags.txt lemmas.txt token_characters.txt tokens.txt upos.txt xpos.txt | md5sum
c489d1a0890b84e6f30272feca0905f2

jbarry@jbarry-desktop:~/udify/logs/fr_gsd/fr_gsd_original/vocabulary$ cat feats.txt head_tags.txt lemmas.txt token_characters.txt tokens.txt upos.txt xpos.txt | wc -l
44020
jbarry@jbarry-desktop:~/udify/logs/fr_gsd/fr_gsd_original/vocabulary$ cat feats.txt head_tags.txt lemmas.txt token_characters.txt tokens.txt upos.txt xpos.txt | md5sum
c489d1a0890b84e6f30272feca0905f2
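
The same comparison can be scripted rather than run by hand. This is a sketch using only the Python standard library; the function names are hypothetical, and the file list simply mirrors the cat commands above. Hashing the files in a fixed order matches what piping cat into md5sum does.

```python
import hashlib
from pathlib import Path
from typing import Iterable, Sequence

VOCAB_FILES = [
    "feats.txt", "head_tags.txt", "lemmas.txt", "token_characters.txt",
    "tokens.txt", "upos.txt", "xpos.txt",
]


def combined_md5(paths: Iterable[Path]) -> str:
    """MD5 of the files' bytes concatenated in the given order."""
    digest = hashlib.md5()
    for path in paths:
        with open(path, "rb") as handle:
            # Read in chunks so large vocabulary files are not loaded whole.
            for chunk in iter(lambda: handle.read(65536), b""):
                digest.update(chunk)
    return digest.hexdigest()


def vocabularies_match(original_dir: str, updated_dir: str,
                       names: Sequence[str] = VOCAB_FILES) -> bool:
    """True if both vocabulary directories hash to the same digest."""
    original = combined_md5(Path(original_dir) / name for name in names)
    updated = combined_md5(Path(updated_dir) / name for name in names)
    return original == updated
```

For example, vocabularies_match("fr_gsd_original/vocabulary", "fr_gsd_updated/vocabulary") would return True for the runs shown above, given that their digests are identical.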

I will do a full run on en_ewt and fr_gsd, use the predictor to make sure everything still works as normal there, and then submit the PR!
