Updating conllu library
Hi Dan, I see in the code and in #5 that updating the conllu library is on the agenda.
I have made a few modifications in my forked version of UDify. From what I understand, parser.py contains source code from the conllu library with a few modifications, mainly to handle multi-word tokens (MWTs). The desired output (example from fr_gsd-ud-train.conllu) looks like:
multiword_ids ['3-4', '72-73', '87-88', '105-106', '110-111', '121-122']
multiword_forms ['du', 'des', 'des', 'des', 'du', 'du']
In my forked version, I still use the conllu library to return the annotation, but do the MWT processing in a subsequent step, in a process_MWTs function. With this version, I confirmed that the outputs are the same:
multiword_ids ['3-4', '72-73', '87-88', '105-106', '110-111', '121-122']
multiword_forms ['du', 'des', 'des', 'des', 'du', 'du']
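A post-processing step of this kind could look roughly like the sketch below. It is not the code from the fork: the function name process_mwts_sketch, the helper is_multiword_id, and the sample sentence are all made up for illustration. It only assumes that each token is a dict-like object with "id" and "form" keys and that multiword token ids arrive as tuples such as (3, '-', 4).

```python
from typing import Any, Dict, List, Tuple


def is_multiword_id(token_id: Any) -> bool:
    """True for tuple ids of the form (start, '-', end)."""
    return (
        isinstance(token_id, tuple)
        and len(token_id) == 3
        and token_id[1] == "-"
    )


def process_mwts_sketch(
    annotation: List[Dict[str, Any]],
) -> Tuple[List[str], List[str]]:
    """Collect multiword token ids and forms from one parsed sentence.

    Hypothetical helper: `annotation` is assumed to be a list of token
    dicts with "id" and "form" keys.
    """
    multiword_ids: List[str] = []
    multiword_forms: List[str] = []
    for token in annotation:
        token_id = token["id"]
        if is_multiword_id(token_id):
            start, _, end = token_id
            # Rebuild the "3-4" style string used in the output above.
            multiword_ids.append(f"{start}-{end}")
            multiword_forms.append(token["form"])
    return multiword_ids, multiword_forms


# Made-up sentence for illustration only, not taken from fr_gsd.
sentence = [
    {"id": 1, "form": "Le"},
    {"id": 2, "form": "toit"},
    {"id": (3, "-", 4), "form": "du"},
    {"id": 3, "form": "de"},
    {"id": 4, "form": "le"},
    {"id": 5, "form": "garage"},
]

ids, forms = process_mwts_sketch(sentence)
print("multiword_ids", ids)
print("multiword_forms", forms)
```

Keeping this as a separate pass means the parsing itself can stay entirely inside the conllu library.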
I have done a few more checks to make sure the data is the same, where updated is the forked version and original is the current version, e.g.:
cat fr_gsd_original/vocabulary/tokens.txt | md5sum
e80f1f1e341fc5734c8f3a3d1c779c55
cat fr_gsd_updated/vocabulary/tokens.txt | md5sum
e80f1f1e341fc5734c8f3a3d1c779c55
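The same comparison can be scripted, which may help when checking several vocabulary files at once. The sketch below is only an illustration: md5_of_file and files_match are hypothetical helper names, and the demo uses throwaway temporary files rather than the real fr_gsd outputs.

```python
import hashlib
import os
import tempfile


def md5_of_file(path, chunk_size=65536):
    """Return the hex MD5 digest of a file's raw bytes."""
    digest = hashlib.md5()
    with open(path, "rb") as handle:
        # Read in chunks so large vocabulary files need not fit in memory.
        for chunk in iter(lambda: handle.read(chunk_size), b""):
            digest.update(chunk)
    return digest.hexdigest()


def files_match(path_a, path_b):
    """True when both files have identical MD5 digests."""
    return md5_of_file(path_a) == md5_of_file(path_b)


# Demo with temporary files standing in for the two tokens.txt outputs.
with tempfile.TemporaryDirectory() as tmp:
    original = os.path.join(tmp, "original_tokens.txt")
    updated = os.path.join(tmp, "updated_tokens.txt")
    for path in (original, updated):
        with open(path, "w", encoding="utf-8") as handle:
            handle.write("du\ndes\n")
    print(files_match(original, updated))
```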
There are a few benefits I can see from this:
- Supports the most recent conllu library.
- Reduces the amount of code needed in parser.py.
There are probably more elegant ways of going about MWT processing, but I just thought I'd post it here in case you find it helpful. If you do, I can run more tests and, once I have confirmed that the behaviour is exactly the same, submit a PR.
Finally got around to reviewing your changes. Looks great to me.
Thanks, I made a few other small changes: conllu returns tuple objects for elided tokens and multiword tokens, e.g. (8, '.', 1) and (105, '-', 106) respectively. I had to add another check which sets the token id to None when the token is an elided token as well.
I have confirmed that the outputs are the same for both en_ewt and fr_gsd (fr_gsd only shown here):
I will do a full run on en_ewt and fr_gsd and use the predictor to make sure everything is still working as normal there and then submit the PR!
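For reference, the tuple-id check mentioned above could be sketched along these lines. This is not the code from the fork: classify_token_id and normalize_token_id are hypothetical names, and it assumes that multiword ids are set to None in the same way as elided ones, which the "as well" suggests but the thread does not show.

```python
def classify_token_id(token_id):
    """Label an id as 'multiword', 'elided', or 'regular'.

    Assumes tuple ids have the shapes quoted in the comment above:
    (105, '-', 106) for multiword tokens and (8, '.', 1) for elided tokens.
    """
    if isinstance(token_id, tuple) and len(token_id) == 3:
        separator = token_id[1]
        if separator == "-":
            return "multiword"
        if separator == ".":
            return "elided"
    return "regular"


def normalize_token_id(token_id):
    """Keep ordinary integer ids; map multiword and elided ids to None."""
    if classify_token_id(token_id) == "regular":
        return token_id
    return None


for example in (7, (8, ".", 1), (105, "-", 106)):
    print(example, classify_token_id(example), normalize_token_id(example))
```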