Merged synsets are lost in translation

Describe the bug

Wn loses some merged synsets in translation, even though the original CILI mappings correctly link the merged source synsets to the same target synset.

To Reproduce

For example, consider these two synsets in the ili-map-pwn31.tab mapping, which map to the same PWN 3.1 target:

i37881 00472688-n
i37882 00472688-n
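To find every such merge rather than checking pairs by hand, the mapping rows can be grouped by target. The following is a minimal sketch, assuming the file holds one whitespace-separated (ILI, PWN 3.1 synset) pair per line; the function name `find_merged_ilis` and the third sample row are made up for illustration.

```python
from collections import defaultdict

# Sample rows in the assumed two-column layout. The first two come from the
# issue; the last one is a made-up, unmerged row for contrast.
SAMPLE_ROWS = """\
i37881\t00472688-n
i37882\t00472688-n
i99999\t99999999-n
"""


def find_merged_ilis(lines):
    """Return {target: [ili, ...]} for targets shared by more than one ILI."""
    by_target = defaultdict(list)
    for line in lines:
        fields = line.split()
        if len(fields) != 2:
            continue  # skip blank or malformed rows
        ili_id, target = fields
        by_target[target].append(ili_id)
    # Keep only the targets that several ILIs were merged into.
    return {t: ilis for t, ilis in by_target.items() if len(ilis) > 1}


merged = find_merged_ilis(SAMPLE_ROWS.splitlines())
print(merged)  # {'00472688-n': ['i37881', 'i37882']}
```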

With Wn, the first synset (i37881) has no translation in OEWN, although it would have one if i37881 were mapped to i37882:

import wn
wnfi = wn.Wordnet("omw-fi")
ss1 = wnfi.synsets(ili="i37881")[0]
print(f"{ss1.ili.id}, {ss1.senses()}, {ss1.translate('oewn')}")

i37881, [Sense('omw-fi-baseball-00471613-n'), Sense('omw-fi-baseball--peli-00471613-n')], []

So the translation above is just the empty list ([]).

By contrast, the other merged synset translates correctly:

ss2 = wnfi.synsets(ili="i37882")[0]
print(f"{ss2.ili.id}, {ss2.senses()}, {ss2.translate('oewn')}")

i37882, [Sense('omw-fi-baseball-00474568-n')], [Synset('oewn-00472688-n')]

The same problem occurs with any other merged synsets.

Expected behavior

The first synset (i37881) should have a translation in OEWN, as it would if the CILI mapping were used as intended.
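One way to picture the expected behavior is a lookup that also tries every ILI merged into the same PWN 3.1 target. This is only a sketch and not part of Wn: `merged_siblings`, `translate_with_merges`, and the `lookup` callable are hypothetical names, and the target lexicon is replaced by a small in-memory stand-in.

```python
def merged_siblings(ili_id, ili_to_pwn):
    """Return all ILIs that share ili_id's PWN target (including itself)."""
    target = ili_to_pwn.get(ili_id)
    if target is None:
        return [ili_id]  # unknown ILI: nothing to merge with
    return sorted(i for i, t in ili_to_pwn.items() if t == target)


def translate_with_merges(ili_id, ili_to_pwn, lookup):
    """Collect target synsets for ili_id and for every ILI merged with it.

    `lookup` is any callable that maps an ILI id to a list of synsets in the
    target lexicon.
    """
    found = []
    for candidate in merged_siblings(ili_id, ili_to_pwn):
        for synset in lookup(candidate):
            if synset not in found:  # avoid duplicates, keep order
                found.append(synset)
    return found


# The two mapping rows from the issue, plus a stand-in for the OEWN lexicon.
ILI_TO_PWN = {"i37881": "00472688-n", "i37882": "00472688-n"}
FAKE_OEWN = {"i37882": ["oewn-00472688-n"]}


def fake_lookup(ili_id):
    return FAKE_OEWN.get(ili_id, [])


result = translate_with_merges("i37881", ILI_TO_PWN, fake_lookup)
```

With Wn itself, `lookup` could presumably be something like `lambda i: wn.Wordnet("oewn").synsets(ili=i)`, the same call used elsewhere in this thread, though that combination has not been tested here.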

Environment

python --version
python -m wn --version
python -m wn lexicons

Python 3.9.2
Wn 0.9.2
oewn 2021 [en] Open English WordNet
omw-en 1.4 [en] OMW English Wordnet based on WordNet 3.0
omw-cmn 1.4 [cmn-Hans] Chinese Open Wordnet
omw-es 1.4 [es] Multilingual Central Repository (Spanish)
omw-lt 1.4 [lt] Lithuanian WordNet
omw-pt 1.4 [pt] OpenWN-PT
omw-id 1.4 [id] Wordnet Bahasa (Indonesian)
omw-he 1.4 [he] Hebrew Wordnet
omw-eu 1.4 [eu] Multilingual Central Repository (Basque)
omw-sq 1.4 [sq] Albanet
omw-zsm 1.4 [zsm] Wordnet Bahasa (Malaysian)
omw-arb 1.4 [arb] Arabic WordNet (AWN v2)
omw-ca 1.4 [ca] Multilingual Central Repository (Catalan)
omw-fi 1.4 [fi] FinnWordNet
omw-sv 1.4 [sv] WordNet-SALDO
omw-gl 1.4 [gl] Multilingual Central Repository (Galician)
omw-el 1.4 [el] Greek Wordnet
omw-pl 1.4 [pl] plWordNet
omw-iwn 1.4 [it] ItalWordNet
omw-ro 1.4 [ro] Romanian Wordnet
omw-nl 1.4 [nl] Open Dutch WordNet
omw-ja 1.4 [ja] Japanese Wordnet
omw-fr 1.4 [fr] WOLF (Wordnet Libre du Français)
omw-sk 1.4 [sk] Slovak WordNet
omw-is 1.4 [is] IceWordNet
omw-it 1.4 [it] MultiWordNet (Italian)
omw-hr 1.4 [hr] Croatian Wordnet
omw-th 1.4 [th] Thai Wordnet
omw-bg 1.4 [bg] BulTreeBank Wordnet (BTB-WN)
omw-nb 1.4 [nb] Norwegian Wordnet (Bokmål)
omw-da 1.4 [da] DanNet
omw-nn 1.4 [nn] Norwegian Wordnet (Nynorsk)
omw-sl 1.4 [sl] sloWNet

Additional Context

At this moment, using the PWN sense keys for translation seems to be the only way to bypass the problem in Wn. However, this is not easy, since a rather big detour is necessary to obtain the sense keys in the ‘oewn’ lexicon.

Issue Analytics

  • State: open
  • Created 10 months ago
  • Comments: 5 (2 by maintainers)

Top GitHub Comments

1 reaction
goodmami commented, Nov 16, 2022

@ekaf I think the CILI as a resource is better thought of as the inventory of identifiers and their definitions than as a collection of mappings. The mappings to synsets should be maintained by the respective wordnet projects, although in practice we keep some mapping files in the CILI repository. Those mapping files are used when creating the WN-LMF exports of the PWN.

Let me try to describe ILI support in Wn. WN-LMF lexicons can link synsets to individual ILIs like this (example from OEWN 2021):

    <Synset id="oewn-15307914-n" ili="i117563" members="oewn-speed-n oewn-velocity-n" partOfSpeech="n" dc:subject="noun.time">
                                 ~~~~~~~~~~~~~

These ILIs are stored in Wn’s database linked to the synsets. When a second lexicon is loaded containing synsets with the same ILIs, such as this (from OMW 1.4’s Spanish wordnet):

    <Synset id="omw-es-15282696-n" ili="i117563" partOfSpeech="n" members="omw-es-velocidad-15282696-n" />
                                   ~~~~~~~~~~~~~

… then Wn is able to use the shared ILI to link the synsets across lexicons for translation or expanded relation traversal. Another thing we see is synsets with the special ILI "in", which indicates that that version of the lexicon is proposing the synset as a candidate for a new ILI. For example:

    <Synset id="oewn-90002921-n" ili="in" members="oewn-snow_day-n" partOfSpeech="n" dc:subject="noun.time" dc:source="Colloquial WordNet">
                                 ~~~~~~~~

These proposed ILIs are not used for translation or expanded relation traversals. In Wn, the ILIs are represented by a class with an id, a status, and a definition. For example (here, the cili project has not been loaded in Wn):

>>> import wn
>>> oewn = wn.Wordnet('oewn')
>>> oewn.synsets('velocity')[0].ili.id  # an explicit ID
'i117563'
>>> oewn.synsets('velocity')[0].ili.status
'presupposed'
>>> oewn.synsets('velocity')[0].ili.definition()
>>> oewn.synsets('snow day')[0].ili.id  # ili="in" is special and the ID is None in Wn
>>> oewn.synsets('snow day')[0].ili.status
'proposed'
>>> oewn.synsets('snow day')[0].ili.definition()
'a day on which school or other events are cancelled due to snow'

Note:

  • The status "presupposed" means that the synset has an explicit ILI but there is no authoritative source to say whether the ILI is valid or not. The status "proposed" means that the lexicon used the special ILI "in".
  • Explicit ILIs do not have ILI definitions in the lexicon, but proposed ILIs do. Note that ILI definitions are separate from synset definitions.
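The linking described above can be pictured with a toy model. This is not Wn's implementation (Wn stores ILIs in its database); it is a minimal sketch in which synsets are plain tuples, `None` stands in for the special `ili="in"` value, and the names `link_by_ili` and `translate` are hypothetical. The synset and ILI identifiers are the ones quoted from OEWN 2021 and OMW 1.4 above.

```python
from collections import defaultdict

# (lexicon, synset id, ILI id); None models ili="in", i.e. a proposed ILI.
SYNSETS = [
    ("oewn", "oewn-15307914-n", "i117563"),
    ("omw-es", "omw-es-15282696-n", "i117563"),
    ("oewn", "oewn-90002921-n", None),
]


def link_by_ili(records):
    """Group synsets by explicit ILI; proposed ILIs (None) are left out."""
    by_ili = defaultdict(list)
    for lexicon, synset_id, ili_id in records:
        if ili_id is None:
            continue  # proposed ILIs are not used for linking
        by_ili[ili_id].append((lexicon, synset_id))
    return dict(by_ili)


def translate(synset_id, target_lexicon, records):
    """Return synsets in target_lexicon that share synset_id's ILI."""
    ili_of = {sid: ili for _, sid, ili in records}
    ili_id = ili_of.get(synset_id)
    if ili_id is None:
        return []  # unknown synset or proposed ILI: nothing to link through
    linked = link_by_ili(records).get(ili_id, [])
    return [sid for lex, sid in linked if lex == target_lexicon]


spanish = translate("oewn-15307914-n", "omw-es", SYNSETS)
snow_day = translate("oewn-90002921-n", "omw-es", SYNSETS)
```

In this model a merged ILI fails in the same way as in the bug report: two synsets whose ILIs differ are never linked, whatever an external mapping file says about them.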

When the cili resource has been loaded, the presupposed statuses can change and their definitions become available:

>>> wn.download('cili')
...
>>> oewn.synsets('velocity')[0].ili.status
'active'
>>> oewn.synsets('velocity')[0].ili.definition()
'distance travelled per unit time'

The cili resource that is added here contains only a list of ILIs and their definitions (and maybe statuses in a future version: globalwordnet/cili#8), and does not contain any mappings to PWN 3.0 or 3.1 synsets.

Does that help?

0 reactions
ekaf commented, Nov 17, 2022

As @goodmami wrote:

what if you translate in the other direction where the single ILI is “split” into two?

Yes, the inverse problem is that currently, when translating in the opposite direction, Wn only returns one of the merged synsets:

i2 = "i37882"
print(wn.Wordnet("oewn").synsets(ili = i2)[0].translate("omw-fi"))

[Synset('omw-fi-00474568-n')]

In that case, the complete translation would be the union of the senses belonging to all the synsets obtained by reversing the ilimap from above:

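def rev_dict(dic):
    """Invert a mapping, collecting all keys that share a value into a set."""
    rdic = {}
    for key, val in dic.items():
        # setdefault creates the set the first time val is seen
        rdic.setdefault(val, set()).add(key)
    return rdic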

sources = rev_dict(ilimap)[i2]

print(f"{sources} --> {i2}")

{'i37881', 'i37882'} --> i37882

print([wn.Wordnet("omw-fi").synsets(ili = i)[0].senses() for i in sources])

[[Sense('omw-fi-baseball-00471613-n'), Sense('omw-fi-baseball--peli-00471613-n')], [Sense('omw-fi-baseball-00474568-n')]]
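The definition of `ilimap` is not reproduced in this excerpt. Judging by the printed output, it maps each ILI to the ILI that survives the merge (i37881 and i37882 both map to i37882). The sketch below shows one way such a dictionary might be built, under the assumption that the survivor is the single group member the target lexicon actually uses; `build_ilimap` and `target_ilis` are hypothetical names.

```python
def build_ilimap(ili_to_pwn, target_ilis):
    """Map every ILI to the surviving ILI of its merge group.

    ili_to_pwn:  {ILI id: PWN 3.1 synset id}, e.g. read from the mapping file
    target_ilis: set of ILI ids that the target lexicon uses (assumption)
    """
    # Group the ILIs that point at the same PWN synset.
    groups = {}
    for ili_id, pwn_id in ili_to_pwn.items():
        groups.setdefault(pwn_id, []).append(ili_id)

    ilimap = {}
    for members in groups.values():
        survivors = [i for i in members if i in target_ilis]
        for ili_id in members:
            # Redirect only when the survivor is unambiguous; otherwise
            # leave the ILI mapped to itself.
            ilimap[ili_id] = survivors[0] if len(survivors) == 1 else ili_id
    return ilimap


ilimap = build_ilimap(
    {"i37881": "00472688-n", "i37882": "00472688-n"},
    target_ilis={"i37882"},
)
```

Passing this `ilimap` to `rev_dict` gives the same `{'i37881', 'i37882'}` set for i37882 as shown above.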
