Merged synsets are lost in translation
Describe the bug
Wn loses some merged synsets in translation, even though the original CILI mappings correctly link the merged source synsets to the same target synset.
To Reproduce
For example, consider these two synsets in the ili-map-pwn31.tab mapping, which map to the same PWN 3.1 target:
i37881 00472688-n
i37882 00472688-n
With Wn, the first synset (i37881) has no translation in OEWN, although it should have one, because the mapping merges i37881 and i37882 into the same target:
import wn
wnfi = wn.Wordnet("omw-fi")
ss1 = wnfi.synsets(ili="i37881")[0]
print(f"{ss1.ili.id}, {ss1.senses()}, {ss1.translate('oewn')}")
i37881, [Sense('omw-fi-baseball-00471613-n'), Sense('omw-fi-baseball--peli-00471613-n')], []
So the translation above is just the empty list ([]).
By contrast, the other merged synset translates correctly:
ss2 = wnfi.synsets(ili="i37882")[0]
print(f"{ss2.ili.id}, {ss2.senses()}, {ss2.translate('oewn')}")
i37882, [Sense('omw-fi-baseball-00474568-n')], [Synset('oewn-00472688-n')]
The same problem occurs with any other merged synsets.
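To get a sense of how many synsets are affected, the mapping file can be scanned for targets shared by more than one ILI. The following is a minimal sketch, assuming the file holds whitespace-separated "ILI target" pairs like the two lines quoted above. The helper name find_merged and the third sample line are made up for illustration and are not part of Wn or the real mapping.

```python
from collections import defaultdict


def find_merged(lines):
    """Group ILI ids by target synset and keep targets shared by several ILIs."""
    by_target = defaultdict(list)
    for line in lines:
        parts = line.split()
        if len(parts) != 2:
            continue  # skip blank or malformed lines
        ili, target = parts
        by_target[target].append(ili)
    return {t: ilis for t, ilis in by_target.items() if len(ilis) > 1}


# The two lines quoted in this report, plus a placeholder unmerged entry.
sample = """\
i37881\t00472688-n
i37882\t00472688-n
iPLACEHOLDER\t99999999-n
"""
merged = find_merged(sample.splitlines())
print(merged)  # {'00472688-n': ['i37881', 'i37882']}
```

Every ILI listed in the result except the one Wn happens to keep would be a candidate for the lost translation described here.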
Expected behavior
The first synset (i37881) should have a translation in OEWN, as it would if the CILI mapping were used as intended.
Environment
python --version
python -m wn --version
python -m wn lexicons
Python 3.9.2
Wn 0.9.2
oewn 2021 [en] Open English WordNet
omw-en 1.4 [en] OMW English Wordnet based on WordNet 3.0
omw-cmn 1.4 [cmn-Hans] Chinese Open Wordnet
omw-es 1.4 [es] Multilingual Central Repository (Spanish)
omw-lt 1.4 [lt] Lithuanian WordNet
omw-pt 1.4 [pt] OpenWN-PT
omw-id 1.4 [id] Wordnet Bahasa (Indonesian)
omw-he 1.4 [he] Hebrew Wordnet
omw-eu 1.4 [eu] Multilingual Central Repository (Basque)
omw-sq 1.4 [sq] Albanet
omw-zsm 1.4 [zsm] Wordnet Bahasa (Malaysian)
omw-arb 1.4 [arb] Arabic WordNet (AWN v2)
omw-ca 1.4 [ca] Multilingual Central Repository (Catalan)
omw-fi 1.4 [fi] FinnWordNet
omw-sv 1.4 [sv] WordNet-SALDO
omw-gl 1.4 [gl] Multilingual Central Repository (Galician)
omw-el 1.4 [el] Greek Wordnet
omw-pl 1.4 [pl] plWordNet
omw-iwn 1.4 [it] ItalWordNet
omw-ro 1.4 [ro] Romanian Wordnet
omw-nl 1.4 [nl] Open Dutch WordNet
omw-ja 1.4 [ja] Japanese Wordnet
omw-fr 1.4 [fr] WOLF (Wordnet Libre du Français)
omw-sk 1.4 [sk] Slovak WordNet
omw-is 1.4 [is] IceWordNet
omw-it 1.4 [it] MultiWordNet (Italian)
omw-hr 1.4 [hr] Croatian Wordnet
omw-th 1.4 [th] Thai Wordnet
omw-bg 1.4 [bg] BulTreeBank Wordnet (BTB-WN)
omw-nb 1.4 [nb] Norwegian Wordnet (Bokmål)
omw-da 1.4 [da] DanNet
omw-nn 1.4 [nn] Norwegian Wordnet (Nynorsk)
omw-sl 1.4 [sl] sloWNet
Additional Context
At the moment, using the PWN sense keys for translation seems to be the only way to work around the problem in Wn. However, this is not easy, since a rather long detour is needed to obtain the sense keys in the ‘oewn’ lexicon.
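A mapping-based fallback might be another option, although it has not been tried against Wn here. The sketch below retries the translation through sibling ILIs that share a target in the mapping. The names translate_with_fallback and stub_translate are hypothetical. With Wn, the translate callable could wrap wnfi.synsets(ili=...) and translate('oewn') as in the reproduction steps; the stub only mimics the results reported in this issue.

```python
def translate_with_fallback(ili, ili_to_target, translate):
    """Translate `ili`; if nothing comes back, try ILIs merged into the same target.

    ili_to_target: dict built from the mapping file (ILI id -> target synset id).
    translate: callable taking an ILI id and returning a list of translations.
    """
    result = translate(ili)
    if result:
        return result
    target = ili_to_target.get(ili)
    if target is None:
        return []  # the ILI is not in the mapping, so there are no siblings
    for sibling, sibling_target in ili_to_target.items():
        if sibling != ili and sibling_target == target:
            result = translate(sibling)
            if result:
                return result
    return []


# Mapping taken from the two lines quoted in this report.
ili_to_target = {"i37881": "00472688-n", "i37882": "00472688-n"}


def stub_translate(ili):
    # Stand-in for Wn: only i37882 translates, as reported above.
    return {"i37882": ["oewn-00472688-n"]}.get(ili, [])


print(translate_with_fallback("i37881", ili_to_target, stub_translate))
```

This keeps the normal translation path untouched and only consults the mapping when the direct lookup returns an empty list.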
Issue Analytics
- State:
- Created 10 months ago
- Comments: 5 (2 by maintainers)
@ekaf I think the CILI as a resource is better thought of as the inventory of identifiers and their definitions than as a collection of mappings. The mappings to synsets should be maintained by the respective wordnet projects, although in practice we keep some mapping files in the CILI repository. Those mapping files are used when creating the WN-LMF exports of the PWN.
Let me try to describe ILI support in Wn. WN-LMF lexicons can link synsets to individual ILIs like this (example from OEWN 2021):
These ILIs are stored in Wn’s database linked to the synsets. When a second lexicon is loaded containing synsets with the same ILIs, such as this (from OMW 1.4’s Spanish wordnet):
… then Wn is able to use the shared ILI to link the synsets across lexicons for translation or expanded relation traversal.
Another thing we see is synsets with the special ILI "in", which indicates that that version of the lexicon is proposing the synset as a candidate for a new ILI. For example:
These proposed ILIs are not used for translation or expanded relation traversals.
In Wn, the ILIs are represented by a class with an id, a status, and a definition. For example (here, the cili project has not been loaded in Wn):
Note: "presupposed" means that the synset has an explicit ILI but there is no authoritative source to say whether the ILI is valid or not. The status "proposed" means that the lexicon used the special ILI "in".
When the cili resource has been loaded, the "presupposed" statuses can change and their definitions become available:
The cili resource that is added here contains only a list of ILIs and their definitions (and maybe statuses in a future version: globalwordnet/cili#8), and does not contain any mappings to PWN 3.0 or 3.1 synsets.
Does that help?
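The status rules described in this comment can be summarized in a small standalone model. This is only an illustrative sketch and not Wn's actual class or API. IliInfo, describe_ili, the inventory dict, the "active" status value, the placeholder definition, and the choice of None as the id of a proposed ILI are all assumptions made for the example.

```python
from dataclasses import dataclass
from typing import Optional


@dataclass
class IliInfo:
    id: Optional[str]          # None for a proposed ILI (assumption)
    status: str
    definition: Optional[str]  # None while no definition is known


def describe_ili(ili_attr, inventory=None):
    """Model the statuses described above.

    ili_attr: the ILI value a lexicon gives a synset ("in" marks a proposal).
    inventory: optional dict of ILI id -> (status, definition), standing in
        for what a loaded cili resource might provide.
    """
    if ili_attr == "in":
        return IliInfo(None, "proposed", None)
    if inventory and ili_attr in inventory:
        status, definition = inventory[ili_attr]
        return IliInfo(ili_attr, status, definition)
    # Explicit ILI, but nothing authoritative to confirm it.
    return IliInfo(ili_attr, "presupposed", None)


print(describe_ili("i37881"))   # without an inventory
print(describe_ili("in"))       # special ILI used for proposals

# Hypothetical inventory entry; status and definition are placeholders.
inventory = {"i37881": ("active", "placeholder definition")}
print(describe_ili("i37881", inventory))
```

Note that nothing in this model consults a mapping to PWN synsets, which matches the point that the cili resource carries identifiers and definitions only.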
As @goodmami wrote:
Yes, the inverse problem is that currently, when translating in the opposite direction, Wn only returns one of the merged synsets:
In that case, the complete translation would be the union of the senses belonging to all the synsets obtained by reversing the ilimap from above:
print([wn.Wordnet("omw-fi").synsets(ili=i)[0].senses() for i in sources])
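The sources list is not defined in the line above. One way it could be derived, as a rough sketch, is to reverse the ILI-to-target mapping and collect every ILI that points at the same target. The helpers sources_for and union_of_senses are hypothetical, and senses_of stands in for a lookup like the synsets(ili=i)[0].senses() call above. The stub data reuses the sense ids printed in the original report.

```python
def sources_for(target, ili_to_target):
    """Return every ILI id that the mapping sends to `target`."""
    return sorted(i for i, t in ili_to_target.items() if t == target)


def union_of_senses(sources, senses_of):
    """Collect the senses of all source ILIs, without duplicates, in order."""
    seen = []
    for ili in sources:
        for sense in senses_of(ili):
            if sense not in seen:
                seen.append(sense)
    return seen


# Mapping taken from the two lines quoted in the report.
ili_to_target = {"i37881": "00472688-n", "i37882": "00472688-n"}

# Stand-in for Wn lookups: sense ids as printed in the report.
stub_senses = {
    "i37881": ["omw-fi-baseball-00471613-n", "omw-fi-baseball--peli-00471613-n"],
    "i37882": ["omw-fi-baseball-00474568-n"],
}

sources = sources_for("00472688-n", ili_to_target)
all_senses = union_of_senses(sources, lambda i: stub_senses.get(i, []))
print(sources)
print(all_senses)
```

With real data, a source ILI may have no synset in the lexicon, so indexing with [0] as in the one-liner above would need a guard.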