Duplicate aliases
Hi @DeNeutoy and team!
This is just a general question, but I was wondering about the duplicate aliases in the UMLS linker. For example, this is the entry for C3657270 (Nivolumab):
CUI: C3657270, Name: nivolumab
Definition: A fully human immunoglobulin (Ig) G4 monoclonal antibody directed against the negative immunoregulatory human cell surface receptor programmed death-1 (PD-1, PCD-1) with immune checkpoint inhibitory and antineoplastic activities. Upon administration, nivolumab binds to and blocks the activation of PD-1, an immunoglobulin superfamily (IgSF) transmembrane protein, by its ligands programmed cell death ligand 1 (PD-L1), which is overexpressed on certain cancer cells, and programmed cell death ligand 2 (PD-L2), which is primarily expressed on antigen-presenting cells (APCs). This results in the activation of T-cells and cell-mediated immune responses against tumor cells. Activated PD-1 negatively regulates T-cell activation and plays a key role in tumor evasion from host immunity.
TUI(s): T116, T121, T129
Aliases (abbreviated, total: 19):
nivolumab, nivolumab, nivolumab, nivolumab, nivolumab, Nivolumab, Nivolumab, Nivolumab, Nivolumab, Nivolumab
These are the 19 aliases listed for Nivolumab, which contain quite a few duplicates:
aliases = linker.umls.cui_to_entity["C3657270"].aliases
print(aliases)
> ['nivolumab', 'nivolumab', 'nivolumab', 'nivolumab', 'nivolumab', 'Nivolumab', 'Nivolumab', 'Nivolumab', 'Nivolumab', 'Nivolumab', 'Nivolumab', 'NIVOLUMAB', 'NIVOLUMAB', 'NIVOLUMAB', 'Nivolumab (substance)', 'NIVO', 'NIVO', 'Product containing nivolumab (medicinal product)', 'Nivolumab-containing product']
print(set(aliases))
> {'NIVOLUMAB', 'nivolumab', 'NIVO', 'Nivolumab', 'Nivolumab (substance)', 'Nivolumab-containing product', 'Product containing nivolumab (medicinal product)'}
Why are there duplicates in these lists? Do these maybe originate from the different vocabularies in UMLS (corresponding to the atoms)? And related to this question: could it make the entity linker more efficient if these aliases were de-duplicated?
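For reference, the de-duplication asked about above can be sketched in a few lines. This is only an illustration on the list printed in this issue, not scispacy's own code; `dedupe_preserving_order` is a hypothetical helper name. Unlike `set(aliases)`, it keeps the first-seen order of the aliases.

```python
def dedupe_preserving_order(aliases):
    """Drop exact duplicates while keeping the first-seen order.

    Hypothetical helper for illustration; not part of scispacy.
    """
    # dict keys are unique and keep insertion order (Python 3.7+).
    return list(dict.fromkeys(aliases))


# The alias list printed above for C3657270.
aliases = [
    'nivolumab', 'nivolumab', 'nivolumab', 'nivolumab', 'nivolumab',
    'Nivolumab', 'Nivolumab', 'Nivolumab', 'Nivolumab', 'Nivolumab',
    'Nivolumab', 'NIVOLUMAB', 'NIVOLUMAB', 'NIVOLUMAB',
    'Nivolumab (substance)', 'NIVO', 'NIVO',
    'Product containing nivolumab (medicinal product)',
    'Nivolumab-containing product',
]

unique_aliases = dedupe_preserving_order(aliases)
print(len(aliases), len(unique_aliases))
print(unique_aliases)
```

This is a case-sensitive comparison, so 'nivolumab', 'Nivolumab' and 'NIVOLUMAB' are still kept as three separate aliases.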
Issue Analytics
- State:
- Created 3 years ago
- Reactions:1
- Comments:5 (2 by maintainers)
Top GitHub Comments
By default, TfidfVectorizer converts strings to lowercase:
https://scikit-learn.org/stable/modules/generated/sklearn.feature_extraction.text.TfidfVectorizer.html
So when removing duplicate aliases, shouldn't we ignore case? In the example mentioned by @ChantalvanSon,
'NIVOLUMAB', 'nivolumab', 'Nivolumab'
would then become the same alias, which could further reduce the size of concept_aliases.json.
The TfidfVectorizer is called at https://github.com/allenai/scispacy/blob/master/scispacy/candidate_generation.py#L410
which means we are using the default value for the lowercase parameter.
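To make the suggestion concrete, here is a minimal sketch of case-insensitive de-duplication. It assumes that lowercasing with `str.lower()` is an acceptable notion of "same alias" (mirroring what the vectorizer does by default); `dedupe_ignoring_case` is a hypothetical helper, not an existing scispacy function.

```python
def dedupe_ignoring_case(aliases):
    """Keep one alias per lowercased form, preserving first-seen order.

    Hypothetical helper for illustration; the first spelling encountered
    is the one that is kept.
    """
    seen = set()
    unique = []
    for alias in aliases:
        key = alias.lower()
        if key not in seen:
            seen.add(key)
            unique.append(alias)
    return unique


# The seven distinct aliases from the example in this issue.
distinct_aliases = [
    'nivolumab', 'Nivolumab', 'NIVOLUMAB', 'Nivolumab (substance)', 'NIVO',
    'Product containing nivolumab (medicinal product)',
    'Nivolumab-containing product',
]

case_insensitive = dedupe_ignoring_case(distinct_aliases)
print(len(distinct_aliases), len(case_insensitive))
print(case_insensitive)
```

On this example the three spellings of "nivolumab" collapse into one entry. Whether the original casing needs to be kept for display purposes is a separate design question.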
A question:
@DeNeutoy @danielkingai2 If we change the list of concept aliases, the vector representation of those aliases also changes, because the document frequency of each char trigram in the vocabulary changes. Isn't that going to affect the similarity scores between entity candidates and the concept aliases?
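A small sketch of why the weights would shift. It uses the smoothed IDF formula that scikit-learn documents as its default (`smooth_idf=True`), idf(t) = ln((1 + n) / (1 + df(t))) + 1, on a toy alias list made up for illustration. The trigram extraction here is a simplified plain-Python stand-in, not the exact analyzer configured in scispacy.

```python
import math


def char_trigrams(text):
    """Set of lowercased character trigrams (simplified analyzer)."""
    text = text.lower()
    return {text[i:i + 3] for i in range(len(text) - 2)}


def smoothed_idf(documents):
    """idf(t) = ln((1 + n) / (1 + df(t))) + 1, scikit-learn's default."""
    n_docs = len(documents)
    doc_freq = {}
    for doc in documents:
        for gram in char_trigrams(doc):
            doc_freq[gram] = doc_freq.get(gram, 0) + 1
    return {
        gram: math.log((1 + n_docs) / (1 + count)) + 1
        for gram, count in doc_freq.items()
    }


# Toy alias lists, invented for illustration only.
with_duplicates = ["nivolumab", "nivolumab", "nivolumab", "NIVO", "aspirin"]
deduplicated = ["nivolumab", "NIVO", "aspirin"]

idf_before = smoothed_idf(with_duplicates)
idf_after = smoothed_idf(deduplicated)

# Compare the weight of a trigram from the duplicated alias ("niv")
# and of one from an alias that was never duplicated ("spi").
print("niv:", idf_before["niv"], idf_after["niv"])
print("spi:", idf_before["spi"], idf_after["spi"])
```

In this toy case, removing the duplicates raises the IDF of trigrams from the duplicated alias and lowers it for the others, so the TF-IDF vectors and the resulting similarity scores do change. How much this matters on the full UMLS alias list would need to be measured.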
@DeNeutoy @danielkingai2 In case you missed it, I made a suggestion on creating unique aliases in the conversation at https://github.com/allenai/scispacy/pull/274#issuecomment-726096229