Duplicate aliases

Hi @DeNeutoy and team!

This is just a general question, but I was wondering about the duplicate aliases in the UMLS linker. For example, this is the entry for C3657270 (Nivolumab):

CUI: C3657270, Name: nivolumab
Definition: A fully human immunoglobulin (Ig) G4 monoclonal antibody directed against the negative immunoregulatory human cell surface receptor programmed death-1 (PD-1, PCD-1) with immune checkpoint inhibitory and antineoplastic activities. Upon administration, nivolumab binds to and blocks the activation of PD-1, an immunoglobulin superfamily (IgSF) transmembrane protein, by its ligands programmed cell death ligand 1 (PD-L1), which is overexpressed on certain cancer cells, and programmed cell death ligand 2 (PD-L2), which is primarily expressed on antigen-presenting cells (APCs). This results in the activation of T-cells and cell-mediated immune responses against tumor cells. Activated PD-1 negatively regulates T-cell activation and plays a key role in tumor evasion from host immunity.
TUI(s): T116, T121, T129
Aliases (abbreviated, total: 19): 
         nivolumab, nivolumab, nivolumab, nivolumab, nivolumab, Nivolumab, Nivolumab, Nivolumab, Nivolumab, Nivolumab

The full list of 19 aliases for Nivolumab is shown below; it contains quite a few duplicates:

aliases = linker.umls.cui_to_entity["C3657270"].aliases
print(aliases)
> ['nivolumab', 'nivolumab', 'nivolumab', 'nivolumab', 'nivolumab', 'Nivolumab', 'Nivolumab', 'Nivolumab', 'Nivolumab', 'Nivolumab', 'Nivolumab', 'NIVOLUMAB', 'NIVOLUMAB', 'NIVOLUMAB', 'Nivolumab (substance)', 'NIVO', 'NIVO', 'Product containing nivolumab (medicinal product)', 'Nivolumab-containing product']
print(set(aliases))
> {'NIVOLUMAB', 'nivolumab', 'NIVO', 'Nivolumab', 'Nivolumab (substance)', 'Nivolumab-containing product', 'Product containing nivolumab (medicinal product)'}

Why are there duplicates in these lists? Do they perhaps originate from the different vocabularies in UMLS (corresponding to the atoms)? And a related question: would the entity linker be more efficient if these aliases were de-duplicated?
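A minimal sketch of exact de-duplication, for illustration only. Unlike `set(aliases)`, it keeps the first-seen order of the aliases. The helper name `dedupe_preserving_order` is hypothetical and not part of scispacy; the alias list is copied from the output above.

```python
def dedupe_preserving_order(aliases):
    """Drop exact duplicates while keeping the first-seen order."""
    # dict keys are unique and remember insertion order (Python 3.7+)
    return list(dict.fromkeys(aliases))


# The 19 aliases printed above for C3657270
aliases = (
    ["nivolumab"] * 5
    + ["Nivolumab"] * 6
    + ["NIVOLUMAB"] * 3
    + ["Nivolumab (substance)"]
    + ["NIVO"] * 2
    + [
        "Product containing nivolumab (medicinal product)",
        "Nivolumab-containing product",
    ]
)

unique_aliases = dedupe_preserving_order(aliases)
print(len(aliases), "->", len(unique_aliases))
print(unique_aliases)
```

This only removes strings that are identical character for character, so 'nivolumab', 'Nivolumab' and 'NIVOLUMAB' all remain.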

Issue Analytics

  • State: closed
  • Created 3 years ago
  • Reactions: 1
  • Comments: 5 (2 by maintainers)

Top GitHub Comments

2 reactions
kaushikacharya commented, Feb 12, 2021

By default, TfidfVectorizer converts strings to lowercase:

https://scikit-learn.org/stable/modules/generated/sklearn.feature_extraction.text.TfidfVectorizer.html

lowercase : bool, default=True
Convert all characters to lowercase before tokenizing.

So shouldn’t we ignore case when removing duplicate aliases? In the example mentioned by @ChantalvanSon, 'NIVOLUMAB', 'nivolumab' and 'Nivolumab' would then become the same alias, which could further reduce the size of concept_aliases.json.

The TfidfVectorizer is constructed at https://github.com/allenai/scispacy/blob/master/scispacy/candidate_generation.py#L410

    tfidf_vectorizer = TfidfVectorizer(
        analyzer="char_wb", ngram_range=(3, 3), min_df=10, dtype=numpy.float32
    )

which means we are using the default value (True) for the lowercase parameter.
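A minimal sketch of the case-insensitive de-duplication suggested here, assuming aliases are compared via `str.lower()` to mirror the vectorizer's default lowercasing. The helper `dedupe_ignoring_case` is hypothetical, not a scispacy function, and the input is the set of seven distinct aliases from the original question.

```python
def dedupe_ignoring_case(aliases):
    """Keep the first alias seen for each lowercased form."""
    seen = set()
    unique = []
    for alias in aliases:
        key = alias.lower()  # same normalisation as lowercase=True
        if key not in seen:
            seen.add(key)
            unique.append(alias)
    return unique


# The seven distinct aliases for C3657270 from the question above
aliases = [
    "nivolumab",
    "Nivolumab",
    "NIVOLUMAB",
    "Nivolumab (substance)",
    "NIVO",
    "Product containing nivolumab (medicinal product)",
    "Nivolumab-containing product",
]

case_insensitive = dedupe_ignoring_case(aliases)
print(len(aliases), "->", len(case_insensitive))
print(case_insensitive)
```

Keeping the first-seen spelling is an arbitrary choice for this sketch. Whether the original casing needs to be preserved anywhere else in the linker is not something this thread answers.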

A question:

@DeNeutoy @danielkingai2 If we change the list of concept aliases, the vector representation of those aliases also changes, because the document frequency of each char trigram in the vocabulary changes. Won’t that affect the similarity score between an entity candidate and the concept aliases?
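To make the concern concrete, here is a toy sketch in plain Python of how removing duplicates shifts the inverse document frequency of a trigram. It assumes scikit-learn's default smoothed formula, idf = ln((1 + n) / (1 + df)) + 1. The trigram extraction is a simplified approximation of the `char_wb` analyzer, the corpus is invented, and `min_df` is ignored.

```python
import math


def char_wb_trigrams(text):
    """Approximate char_wb trigrams: pad each word with one space per side."""
    grams = set()
    for word in text.lower().split():
        padded = f" {word} "
        for i in range(len(padded) - 2):
            grams.add(padded[i:i + 3])
    return grams


def smoothed_idf(trigram, docs):
    """Smoothed IDF as used by scikit-learn when smooth_idf=True."""
    n = len(docs)
    df = sum(trigram in char_wb_trigrams(doc) for doc in docs)
    return math.log((1 + n) / (1 + df)) + 1


# Invented toy corpus of aliases, including case-only duplicates
corpus = ["nivolumab", "nivolumab", "Nivolumab", "NIVO", "aspirin", "ibuprofen"]
# Case-insensitive de-duplication of the same corpus
deduped = list(dict.fromkeys(alias.lower() for alias in corpus))

idf_before = smoothed_idf("niv", corpus)
idf_after = smoothed_idf("niv", deduped)
print(f"idf('niv') before: {idf_before:.3f}, after: {idf_after:.3f}")
```

In this toy corpus the trigram becomes relatively rarer after de-duplication, so its IDF weight goes up. How much this shifts similarity scores in the real index would have to be measured.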

0 reactions
kaushikacharya commented, Nov 14, 2020

@DeNeutoy @danielkingai2 In case you missed it, I made a suggestion about creating unique aliases in the conversation at https://github.com/allenai/scispacy/pull/274#issuecomment-726096229
