
How to generate embeddings for new candidates?


Hi,

I have been going through the code, documentation, and issues to figure out how to obtain embeddings for new candidates; however, I have not been able to figure this out.

I would like to add new candidates to the all_entities_large.t7 file.

Firstly, the script generate_candidates.py is supposed to generate the embeddings given the token_idxs of new entities (the input parameter --saved_cand_ids refers to a file that holds these token_idxs); however, it is not clear how to generate these token_idxs.
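For reference, judging from the workaround in the comments below, the file passed via --saved_cand_ids appears to be nothing more than a torch-serialized 2-D tensor of token ids, one row per entity. A minimal sketch of inspecting such a file (the path is a placeholder):

import torch

# Load a saved candidate-ids file; the path is hypothetical.
cand_ids = torch.load("saved_cand_ids.pt")
print(cand_ids.shape)    # expected: (num_entities, max_cand_length)
print(cand_ids.dtype)    # an integer dtype, e.g. torch.int64
print(cand_ids[0][:12])  # the first few token ids of the first entity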

So, I tried to reverse engineer generating embeddings for the following entity in entity.jsonl file:

{
  "text": " Aristotle (; \"Aristoteles\", ; 384–322 BC) was a Greek philosopher during the Classical period in Ancient Greece, the founder of the Lyceum and the Peripatetic school of philosophy and Aristotelian tradition. Along with his teacher Plato, he has been called the \"Father of Western Philosophy\". His writings cover many subjects – including physics, biology, zoology, metaphysics, logic, ethics, aesthetics, poetry, theatre, music, rhetoric, psychology, linguistics, economics, politics and government. Aristotle provided a complex synthesis of the various philosophies existing prior to him, and it was above all from his teachings that the West inherited its intellectual lexicon, as well as problems and methods of inquiry. As a result, his philosophy has exerted a unique influence on almost every form of knowledge in the West and it continues to be a subject of contemporary philosophical discussion.  Little is known about his life. Aristotle was born in the city of Stagira in Northern Greece. His father, Nicomachus, died when Aristotle was a child, and he was brought up by a guardian. At seventeen or eighteen years of age, he joined Plato's Academy in Athens and remained there until the age of thirty-seven (c. 347 BC). Shortly after Plato died, Aristotle left Athens and, at the request of Philip II of Macedon, tutored Alexander the Great beginning in 343 BC. He established a library in the Lyceum which helped him to produce many of his hundreds of books on papyrus scrolls. Though Aristotle wrote many elegant treatises and dialogues for publication, only around a third of his original",
  "idx": "https://en.wikipedia.org/wiki?curid=308",
  "title": "Aristotle",
  "entity": "Aristotle",
}

Firing up main_dense.py in interactive mode and submitting the above text produces the following named entities (persons only):

[Screenshot omitted: the person mentions detected in the text above, including Aristotle]

I then tried running the samples corresponding to the Aristotle mentions through both the context- and candidate-encoder parts of the BiEncoder and saved the embeddings to disk; however, they are all different from the one in all_entities_large.t7.

Are we supposed to average the embeddings of all the mentions corresponding to the Aristotle entity? Or is there some other logic?

The BLINK paper says the embeddings for candidates were generated by taking the first 10 lines of each entity's Wikipedia description; however, only 32 tokens are submitted to the encoder to obtain an embedding, so I am not sure why 10 lines were selected.
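One possible explanation (my reading, not confirmed by the authors): the tokenizer truncates whatever text it receives down to max_cand_length, so taking the first 10 lines simply guarantees enough raw text survives preprocessing before truncation. A quick sketch of that truncation, using the modern Hugging Face transformers tokenizer (BLINK itself loads its tokenizer through the biencoder, and the exact model name should be checked against biencoder_params["bert_model"]):

from transformers import BertTokenizer

# Sketch: descriptions longer than max_cand_length are silently truncated.
tokenizer = BertTokenizer.from_pretrained("bert-large-uncased")  # assumed model name
max_cand_length = 128  # use the value from biencoder_params["max_cand_length"]

desc = "Aristotle (384-322 BC) was a Greek philosopher ..."  # stand-in for the first 10 lines
tokens = tokenizer.tokenize(desc)
print(len(tokens))                    # typically far more than max_cand_length
kept = tokens[: max_cand_length - 2]  # leave room for [CLS] and [SEP]
print(len(kept))                      # everything beyond this point is discarded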

Thanks!

Issue Analytics

  • State: open
  • Created: 2 years ago
  • Comments: 5

Top GitHub Comments

2 reactions
abhinavkulkarni commented, Jan 19, 2022

Thanks to @ledw-2 and others from other issues, I was able to recreate the embeddings for existing entities (in entity.jsonl) using their Wikipedia descriptions and titles, and was able to verify that they match those in all_entities.t7 up to the 6th decimal place.

Given a new entity title and its description, here’s how to generate its embeddings:


import json

import torch

from blink.biencoder.biencoder import load_biencoder
from blink.biencoder.data_process import get_candidate_representation

# Load biencoder model and biencoder params just like in main_dense.py
# ('args' comes from argparse, exactly as in main_dense.py)
with open(args.biencoder_config) as json_file:
    biencoder_params = json.load(json_file)
    biencoder_params["path_to_model"] = args.biencoder_model
biencoder = load_biencoder(biencoder_params)

# Read the first 10 entities from entity.jsonl (one JSON record per line)
entities = []
count = 10
with open('./models/entity.jsonl') as f:
    for i, line in enumerate(f):
        entity = json.loads(line)
        entities.append(entity)
        if i == count - 1:
            break

# Get token_ids corresponding to each candidate's title and description
tokenizer = biencoder.tokenizer
max_context_length, max_cand_length = biencoder_params["max_context_length"], biencoder_params["max_cand_length"]
max_seq_length = max_cand_length
ids = []

for entity in entities:
    candidate_desc = entity['text']
    candidate_title = entity['title']
    # get_candidate_representation builds the candidate input sequence
    # (title + description), truncated and padded to max_seq_length
    cand_tokens = get_candidate_representation(
        candidate_desc,
        tokenizer,
        max_seq_length,
        candidate_title=candidate_title
    )

    token_ids = cand_tokens["ids"]
    ids.append(token_ids)

ids = torch.tensor(ids)
path = "saved_cand_ids.pt"  # any output path of your choosing
torch.save(ids, path)

The file in which these ids are saved should be passed via the --saved_cand_ids parameter of scripts/generate_candidates.py.
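If you want to sanity-check those ids without running the full script, you can push them through the candidate encoder yourself and compare against the shipped table. A sketch, assuming biencoder is loaded as above, that encode_candidate is the candidate-encoder entry point on BLINK's BiEncoderRanker (method and .device attribute taken from the repo; verify against your version), and that entity.jsonl order matches the rows of the pre-computed table. The tolerance mirrors the ~6-decimal agreement reported above:

import torch

# Load the ids produced above and encode them with the candidate encoder.
cand_ids = torch.load("saved_cand_ids.pt")
biencoder.model.eval()
with torch.no_grad():
    encodings = biencoder.encode_candidate(cand_ids.to(biencoder.device))

# Compare against the corresponding rows of the pre-computed table.
all_entities = torch.load("./models/all_entities_large.t7")
print(torch.allclose(encodings, all_entities[: encodings.shape[0]], atol=1e-6))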

Thanks to the FB team for this awesome project!

0 reactions
amelieyu1989 commented, Nov 10, 2022

I see. You mean I could get my new encoding list with new_encode_list = torch.cat((old_encode_list, new_entities_tokens))? Could you share code if possible?
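In case it helps, the concatenation itself is straightforward; the one catch is that embeddings must be concatenated with embeddings, so the new entities' token ids have to go through the candidate encoder first. A sketch, assuming biencoder is loaded as in the comment above, encode_candidate as before, and new_entity_ids is the token-id tensor built with get_candidate_representation for the new entities (the output path is hypothetical):

import torch

# Shipped embedding table: one row per entity, in entity.jsonl order.
old_encodings = torch.load("./models/all_entities_large.t7")

# Encode the new entities' token ids first (do not concatenate raw ids).
with torch.no_grad():
    new_encodings = biencoder.encode_candidate(new_entity_ids.to(biencoder.device))

# Append the new rows and save; remember to append the matching records
# to entity.jsonl as well, so that row indices stay aligned.
all_encodings = torch.cat((old_encodings, new_encodings), dim=0)
torch.save(all_encodings, "./models/all_entities_large_extended.t7")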
