Find the closest 5 sentences of the corpus for each query sentence based on cosine similarity

Hi there,

I want to exploit semantic search through cosine similarity and to do so, I have prepared the following datasets:

Queries: <class 'list'> 179435
Corpus embeddings: <class 'numpy.ndarray'> (31257735, 128)
Corpus: <class 'list'> 31257735

Although the same code ran on Google Colab (with a different embedding size, 768), on the server pytorch_cos_sim got stuck and then raised the following error:

---------------------------------------------------------------------------
RuntimeError                              Traceback (most recent call last)
<ipython-input-18-9f8d1c8ab6d4> in <module>
      5 
      6     # We use cosine-similarity and torch.topk to find the highest 5 scores
----> 7     cos_scores = util.pytorch_cos_sim(query_embedding, corpus_embeddings)[0]
      8     top_results = torch.topk(cos_scores, k=top_k)
      9 

~/anaconda3/envs/method2/lib/python3.8/site-packages/sentence_transformers/util.py in pytorch_cos_sim(a, b)
     19     :return: Matrix with res[i][j]  = cos_sim(a[i], b[j])
     20     """
---> 21     return cos_sim(a, b)
     22 
     23 def cos_sim(a: Tensor, b: Tensor):

~/anaconda3/envs/method2/lib/python3.8/site-packages/sentence_transformers/util.py in cos_sim(a, b)
     40     a_norm = torch.nn.functional.normalize(a, p=2, dim=1)
     41     b_norm = torch.nn.functional.normalize(b, p=2, dim=1)
---> 42     return torch.mm(a_norm, b_norm.transpose(0, 1))
     43 
     44 

RuntimeError: Tensor for argument #3 'mat2' is on CPU, but expected it to be on GPU (while checking arguments for addmm)

I was wondering if you could elaborate more on how to debug the error, please?

Let me just add that due to the lack of memory, I employed PCA for dimensionality reduction.

Regards, Javad

Issue Analytics

  • State: open
  • Created: 2 years ago
  • Comments: 14 (14 by maintainers)

Top GitHub Comments

nreimers commented, Apr 5, 2021 (1 reaction)

32 dimensions might be too few, see: https://arxiv.org/abs/2012.14210

Otherwise sounds good.

You could also use GPU0 for your model and GPU1/2 to store the corpus embeddings.
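
The RuntimeError in the traceback says that one operand of torch.mm is on the CPU while the other is on the GPU, which is what happens when the corpus embeddings are a NumPy array (always on the CPU) and the query embedding comes from a model running on a GPU. Below is a minimal sketch of one way to make the devices agree before computing the scores. The helper name top_k_cosine and the toy random data are assumptions for illustration, not part of sentence-transformers; the device argument could just as well be "cuda:1" if the corpus is kept on a second GPU as suggested above.

```python
import numpy as np
import torch


def top_k_cosine(query_emb, corpus_emb, k=5, device=None):
    """Hypothetical helper: top-k cosine matches with both inputs on one device."""
    if device is None:
        # Assumption: fall back to the CPU when no GPU is visible.
        device = "cuda" if torch.cuda.is_available() else "cpu"
    # torch.as_tensor accepts NumPy arrays as well as existing tensors;
    # .to(device) then moves both operands to the same device.
    q = torch.as_tensor(query_emb, dtype=torch.float32).to(device)
    c = torch.as_tensor(corpus_emb, dtype=torch.float32).to(device)
    # Cosine similarity is the dot product of L2-normalized rows.
    q = torch.nn.functional.normalize(q, p=2, dim=1)
    c = torch.nn.functional.normalize(c, p=2, dim=1)
    scores = q @ c.T
    return torch.topk(scores, k=min(k, c.shape[0]), dim=1)


# Toy stand-ins for the real data: a NumPy corpus and a tensor query.
rng = np.random.default_rng(0)
corpus_embeddings = rng.standard_normal((1000, 128)).astype(np.float32)
query_embedding = torch.from_numpy(corpus_embeddings[:3].copy())

scores, indices = top_k_cosine(query_embedding, corpus_embeddings, k=5)
```

This only resolves the device mismatch. A corpus of 31257735 vectors may still not fit on a single GPU, in which case it would have to be processed in chunks or split across devices.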

nreimers commented, Apr 4, 2021 (1 reaction)

10M embeddings with 768 dimensions in float32 require about 30 GB of memory. With fp16 that drops to about 15 GB, but then you have no memory left for computation.
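
The arithmetic behind these figures is just number of vectors × dimensions × bytes per value. A small sketch, assuming decimal gigabytes (1 GB = 10^9 bytes) and counting only the raw embedding matrix; the function name embedding_memory_gb is made up for this example:

```python
def embedding_memory_gb(num_vectors, dim, bytes_per_value=4):
    """Size of a dense embedding matrix in decimal gigabytes (1 GB = 1e9 bytes)."""
    return num_vectors * dim * bytes_per_value / 1e9


# The case from the comment: 10M vectors with 768 dimensions.
fp32_gb = embedding_memory_gb(10_000_000, 768, bytes_per_value=4)
fp16_gb = embedding_memory_gb(10_000_000, 768, bytes_per_value=2)

# The PCA-reduced corpus from the question: 31257735 vectors, 128 dimensions.
reduced_gb = embedding_memory_gb(31_257_735, 128, bytes_per_value=4)

print(f"float32, 768 dim: {fp32_gb:.1f} GB")
print(f"float16, 768 dim: {fp16_gb:.1f} GB")
print(f"float32, 128 dim, full corpus: {reduced_gb:.1f} GB")
```

By the same formula, the 128-dimensional corpus from the question comes to roughly 16 GB in float32, before any memory needed for the similarity matrix itself.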

You can try to reduce the embedding size (see our docs).

Or use approximate nearest neighbor (ANN) search with faiss or hnswlib.
