Find the closest 5 sentences of the corpus for each query sentence based on cosine similarity
Hi there,
I want to exploit semantic search through cosine similarity and to do so, I have prepared the following datasets:
Queries: <class 'list'> 179435
Corpus embeddings: <class 'numpy.ndarray'> (31257735, 128)
Corpus: <class 'list'> 31257735
Although the same code ran fine on Google Colab (with a different embedding size, 768), on the server pytorch_cos_sim got stuck and then threw the following error:
---------------------------------------------------------------------------
RuntimeError Traceback (most recent call last)
<ipython-input-18-9f8d1c8ab6d4> in <module>
5
6 # We use cosine-similarity and torch.topk to find the highest 5 scores
----> 7 cos_scores = util.pytorch_cos_sim(query_embedding, corpus_embeddings)[0]
8 top_results = torch.topk(cos_scores, k=top_k)
9
~/anaconda3/envs/method2/lib/python3.8/site-packages/sentence_transformers/util.py in pytorch_cos_sim(a, b)
19 :return: Matrix with res[i][j] = cos_sim(a[i], b[j])
20 """
---> 21 return cos_sim(a, b)
22
23 def cos_sim(a: Tensor, b: Tensor):
~/anaconda3/envs/method2/lib/python3.8/site-packages/sentence_transformers/util.py in cos_sim(a, b)
40 a_norm = torch.nn.functional.normalize(a, p=2, dim=1)
41 b_norm = torch.nn.functional.normalize(b, p=2, dim=1)
---> 42 return torch.mm(a_norm, b_norm.transpose(0, 1))
43
44
RuntimeError: Tensor for argument #3 'mat2' is on CPU, but expected it to be on GPU (while checking arguments for addmm)
Could you please explain how to debug this error?
Let me just add that due to the lack of memory, I employed PCA for dimensionality reduction.
Regards, Javad
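The traceback says that the second matrix (the corpus embeddings, a numpy.ndarray in the question) is on the CPU while the query embedding is on the GPU. A minimal sketch of one way around this, assuming the whole corpus does not fit in GPU memory at once: keep the corpus where it is, move it to the query's device one chunk at a time, and merge the per-chunk top-k results. The names `top_k_cosine` and `chunk_size` are hypothetical and not part of sentence-transformers.

```python
import torch


def top_k_cosine(queries, corpus, k=5, chunk_size=100_000, device=None):
    """Return (scores, indices) of the k most cosine-similar corpus rows per query."""
    if device is None:
        device = "cuda" if torch.cuda.is_available() else "cpu"

    # Queries are small, so they can sit on the target device the whole time.
    queries = torch.as_tensor(queries, dtype=torch.float32).to(device)
    queries = torch.nn.functional.normalize(queries, p=2, dim=1)

    # Accepts a numpy array or a tensor; the corpus is not moved here.
    corpus = torch.as_tensor(corpus, dtype=torch.float32)

    best_scores, best_idx = None, None
    for start in range(0, corpus.shape[0], chunk_size):
        # Only this slice is moved to the same device as the queries.
        chunk = corpus[start:start + chunk_size].to(device)
        chunk = torch.nn.functional.normalize(chunk, p=2, dim=1)

        scores = queries @ chunk.T  # shape: (n_queries, chunk_len)
        top_scores, top_idx = torch.topk(scores, k=min(k, scores.shape[1]), dim=1)
        top_idx = top_idx + start  # convert chunk-local to global indices

        # Merge with the best results from earlier chunks.
        if best_scores is not None:
            top_scores = torch.cat([best_scores, top_scores], dim=1)
            top_idx = torch.cat([best_idx, top_idx], dim=1)
        keep = torch.topk(top_scores, k=min(k, top_scores.shape[1]), dim=1)
        best_scores = keep.values
        best_idx = torch.gather(top_idx, 1, keep.indices)

    return best_scores.cpu(), best_idx.cpu()
```

Calling `top_k_cosine(query_embedding, corpus_embeddings, k=5)` would then take the place of the `pytorch_cos_sim` plus `torch.topk` pair in the snippet above. The chunk size trades GPU memory for the number of host-to-device copies.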
32D might be a bit too little, see: https://arxiv.org/abs/2012.14210
Otherwise sounds good.
You could also use GPU0 for your model and GPU1/2 to store the corpus embeddings.
10M embeddings with 768 dimensions in float32 require about 30 GB of memory. With fp16 it would be about 15 GB, but then you have no memory left for computation.
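The arithmetic behind those figures, as a small sketch: raw size is number of vectors times dimensions times bytes per value. The helper name `embedding_memory_gb` is hypothetical, and the result covers only the embedding matrix itself, not the similarity scores or any other working memory.

```python
def embedding_memory_gb(num_vectors, dim, bytes_per_value):
    """Raw size of a dense embedding matrix in decimal gigabytes (1 GB = 1e9 bytes)."""
    return num_vectors * dim * bytes_per_value / 1e9


# 10M embeddings, 768 dimensions
print(embedding_memory_gb(10_000_000, 768, 4))  # float32: about 30.7 GB
print(embedding_memory_gb(10_000_000, 768, 2))  # float16: about 15.4 GB

# The corpus from the question: 31,257,735 embeddings, 128 dimensions, float32
print(embedding_memory_gb(31_257_735, 128, 4))  # about 16.0 GB
```

So even after PCA down to 128 dimensions, the full corpus from the question needs roughly 16 GB in float32 before any computation happens.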
You can try to minimize the embedding size (see our docs).
Or use approximate nearest neighbor (ANN) search with faiss or hnswlib.