
Different embeddings with different lengths

See original GitHub issue

I ran into an issue where encoding the same sentence as part of lists of different lengths gives slightly different embeddings.

Here is some code showing what I mean:

from laserembeddings import Laser
import numpy as np

laser = Laser()
a = laser.embed_sentences(["apple", "banana", "clementina"], lang='en')
b = laser.embed_sentences(["apple"], lang='en')
c = laser.embed_sentences(["apple", "potato", "strawberry"], lang='en')

(a[0] == b[0]).all()  # check whether all elements are the same
# False
(a[0] == c[0]).all()
# True
np.linalg.norm(a[0] - b[0])
# 1.3968409e-07
np.linalg.norm(a[0] - c[0])
# 0.0

My goal is to get the same embedding for the sentence "apple" no matter how long the input list is, but that does not seem to be possible with the current version of laserembeddings. I would like to know whether this behavior is intentional or a bug.
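
A quick sanity check, assuming the discrepancies stay at the ~1e-07 scale shown above, is to compare the vectors with a tolerance instead of exact equality. This is only a minimal sketch on top of the snippet above, not part of laserembeddings:

from laserembeddings import Laser
import numpy as np

laser = Laser()
a = laser.embed_sentences(["apple", "banana", "clementina"], lang='en')
b = laser.embed_sentences(["apple"], lang='en')

# Exact comparison fails, but the vectors agree to within float32 precision.
print(np.allclose(a[0], b[0], atol=1e-6))  # expected: True
print(np.abs(a[0] - b[0]).max())           # expected: on the order of 1e-07 or below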

Issue Analytics

  • State: closed
  • Created 4 years ago
  • Comments: 6 (3 by maintainers)

Top GitHub Comments

1 reaction
yannvgn commented, Mar 1, 2020

Ok, but your batch is relatively small (3 elements). If you try with more sentences, you should see that the batched version is faster:

import timeit
from laserembeddings import Laser

laser = Laser()
laser_1 = Laser(embedding_options={'max_sentences': 1})

# 300 sentences
sentences = ['apple', 'potato', 'strawberry'] * 100

def init():
    laser.embed_sentences(['init'], lang='en')

def test_with_batch():
    laser.embed_sentences(sentences, lang='en')

def test_without_batch():
    laser_1.embed_sentences(sentences, lang='en')

# make sure everything is initialized
init()

print(timeit.timeit(test_with_batch, number=10))
# 4.908349000004819

print(timeit.timeit(test_without_batch, number=10))
# 66.42792240000563
0 reactions
vchulski commented, Mar 1, 2020

Ok, but your batch is relatively small (3 elements).

That’s a valid point. Thanks for the example you provided and for all of your answers.
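
Putting the two comments together: batching is what makes embed_sentences fast for large lists, and it also appears to be the source of the tiny float-level differences reported above. If bit-for-bit reproducibility matters more than throughput, the max_sentences=1 option used in the benchmark should make each sentence’s embedding independent of the rest of the list. A minimal sketch under that assumption:

from laserembeddings import Laser

# Disable batching so every sentence is encoded on its own
# (slower, as the benchmark above shows, but independent of list contents).
laser_1 = Laser(embedding_options={'max_sentences': 1})

a = laser_1.embed_sentences(["apple", "banana", "clementina"], lang='en')
b = laser_1.embed_sentences(["apple"], lang='en')

print((a[0] == b[0]).all())  # expected: True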

