
Different embeddings with different lengths

See original GitHub issue

I ran into an issue where encoding the same sentence as part of lists of different lengths gives slightly different embeddings.

Here is some code showing what I mean:

from laserembeddings import Laser
import numpy as np

laser = Laser()
a = laser.embed_sentences(["apple", "banana", "clementina"], lang='en')
b = laser.embed_sentences(["apple"], lang='en')
c = laser.embed_sentences(["apple", "potato", "strawberry"], lang='en')

(a[0] == b[0]).all()  # check whether all elements are the same
# False
(a[0] == c[0]).all()
# True
np.linalg.norm(a[0] - b[0])
# 1.3968409e-07
np.linalg.norm(a[0] - c[0])
# 0.0

My goal is to get the same embedding for the sentence "apple" no matter how long the input list is, but that does not seem to be possible with the current version of laserembeddings. I would like to know whether this behavior is intentional or a bug.
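
A quick sanity check, assuming the discrepancies stay at the ~1e-07 scale shown above, is to compare the vectors with a tolerance instead of exact equality. This is only a minimal sketch on top of the snippet above, not part of laserembeddings:

from laserembeddings import Laser
import numpy as np

laser = Laser()
a = laser.embed_sentences(["apple", "banana", "clementina"], lang='en')
b = laser.embed_sentences(["apple"], lang='en')

# Exact comparison fails, but the vectors agree to within float32 precision.
print(np.allclose(a[0], b[0], atol=1e-6))  # expected: True
print(np.abs(a[0] - b[0]).max())           # expected: on the order of 1e-07 or below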

Issue Analytics

  • State: closed
  • Created 4 years ago
  • Comments: 6 (3 by maintainers)

Top GitHub Comments

1 reaction
yannvgn commented, Mar 1, 2020

Ok, but your batch is relatively small (3 elements). If you try with more sentences, you should see that the batched version is faster:

import timeit
from laserembeddings import Laser

laser = Laser()
laser_1 = Laser(embedding_options={'max_sentences': 1})

# 300 sentences
sentences = ['apple', 'potato', 'strawberry'] * 100

def init():
    laser.embed_sentences(['init'], lang='en')

def test_with_batch():
    laser.embed_sentences(sentences, lang='en')

def test_without_batch():
    laser_1.embed_sentences(sentences, lang='en')

# make sure everything is initialized
init()

print(timeit.timeit(test_with_batch, number=10))
# 4.908349000004819

print(timeit.timeit(test_without_batch, number=10))
# 66.42792240000563
0 reactions
vchulski commented, Mar 1, 2020

Ok, but your batch is relatively small (3 elements).

That’s a valid point. Thanks for the example you provided and for all of your answers.
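
Putting the two comments together: batching is what makes embed_sentences fast for large lists, and it also appears to be the source of the tiny float-level differences reported above. If bit-for-bit reproducibility matters more than throughput, the max_sentences=1 option used in the benchmark should make each sentence’s embedding independent of the rest of the list. A minimal sketch under that assumption:

from laserembeddings import Laser

# Disable batching so every sentence is encoded on its own
# (slower, as the benchmark above shows, but independent of list contents).
laser_1 = Laser(embedding_options={'max_sentences': 1})

a = laser_1.embed_sentences(["apple", "banana", "clementina"], lang='en')
b = laser_1.embed_sentences(["apple"], lang='en')

print((a[0] == b[0]).all())  # expected: True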

