
BertEmbeddings doesn't generate an embedding for every token


Hello, this might have a simple answer —

Under what circumstances will BertEmbeddings not generate an embedding for a token?

Description

I am tokenizing the document below and noticing that the number of embeddings produced by BertEmbeddings is smaller than the number of tokens produced by Spark NLP's Tokenizer.

Expected Behavior

I expect the number of embeddings returned by BertEmbeddings to be exactly equal to the number of tokens returned by the Tokenizer.

Current Behavior

In the example document I'll give below, BertEmbeddings returns a list of 906 embeddings for 916 tokens.

Possible Solution

If there is a sensible reason for a token not to be embedded, it would at least be nice for a placeholder to be returned, so that the list of embeddings is the same length as the list of tokens.
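In the meantime, the placeholder idea can be approximated in post-processing. The sketch below is only an illustration, not part of Spark NLP: it assumes that for each row you can recover both the token strings and (token, vector) pairs from the embeddings annotations (in Spark NLP's annotation schema, the result field of a word-embeddings annotation holds the token text), and it inserts None wherever a token received no vector. The name align_with_placeholders is hypothetical, and the greedy match can misalign if a dropped token's text repeats nearby.

from typing import List, Optional, Tuple

def align_with_placeholders(
    tokens: List[str],
    embedded: List[Tuple[str, List[float]]],
) -> List[Optional[List[float]]]:
    """Return one entry per token: its vector, or None if the token was skipped."""
    aligned = []
    i = 0  # cursor into the (token, vector) pairs
    for tok in tokens:
        if i < len(embedded) and embedded[i][0] == tok:
            aligned.append(embedded[i][1])
            i += 1
        else:
            aligned.append(None)  # token was dropped by the embedding stage
    return aligned

# Toy example: the middle token has no embedding, so it gets a placeholder.
vecs = align_with_placeholders(
    ["gas", "…", "gallon"],
    [("gas", [0.1, 0.2]), ("gallon", [0.3, 0.4])],
)
assert vecs == [[0.1, 0.2], None, [0.3, 0.4]]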

Steps to Reproduce

Input document:

df = pd.DataFrame({'summary': ['Why is this man smiling? President Obama’s chosen successor suffered a devastating loss last week to a man who made a primary campaign issue of Obama’s “disastrous” management of the country. The Democratic Party is in a shambles, outnumbered in state legislatures, governors’ mansions, the House and the Senate. Conservative control of the Supreme Court seems likely for another generation. Obama’s legacy is in tatters, as his trade policy, his foreign policy and his beloved Obamacare are set to be dismantled. And yet when Obama entered the White House briefing room for a post-election news conference Monday afternoon, everything was, if not awesome, then pretty darned good. “We are indisputably in a stronger position today than we were when I came in eight years ago,” he began. “Jobs have been growing for 73 straight months, incomes are rising, poverty is falling, the uninsured rate is at the lowest level on record, carbon emissions have come down without impinging on our growth…” The happy talk kept coming: “Unemployment rate is low as it has been in eight, nine years, incomes and wages have both gone up over the last year faster than they have in a decade or two… The financial systems are stable. The stock market is hovering around its all-time high and 401(k)s have been restored. The housing market has recovered… We are seeing significant progress in Iraq. .. Our alliances are in strong shape. ..And gas is two bucks a gallon.” It’s all true enough. But Obama’s post-election remarks seemed utterly at odds with the national mood. Half the country is exultant because Donald Trump has promised to undo everything Obama has done over the last year. The other half of the country is alarmed that a new age of bigotry and inwardness has seized the country. And here’s the outgoing president, reciting what a fine job he has done. This has been Obama’s pattern. At times when passion is called for, he’s cerebral and philosophical and taking the long view — so long that it frustrates those living in the present. A week after an election has left his supporters reeling, Obama’s focus seemed to be squarely on his own legacy. He didn’t mention Hillary Clinton’s name once in his news conference, and he went out of his way to praise Trump. On a day when the country was digesting the news that Trump has named as his top White House strategist Stephen K. Bannon, a man who has boasted of his ties to the racist “alt-right,” Obama was generous to the “carnival barker” who led the campaign questioning his American birth. Of the Bannon appointment, Obama said “it would not be appropriate for me to comment,” and “those who didn’t vote for him have to recognize that that’s how democracy works.” Of Trump himself, Obama noted “his gifts that obviously allowed him to execute one of the biggest political upsets in history.” He praised Trump as “gregarious” and “pragmatic,” a man who favors “a vigorous debate” and was “impressive” during the campaign. “That connection that he was able to make with his supporters,” Obama said, was “powerful stuff.” Obama’s above-the-fray response to the election result may well be that of a man who believes his approach will be vindicated by history. It may well be, but that is of little comfort now. As Obama retires to a life of speaking fees and good works, he sounded less concerned about what will happen next than with what he had achieved — including a mention, for those who forgot, that he won the Iowa caucus in 2008. 
He took a bow for his “smartest, hardest-working” staff, his “good decisions,” the absence of “significant scandal” during his tenure. And he speculated that Trump would ultimately find it wise to leave intact the key achievements of his administration: Obamacare, the Iran nuclear deal, the Paris climate accord, trade and immigration. The deep disenchantment among white, blue-collar voters that propelled Trump won only a passing mention. “Obviously there are people out there who are feeling deeply disaffected,” the president said with his cool detachment. In an election this close — Clinton, let’s not forget, won the popular vote — any factor could have made the difference: being a candidate of the establishment in a time of change, resistance to a woman as president and backlash against the first black president, and James Comey’s last-minute intervention in the election. But millions of Americans are justifiably anxious about their economic well-being. And if Clinton and Obama had limited the build-on-success theme during the campaign in favor of a more populist vision and policies, they really would have something to smile about this week. Twitter: @Milbank Read more from Dana Milbank’s archive, follow him on Twitter or subscribe to his updates on Facebook.']})



# Imports and Spark session for the snippet (sparknlp.start() is one way to get
# a session with Spark NLP; reuse your own SparkSession if you already have one)
import pandas as pd
import sparknlp
from sparknlp import base as sb
from sparknlp import annotator as sa

spark = sparknlp.start()

documenter = (
    sb.DocumentAssembler()
        .setInputCol("summary")
        .setOutputCol("document")
)

sentencer = (
    sa.SentenceDetector()
        .setInputCols(["document"])
        .setOutputCol("sentences")            
)

tokenizer = (
    sa.Tokenizer()
        .setInputCols(["sentences"])
        .setOutputCol("token")
)

word_embeddings = (
    sa.BertEmbeddings
        .load('s3://aspangher/spark-nlp/small_bert_L4_128_en_2.6.0_2.4')
        .setInputCols(["sentences", "token"])
        .setOutputCol("embeddings")
        .setMaxSentenceLength(512)
        .setBatchSize(100)
)

tok_finisher = (
    sb.Finisher()
    .setInputCols(["token"])
    .setIncludeMetadata(True)
)

embeddings_finisher = (
    sb.EmbeddingsFinisher()
        .setInputCols("embeddings")
        .setOutputCols("embeddings_vectors")
        .setOutputAsVector(True)
)

sparknlp_processing_pipeline = sb.RecursivePipeline(stages=[
    documenter,
    sentencer,
    tokenizer,
    word_embeddings,
    embeddings_finisher,
    tok_finisher
  ]
)

sdf = spark.createDataFrame(df)
spark_processed_df = sparknlp_processing_pipeline.fit(sdf).transform(sdf)
t = spark_processed_df.toPandas()
len(t['embeddings_vectors'].iloc[0])
>>> 906

len(t['finished_token_metadata'][0])
>>> 916
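As a side note, the mismatch can be surfaced directly in Spark before collecting to pandas. This is a rough sketch assuming the pipeline above and the Finisher's default output column name finished_token; adjust the column names if yours differ.

from pyspark.sql import functions as F

# Compare the per-row token count with the per-row embedding count.
(spark_processed_df
    .select(
        F.size("finished_token").alias("n_tokens"),
        F.size("embeddings_vectors").alias("n_embeddings"))
    .withColumn("n_missing", F.col("n_tokens") - F.col("n_embeddings"))
    .show())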

Context

I am trying to match sentences by overall word similarity, so I am zipping tokens together with their embeddings. Because the number of embeddings differs from the number of tokens, the last few tokens end up paired with a None embedding.
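Concretely, the failure mode looks like the toy sketch below (plain Python with made-up two-dimensional vectors): zip_longest pads the shorter embeddings list with None, so the trailing tokens lose their vectors.

from itertools import zip_longest

tokens = ["alliances", "are", "in", "strong", "shape", "…"]
vectors = [[0.1, 0.0], [0.2, 0.1], [0.3, 0.2], [0.4, 0.3], [0.5, 0.4]]  # one short

pairs = list(zip_longest(tokens, vectors, fillvalue=None))
print(pairs[-1])  # ('…', None): the last token has nothing to pair with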

Your Environment

  • Spark NLP version (sparknlp.version()): com.johnsnowlabs.nlp:spark-nlp_2.11:2.7.5

  • Apache Spark version (spark.version): 2.4.7-ds-0.5

  • Setup and installation (PyPI, Conda, Maven, etc.): Spark NLP via Maven

Issue Analytics

  • State: closed
  • Created: 2 years ago
  • Comments: 10 (5 by maintainers)

Top GitHub Comments

1 reaction
alex2awesome commented, Sep 22, 2021

This is great —

My suggestion is to somehow bake this into the Tokenizer annotator. This took hours to diagnose on my end before I opened this issue, and I was lucky to have been looking at an example where the problem showed up at all. I don't think any user building their pipeline is going to know in advance that this will be an issue.

I think that when the Tokenizer doesn't know how to annotate a character, it should throw an error or a warning and output a NaN or something else that downstream components know how to deal with. The fact that the Tokenizer silently drops these characters probably causes a lot of bugs for people doing word-level analysis like I was (i.e. needing to match embeddings to words).
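Until something like that lands in the annotators, a pipeline-level guard can at least make the silent drop loud. The snippet below is only a sketch of that idea using the finished columns from the report above; check_alignment is a hypothetical helper, not a Spark NLP API.

import warnings

def check_alignment(tokens, vectors):
    """Warn when the embedding stage returned fewer vectors than tokens."""
    if len(vectors) != len(tokens):
        warnings.warn(
            f"{len(tokens) - len(vectors)} token(s) have no embedding "
            f"({len(tokens)} tokens vs. {len(vectors)} vectors); "
            "word-level matching downstream will be misaligned."
        )

# e.g., with the pandas output t from the original report:
# for _, row in t.iterrows():
#     check_alignment(row["finished_token"], row["embeddings_vectors"])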

1 reaction
alex2awesome commented, Sep 16, 2021

Ackkk, I'm so sorry. The rows in t (the output from Spark) were not in the same order as the rows in df (my input), so I accidentally sent you text from the wrong input row. That explains why I was seeing a different token count (888).

Please try again with the following text:

df = pd.DataFrame({'summary': ['Why is this man smiling? President Obama’s chosen successor suffered a devastating loss last week to a man who made a primary campaign issue of Obama’s “disastrous” management of the country. The Democratic Party is in a shambles, outnumbered in state legislatures, governors’ mansions, the House and the Senate. Conservative control of the Supreme Court seems likely for another generation. Obama’s legacy is in tatters, as his trade policy, his foreign policy and his beloved Obamacare are set to be dismantled. And yet when Obama entered the White House briefing room for a post-election news conference Monday afternoon, everything was, if not awesome, then pretty darned good. “We are indisputably in a stronger position today than we were when I came in eight years ago,” he began. “Jobs have been growing for 73 straight months, incomes are rising, poverty is falling, the uninsured rate is at the lowest level on record, carbon emissions have come down without impinging on our growth .\u2009.\u2009.” The happy talk kept coming: “Unemployment rate is low as it has been in eight, nine years, incomes and wages have both gone up over the last year faster than they have in a decade or two. .\u2009.\u2009. The financial systems are stable. The stock market is hovering around its all-time high and 401(k)s have been restored. The housing market has recovered. .\u2009.\u2009. We are seeing significant progress in Iraq .\u2009.\u2009. Our alliances are in strong shape. .\u2009.\u2009. And gas is two bucks a gallon.” It’s all true enough. But Obama’s post-election remarks seemed utterly at odds with the national mood. Half the country is exultant because Donald Trump has promised to undo everything Obama has done over the past eight years. The other half of the country is alarmed that a new age of bigotry and inwardness has seized the country. And here’s the outgoing president, reciting what a fine job he has done. This has been Obama’s pattern. At times when passion is called for, he’s cerebral and philosophical and taking the long view — so long that it frustrates those living in the present. A week after an election has left his supporters reeling, Obama’s focus seemed to be squarely on his own legacy. He didn’t mention Hillary Clinton’s name once in his news conference, and he went out of his way to praise Trump. On a day when the country was digesting the news that Trump has named as his top White House strategist Stephen K. Bannon, a man who has boasted of his ties to the racist “alt-right,” Obama was generous to the “carnival barker” who led the campaign questioning his American birth. Of the Bannon appointment, Obama said “it would not be appropriate for me to comment,” and “those who didn’t vote for him have to recognize that that’s how democracy works.” Of Trump himself, Obama noted “his gifts that obviously allowed him to execute one of the biggest political upsets in history.” He praised Trump as “gregarious” and “pragmatic,” a man who favors “a vigorous debate” and was “impressive” during the campaign. “That connection that he was able to make with his supporters,” Obama said, was “powerful stuff.” Obama’s above-the-fray response to the election result may well be that of a man who believes his approach will be vindicated by history. It may well be, but that is of little comfort now. 
As Obama retires to a life of speaking fees and good works, he sounded less concerned about what will happen next than with what he had achieved — including a mention, for those who forgot, that he won the Iowa caucuses in 2008. He took a bow for his “smartest, hardest-working” staff, his “good decisions,” the absence of “significant scandal” during his tenure. And he speculated that Trump would ultimately find it wise to leave intact the key achievements of his administration: Obamacare, the Iran nuclear deal, the Paris climate accord, trade and immigration. The deep disenchantment among white, blue-collar voters that propelled Trump won only a passing mention. “Obviously there are people out there who are feeling deeply disaffected,” the president said with his cool detachment. In an election this close — Clinton, let’s not forget, won the popular vote — any factor could have made the difference: being a candidate of the establishment in a time of change, resistance to a woman as president and backlash against the first black president, and FBI Director James B. Comey’s last-minute intervention in the election. But millions of Americans are justifiably anxious about their economic well-being. And if Clinton and Obama had limited the build-on-success theme during the campaign in favor of a more populist vision and policies, they really would have something to smile about this week. Twitter: @Milbank Read more from Dana Milbank’s archive, follow him on Twitter or subscribe to his updates on Facebook.']})

I am able to replicate my earlier reported Python results in your Colab notebook with this input.
