Text embedding: `Failed to embed text` on wikipedia pages
Follow up to https://github.com/brave/brave-browser/issues/23424
Steps to Reproduce
- Run Brave with the following flags:
--enable-logging=stderr --vmodule=text_embedding_processor=9,embedding_processing=9,text_embedding_html_events=9 --enable-features=TextEmbedding
- Enable rewards and ads
- Restart Brave
- Open
https://en.wikipedia.org/wiki/Svante_P%C3%A4%C3%A4bo
- Check logs
Actual result:
[6876:6876:1009/142934.888115:VERBOSE1:text_embedding_processor.cc(61)] Failed to embed text
Note: the processing works on other pages, e.g. interia.pl
Expected result:
The text is embedded successfully.
Reproduces how often:
Easily reproduced
Brave version (brave://version info)
Brave | 1.45.90 Chromium: 106.0.5249.103 (Official Build) beta (64-bit) |
---|---|
Revision | 182570408a1f25ab2731ef5f283b918df9b9f956-refs/branch-heads/5249_91@{#6} |
OS | Ubuntu 18.04 LTS |
Top GitHub Comments
@tmancey I have no issue with adjusting the messaging. I can ask @LorenzoMinto to include it in the PR he currently has open that uses the embeddings to guide ad serving.
@btlechowski @tmancey `Failed to embed text` is the expected result here. Embeddings are created from the words in the meta tag with property `og:title`. In the example shared above, that would be:
`<meta property="og:title" content="Svante Pääbo - Wikipedia">`
The words `svante`, `pääbo`, and `wikipedia` are not in the vocabulary of the current word-embedding mapping. The following log output shows this:
[73708:259:1010/225649.460944:VERBOSE9:embedding_processing.cc(88)] svante - text embedding token not found in resource vocabulary
[73708:259:1010/225649.460995:VERBOSE9:embedding_processing.cc(88)] pääbo - text embedding token not found in resource vocabulary
[73708:259:1010/225649.461020:VERBOSE9:embedding_processing.cc(88)] wikipedia - text embedding token not found in resource vocabulary
Since none of the words can be embedded, embedding fails in this situation. For working examples that do embed text, see:
https://en.wikipedia.org/wiki/Mathematics
https://en.wikipedia.org/wiki/Law
https://en.wikipedia.org/wiki/History_of_science
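
To make the behaviour described above concrete, here is a minimal Python sketch of the idea. It is not Brave's actual C++ implementation. The names `OgTitleParser`, `tokenize`, and `embed_og_title`, the toy vocabulary, and the choice to average the token vectors are all assumptions made for illustration. The comment above only states that words from `og:title` are looked up in a vocabulary and that embedding fails when none are found.

```python
from html.parser import HTMLParser
import re


class OgTitleParser(HTMLParser):
    """Collects the content of the first <meta property="og:title"> tag."""

    def __init__(self):
        super().__init__()
        self.og_title = None

    def handle_starttag(self, tag, attrs):
        if tag != "meta" or self.og_title is not None:
            return
        attributes = dict(attrs)
        if attributes.get("property") == "og:title":
            self.og_title = attributes.get("content")


def tokenize(text):
    # Lowercase, then split on runs of non-word characters (Unicode-aware),
    # so "Svante Pääbo - Wikipedia" yields three tokens.
    return [token for token in re.split(r"[\W_]+", text.lower()) if token]


def embed_og_title(html, vocabulary):
    """Return the mean vector of in-vocabulary tokens, or None if there are none.

    Averaging is an assumption of this sketch, not a documented Brave detail.
    """
    parser = OgTitleParser()
    parser.feed(html)
    if not parser.og_title:
        return None
    vectors = [vocabulary[t] for t in tokenize(parser.og_title) if t in vocabulary]
    if not vectors:
        return None  # analogous to the "Failed to embed text" log line
    dimension = len(vectors[0])
    return [sum(v[i] for v in vectors) / len(vectors) for i in range(dimension)]


# Hypothetical two-dimensional vocabulary, for illustration only.
toy_vocabulary = {"mathematics": [1.0, 3.0], "law": [3.0, 1.0]}

page = '<meta property="og:title" content="Svante Pääbo - Wikipedia">'
if embed_og_title(page, toy_vocabulary) is None:
    print("Failed to embed text")
```

With this toy vocabulary, the Svante Pääbo title produces no embedding because none of its three tokens are present, while a title containing `mathematics` or `law` would.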