URL previews include Unicode punycode, which causes issues with URLs in the text body
Description
https://github.com/vector-im/element-web/issues/23432
Synapse injects “ (5o0a) punycode into og:description of Twitter URLs, leading to broken links.
Steps to reproduce
Forwarded Element issue:
Steps to reproduce
- Post a twitter URL into a room, URL previews enabled
- Twitter post content is put into double quotes ("")
- In case of media-only posts, the t.co media URL is in quotes
- Element includes the ending quote in the URL, resulting in a broken link
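To illustrate the last step, here is a minimal sketch of how a link matcher can swallow the closing quote. It is not Element's actual linkifier: the regex, the `TRAILING_JUNK` set and both function names are assumptions made up for this example.

```python
import re

# Deliberately naive: a URL is "http(s)://" followed by any run of non-whitespace.
NAIVE_URL = re.compile(r"https?://\S+")

# Characters that are rarely the intended last character of a URL
# (curly and straight quotes plus common sentence punctuation).
TRAILING_JUNK = "\u201c\u201d\u2018\u2019\"'.,;:!"


def naive_links(text):
    """Return every match of the naive pattern, trailing quote included."""
    return NAIVE_URL.findall(text)


def trimmed_links(text):
    """Same matches, with trailing quote/punctuation characters stripped."""
    return [url.rstrip(TRAILING_JUNK) for url in NAIVE_URL.findall(text)]


# A media-only description as described above: the t.co URL wrapped in curly quotes.
description = "\u201chttps://t.co/fVP8YWHS2j\u201d"

print(naive_links(description))    # the closing quote is part of the match
print(trimmed_links(description))  # the closing quote is removed
```

Trimming trailing characters is only one possible mitigation; a stricter URL grammar on the client, or not wrapping the description in quotes in the first place, would address the same symptom.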
Outcome
What did you expect?
links work
What happened instead?
Demo URLs for reference: https://twitter.com/FXNetworks/status/1577704289476128771 https://twitter.com/mischiefanimals/status/1576904037449969664
Operating system
arch
Application version
Element Nightly version: 2022100501 Olm version: 3.2.12
How did you install the app?
aur
Homeserver
private
Synapse Version
1.68.0
Installation Method
Docker (matrixdotorg/synapse)
Platform
debian, matrix-docker-ansible-deploy
Relevant log output
n/a
Anything else that would be useful to know?
curl "https://publish.twitter.com/oembed?url=https://twitter.com/mischiefanimals/status/1576904037449969664" | jq .
% Total % Received % Xferd Average Speed Time Time Time Current
Dload Upload Total Spent Left Speed
100 853 100 853 0 0 3946 0 --:--:-- --:--:-- --:--:-- 4140
{
"url": "https://twitter.com/mischiefanimals/status/1576904037449969664",
"author_name": "animals going goblin mode",
"author_url": "https://twitter.com/mischiefanimals",
"html": "<blockquote class=\"twitter-tweet\"><p lang=\"zxx\" dir=\"ltr\"><a href=\"https://t.co/fVP8YWHS2j\">pic.twitter.com/fVP8YWHS2j</a></p>— animals going goblin mode (@mischiefanimals) <a href=\"https://twitter.com/mischiefanimals/status/1576904037449969664?ref_src=twsrc%5Etfw\">October 3, 2022</a></blockquote>\n<script async src=\"https://platform.twitter.com/widgets.js\" charset=\"utf-8\"></script>\n",
"width": 550,
"height": null,
"type": "rich",
"cache_age": "3153600000",
"provider_name": "Twitter",
"provider_url": "https://twitter.com",
"version": "1.0"
}
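For comparison, a small sketch of pulling the link targets out of the `html` field of an oEmbed response like the one above, using only the Python standard library. The `LinkCollector` class and `oembed_links` function are hypothetical names for this example, and the sample string is an abbreviated copy of the `html` value shown.

```python
from html.parser import HTMLParser


class LinkCollector(HTMLParser):
    """Collects the href value of every <a> tag fed to the parser."""

    def __init__(self):
        super().__init__()
        self.links = []

    def handle_starttag(self, tag, attrs):
        if tag == "a":
            for name, value in attrs:
                if name == "href" and value:
                    self.links.append(value)


def oembed_links(html):
    """Return all link targets found in an oEmbed html fragment."""
    collector = LinkCollector()
    collector.feed(html)
    collector.close()
    return collector.links


# Abbreviated from the "html" field in the response above.
sample_html = (
    '<blockquote class="twitter-tweet"><p lang="zxx" dir="ltr">'
    '<a href="https://t.co/fVP8YWHS2j">pic.twitter.com/fVP8YWHS2j</a>'
    "</p></blockquote>"
)

print(oembed_links(sample_html))
```

In this sample the `href` carries no surrounding quote characters.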
https://user-images.githubusercontent.com/2403652/194146661-044758f9-fefd-4744-9ef2-bd3aec094d40.png
Issue Analytics
- State:
- Created a year ago
- Comments:5 (5 by maintainers)
I downgraded to tolerable since you can easily click the link being previewed, and uncommon since I think this is an odd example where a URL was given as the only tweet content (which seems a bit weird – but do shout if this seems incorrect).
Ah, I see what’s going on – Twitter doesn’t have oEmbed autodiscovery enabled, so we are only scraping the HTML in this case.
If we were to add the Twitter URLs back to the providers.json, then we would only fetch the oEmbed, which would lose the image preview.
We could always scrape the given URL, but also check oEmbed info if available. That would be reasonable in terms of “try to find the most info possible”, but would result in duplicate queries in some situations… it would treat autodiscovery of oEmbed more similarly to the hard-coded providers list though. 🤷
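A rough sketch of the "scrape, then also check oEmbed" idea, to make the trade-off concrete. This is not Synapse's implementation: `merge_preview` is a hypothetical helper, the dictionaries stand in for whatever the two fetches return, and letting the scraped values win is just one possible policy.

```python
from typing import Dict, Optional

Preview = Dict[str, Optional[str]]


def merge_preview(scraped: Preview, oembed: Optional[Preview]) -> Preview:
    """Keep every scraped value and fill only the gaps from oEmbed."""
    merged = dict(scraped)
    if oembed:
        for key, value in oembed.items():
            # Only use the oEmbed value when scraping produced nothing for this key.
            if value and not merged.get(key):
                merged[key] = value
    return merged


# Illustrative values only; example.invalid is a placeholder host.
scraped = {
    "og:title": "Scraped title",
    "og:image": "https://example.invalid/image.png",
    "og:description": None,
}
oembed = {
    "og:title": "oEmbed title",
    "og:description": "Text taken from oEmbed",
}

print(merge_preview(scraped, oembed))
```

With this policy the image found by scraping survives and the missing description is filled from oEmbed, at the cost of the second request mentioned above.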