URL previews include Unicode punycode, which causes issues with URLs in the text body

Description

https://github.com/vector-im/element-web/issues/23432

Synapse injects (5o0a) punycode into the og:description of Twitter URLs, leading to broken links.

Steps to reproduce

Forwarded Element issue:

Steps to reproduce

  1. Post a Twitter URL into a room with URL previews enabled.
  2. The Twitter post content is put into double quotes ("").
  3. In the case of media-only posts, the t.co media URL ends up inside the quotes.
  4. Element includes the closing quote in the URL, resulting in a broken link.
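
Steps 2 to 4 can be reproduced with a deliberately naive URL matcher. This is only a sketch, not Element's actual linkifier: `NAIVE_URL`, `QUOTE_CHARS`, and both helper functions are hypothetical names, and the example assumes the preview text wraps the link in curly quotes (U+201C and U+201D).

```python
import re

# Hypothetical, deliberately naive matcher: a URL runs until the next whitespace.
NAIVE_URL = re.compile(r"https?://\S+")

# Quote characters that may wrap a URL in preview text (assumed set:
# ASCII quotes plus the curly variants).
QUOTE_CHARS = "\"'\u201c\u201d\u2018\u2019"


def find_urls_naive(text: str) -> list[str]:
    """Return URLs exactly as the naive pattern matches them."""
    return NAIVE_URL.findall(text)


def find_urls_trimmed(text: str) -> list[str]:
    """Return URLs with any trailing quote characters removed."""
    return [url.rstrip(QUOTE_CHARS) for url in NAIVE_URL.findall(text)]


# Assumed shape of the description for a media-only tweet: the t.co link in quotes.
description = "\u201chttps://t.co/fVP8YWHS2j\u201d"

broken = find_urls_naive(description)    # closing quote becomes part of the URL
fixed = find_urls_trimmed(description)   # closing quote is stripped
```

With the naive pattern the closing quote is treated as part of the path, which matches the broken link described in step 4; trimming trailing quote characters is one possible client-side mitigation.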

Outcome

What did you expect?

links work

What happened instead?

image

Demo URLs for reference: https://twitter.com/FXNetworks/status/1577704289476128771 https://twitter.com/mischiefanimals/status/1576904037449969664

Operating system

arch

Application version

Element Nightly version: 2022100501 Olm version: 3.2.12

How did you install the app?

aur

Homeserver

private

Synapse Version

1.68.0

Installation Method

Docker (matrixdotorg/synapse)

Platform

debian, matrix-docker-ansible-deploy

Relevant log output

n/a

Anything else that would be useful to know?

curl "https://publish.twitter.com/oembed?url=https://twitter.com/mischiefanimals/status/1576904037449969664" | jq .
  % Total    % Received % Xferd  Average Speed   Time    Time     Time  Current
                                 Dload  Upload   Total   Spent    Left  Speed
100   853  100   853    0     0   3946      0 --:--:-- --:--:-- --:--:--  4140
{
  "url": "https://twitter.com/mischiefanimals/status/1576904037449969664",
  "author_name": "animals going goblin mode",
  "author_url": "https://twitter.com/mischiefanimals",
  "html": "<blockquote class=\"twitter-tweet\"><p lang=\"zxx\" dir=\"ltr\"><a href=\"https://t.co/fVP8YWHS2j\">pic.twitter.com/fVP8YWHS2j</a></p>&mdash; animals going goblin mode (@mischiefanimals) <a href=\"https://twitter.com/mischiefanimals/status/1576904037449969664?ref_src=twsrc%5Etfw\">October 3, 2022</a></blockquote>\n<script async src=\"https://platform.twitter.com/widgets.js\" charset=\"utf-8\"></script>\n",
  "width": 550,
  "height": null,
  "type": "rich",
  "cache_age": "3153600000",
  "provider_name": "Twitter",
  "provider_url": "https://twitter.com",
  "version": "1.0"
}
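
For comparison, the `html` field of the oEmbed response above carries the link in a plain `<a href>` with no surrounding quotes. The following is a minimal sketch using only Python's standard library; `TweetParagraph` and `extract_tweet` are hypothetical helpers, not part of Synapse, and `OEMBED_HTML` is a shortened copy of the `html` value shown above.

```python
from html.parser import HTMLParser

# Shortened copy of the "html" field from the oEmbed response above.
OEMBED_HTML = (
    '<blockquote class="twitter-tweet"><p lang="zxx" dir="ltr">'
    '<a href="https://t.co/fVP8YWHS2j">pic.twitter.com/fVP8YWHS2j</a></p>'
    "&mdash; animals going goblin mode (@mischiefanimals)</blockquote>"
)


class TweetParagraph(HTMLParser):
    """Collect the text and link targets found inside <p> elements."""

    def __init__(self) -> None:
        super().__init__()
        self.in_paragraph = False
        self.text_parts: list[str] = []
        self.links: list[str] = []

    def handle_starttag(self, tag, attrs):
        if tag == "p":
            self.in_paragraph = True
        elif tag == "a" and self.in_paragraph:
            href = dict(attrs).get("href")
            if href:
                self.links.append(href)

    def handle_endtag(self, tag):
        if tag == "p":
            self.in_paragraph = False

    def handle_data(self, data):
        if self.in_paragraph:
            self.text_parts.append(data)


def extract_tweet(html: str) -> tuple[str, list[str]]:
    """Return (tweet text, list of hrefs) from an oEmbed HTML snippet."""
    parser = TweetParagraph()
    parser.feed(html)
    return "".join(parser.text_parts).strip(), parser.links


text, links = extract_tweet(OEMBED_HTML)
```

For this media-only tweet the paragraph contains nothing but the t.co link, which is why the whole description ends up being a single quoted URL.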

https://user-images.githubusercontent.com/2403652/194146661-044758f9-fefd-4744-9ef2-bd3aec094d40.png

Issue Analytics

  • State: open
  • Created a year ago
  • Comments: 5 (5 by maintainers)

Top GitHub Comments

clokep commented, Oct 6, 2022

I downgraded to tolerable since you can easily click the link being previewed, and uncommon since I think this is an odd example where a URL was given as the only tweet content (which seems a bit weird – but do shout if this seems incorrect).

clokep commented, Oct 6, 2022

We should be preferring this, see:

Ah, I see what’s going on – Twitter doesn’t have oEmbed autodiscovery enabled, so we are only scraping the HTML in this case.
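
For context, oEmbed autodiscovery relies on the page advertising its endpoint through a `<link rel="alternate">` element whose type is `application/json+oembed` or `text/xml+oembed`. A rough sketch of such a check follows; `find_oembed_endpoint` is a hypothetical helper, the example.com URL is a placeholder, and this is not how Synapse itself is implemented.

```python
from html.parser import HTMLParser
from typing import Optional

# MIME types used for discovery links in the oEmbed specification.
OEMBED_TYPES = {"application/json+oembed", "text/xml+oembed"}


class _OEmbedLinkFinder(HTMLParser):
    """Remember the href of the first oEmbed discovery <link> element."""

    def __init__(self) -> None:
        super().__init__()
        self.endpoint: Optional[str] = None

    def handle_starttag(self, tag, attrs):
        if tag != "link" or self.endpoint is not None:
            return
        attributes = {name: (value or "") for name, value in attrs}
        rels = attributes.get("rel", "").lower().split()
        link_type = attributes.get("type", "").lower()
        if "alternate" in rels and link_type in OEMBED_TYPES:
            self.endpoint = attributes.get("href") or None


def find_oembed_endpoint(page_html: str) -> Optional[str]:
    """Return the advertised oEmbed endpoint, or None if the page has none."""
    finder = _OEmbedLinkFinder()
    finder.feed(page_html)
    return finder.endpoint


# Placeholder pages: one advertises an endpoint, the other does not.
page_with_link = (
    '<html><head><link rel="alternate" type="application/json+oembed" '
    'href="https://example.com/oembed?url=x"></head></html>'
)
page_without_link = "<html><head><title>no discovery</title></head></html>"
```

A page like `page_without_link` yields `None`, which corresponds to the situation described here: without a discovery link, only the scraped HTML metadata is available.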

If we were to add the Twitter URLs back to the providers.json then we would only fetch the oEmbed, which would lose the image preview.

We could always scrape the given URL, but also check oEmbed info if available. Would be reasonable in terms of “try to find the most info possible”, but would result in duplicate queries in some situations… it would treat autodiscovery of oEmbed more similar to the hard-coded providers list though. 🤷
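
The "scrape, but also check oEmbed" idea could look roughly like the sketch below. It is only an illustration of the merge step under stated assumptions: `merge_preview` is a hypothetical function, the dictionary keys and the image URL are illustrative placeholders, and the choice to let oEmbed values win is one possible policy rather than a decided design.

```python
from typing import Optional


def merge_preview(
    scraped: dict[str, Optional[str]],
    oembed: dict[str, Optional[str]],
) -> dict[str, Optional[str]]:
    """Combine scraped metadata with oEmbed metadata.

    oEmbed values override scraped ones, but only when they are present,
    so fields that only the scrape provides (such as the image) survive.
    """
    merged = dict(scraped)
    for key, value in oembed.items():
        if value is not None:
            merged[key] = value
    return merged


# Illustrative inputs: the scrape has an image and a quoted description,
# the oEmbed lookup has a clean description but no image.
scraped = {
    "og:image": "https://example.com/preview.jpg",
    "og:description": "\u201chttps://t.co/fVP8YWHS2j\u201d",
}
oembed = {
    "og:image": None,
    "og:description": "pic.twitter.com/fVP8YWHS2j",
}

preview = merge_preview(scraped, oembed)
```

The cost noted above still applies: both sources have to be fetched, so some URLs would trigger two requests instead of one.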
