Search user does not work with some specific Vietnamese letters

Description

If you search for a user whose name contains certain accented letters (like “á”), the suggestions become empty as soon as you type the letter that follows the accented character.

See also the attached video demonstrating the problem.

Steps to reproduce

Please refer to the attached video.

Homeserver

matrix.org

Synapse Version

{"server_version":"1.66.0rc1 (b=matrix-org-hotfixes,ce8f7d118c)","python_version":"3.8.12"}

Installation Method

No response

Platform

app.element.io as web client, matrix.org as homeserver.

https://user-images.githubusercontent.com/66737707/187195948-2c1f1aae-b72b-4686-a716-3df89cea020f.mp4

Relevant log output

We have also reproduced the problem on our own debug server. The following lines are the relevant part of the Synapse server log:

2022-08-29 14:03:19,983 - synapse.storage.txn - 795 - DEBUG - expire_url_cache_data-763 - [TXN END] {get_url_cache_media_before-17c4} 0.001260 sec
2022-08-29 14:03:19,983 - synapse.rest.media.v1.preview_url_resource - 840 - DEBUG - expire_url_cache_data-763 - No media removed from url preview cache
2022-08-29 14:03:19,990 - synapse.storage.TIME - 602 - DEBUG - sentinel - Total database time: 0.092% {_prune_old_user_ips(2): 0.038%, _update_client_ips_batch(1): 0.031%, get_url_cache_media_before(1): 0.013%}
2022-08-29 14:03:21,645 - synapse.access.http.8008 - 405 - DEBUG - GET-2345 - ::ffff:127.0.0.1 - 8008 - Received request: GET /health
2022-08-29 14:03:21,646 - synapse.access.http.8008 - 450 - DEBUG - GET-2345 - ::ffff:127.0.0.1 - 8008 - {None} Processed request: 0.000sec/-0.000sec (0.000sec, 0.000sec) (0.000sec/0.000sec/0) 2B 200 "GET /health HTTP/1.1" "curl/7.74.0" [0 dbevts]
2022-08-29 14:03:23,330 - synapse.http.site - 533 - WARNING - sentinel - forwarded request lacks an x-forwarded-proto header: assuming https
2022-08-29 14:03:23,330 - synapse.access.http.8008 - 405 - DEBUG - OPTIONS-2346 - 185.150.4.97 - 8008 - Received request: OPTIONS /_matrix/client/r0/user_directory/search
2022-08-29 14:03:23,331 - synapse.access.http.8008 - 450 - DEBUG - OPTIONS-2346 - 185.150.4.97 - 8008 - {None} Processed request: 0.000sec/-0.000sec (0.000sec, 0.000sec) (0.000sec/0.000sec/0) 0B 204 "OPTIONS /_matrix/client/r0/user_directory/search HTTP/1.0" "Mozilla/5.0 (Windows NT 10.0; Win64; x64) AppleWebKit/537.36 (KHTML, like Gecko) Chrome/104.0.0.0 Safari/537.36" [0 dbevts]
2022-08-29 14:03:23,337 - synapse.http.site - 533 - WARNING - sentinel - forwarded request lacks an x-forwarded-proto header: assuming https
2022-08-29 14:03:23,337 - synapse.access.http.8008 - 405 - DEBUG - POST-2347 - 185.150.4.97 - 8008 - Received request: POST /_matrix/client/r0/user_directory/search
2022-08-29 14:03:23,338 - synapse.storage.txn - 691 - DEBUG - POST-2347 - [TXN START] {search_user_dir-17c5}
2022-08-29 14:03:23,338 - synapse.storage.SQL - 409 - DEBUG - POST-2347 - [SQL] {search_user_dir-17c5} SELECT d.user_id AS user_id, display_name, avatar_url FROM user_directory_search as t INNER JOIN user_directory AS d USING (user_id) WHERE user_id != ? AND vector @@ to_tsquery('simple', ?) ORDER BY (CASE WHEN d.user_id IS NOT NULL THEN 4.0 ELSE 1.0 END) * (CASE WHEN display_name IS NOT NULL THEN 1.2 ELSE 1.0 END) * (CASE WHEN avatar_url IS NOT NULL THEN 1.2 ELSE 1.0 END) * ( 3 * ts_rank_cd( '{0.1, 0.1, 0.9, 1.0}', vector, to_tsquery('simple', ?), 8 ) + ts_rank_cd( '{0.1, 0.1, 0.9, 1.0}', vector, to_tsquery('simple', ?), 8 ) ) DESC, display_name IS NULL, avatar_url IS NULL LIMIT ?
2022-08-29 14:03:23,339 - synapse.storage.SQL - 417 - DEBUG - POST-2347 - [SQL values] {search_user_dir-17c5} ('@2_cb874bcb1f1b5219:anconnect-server-dev107.aarenet.com', '(Gi:* | Gi)', 'Gi', 'Gi:*', 11)
2022-08-29 14:03:23,341 - synapse.storage.SQL - 438 - DEBUG - POST-2347 - [SQL time] {search_user_dir-17c5} 0.002775 sec
2022-08-29 14:03:23,342 - synapse.storage.txn - 795 - DEBUG - POST-2347 - [TXN END] {search_user_dir-17c5} 0.003450 sec
2022-08-29 14:03:23,343 - synapse.access.http.8008 - 450 - INFO - POST-2347 - 185.150.4.97 - 8008 - {@2_cb874bcb1f1b5219:anconnect-server-dev107.aarenet.com} Processed request: 0.005sec/0.001sec (0.002sec, 0.000sec) (0.000sec/0.003sec/1) 155B 200 "POST /_matrix/client/r0/user_directory/search HTTP/1.0" "Mozilla/5.0 (Windows NT 10.0; Win64; x64) AppleWebKit/537.36 (KHTML, like Gecko) Chrome/104.0.0.0 Safari/537.36" [0 dbevts]
2022-08-29 14:03:23,414 - synapse.metrics._gc - 118 - DEBUG - sentinel - Collecting gc 0
2022-08-29 14:03:24,654 - synapse.storage.txn - 691 - DEBUG - prune_old_user_ips-1528 - [TXN START] {_prune_old_user_ips-17c6}
2022-08-29 14:03:24,654 - synapse.storage.SQL - 409 - DEBUG - prune_old_user_ips-1528 - [SQL] {_prune_old_user_ips-17c6} DELETE FROM user_ips WHERE last_seen <= ( SELECT COALESCE(MAX(last_seen), -1) FROM ( SELECT last_seen FROM user_ips WHERE last_seen <= ? ORDER BY last_seen ASC LIMIT 5000 ) AS u )
2022-08-29 14:03:24,655 - synapse.storage.SQL - 417 - DEBUG - prune_old_user_ips-1528 - [SQL values] {_prune_old_user_ips-17c6} (1659355404653,)
2022-08-29 14:03:24,655 - synapse.storage.SQL - 438 - DEBUG - prune_old_user_ips-1528 - [SQL time] {_prune_old_user_ips-17c6} 0.000770 sec
2022-08-29 14:03:24,656 - synapse.storage.txn - 795 - DEBUG - prune_old_user_ips-1528 - [TXN END] {_prune_old_user_ips-17c6} 0.001809 sec
2022-08-29 14:03:24,757 - synapse.http.site - 533 - WARNING - sentinel - forwarded request lacks an x-forwarded-proto header: assuming https
2022-08-29 14:03:24,758 - synapse.access.http.8008 - 405 - DEBUG - POST-2348 - 185.150.4.97 - 8008 - Received request: POST /_matrix/client/r0/user_directory/search
2022-08-29 14:03:24,759 - synapse.storage.txn - 691 - DEBUG - POST-2348 - [TXN START] {search_user_dir-17c7}
2022-08-29 14:03:24,759 - synapse.storage.SQL - 409 - DEBUG - POST-2348 - [SQL] {search_user_dir-17c7} SELECT d.user_id AS user_id, display_name, avatar_url FROM user_directory_search as t INNER JOIN user_directory AS d USING (user_id) WHERE user_id != ? AND vector @@ to_tsquery('simple', ?) ORDER BY (CASE WHEN d.user_id IS NOT NULL THEN 4.0 ELSE 1.0 END) * (CASE WHEN display_name IS NOT NULL THEN 1.2 ELSE 1.0 END) * (CASE WHEN avatar_url IS NOT NULL THEN 1.2 ELSE 1.0 END) * ( 3 * ts_rank_cd( '{0.1, 0.1, 0.9, 1.0}', vector, to_tsquery('simple', ?), 8 ) + ts_rank_cd( '{0.1, 0.1, 0.9, 1.0}', vector, to_tsquery('simple', ?), 8 ) ) DESC, display_name IS NULL, avatar_url IS NULL LIMIT ?
2022-08-29 14:03:24,759 - synapse.storage.SQL - 417 - DEBUG - POST-2348 - [SQL values] {search_user_dir-17c7} ('@2_cb874bcb1f1b5219:anconnect-server-dev107.aarenet.com', '(Gia:* | Gia)', 'Gia', 'Gia:*', 11)
2022-08-29 14:03:24,762 - synapse.storage.SQL - 438 - DEBUG - POST-2348 - [SQL time] {search_user_dir-17c7} 0.002633 sec
2022-08-29 14:03:24,762 - synapse.storage.txn - 795 - DEBUG - POST-2348 - [TXN END] {search_user_dir-17c7} 0.003215 sec
2022-08-29 14:03:24,763 - synapse.access.http.8008 - 450 - INFO - POST-2348 - 185.150.4.97 - 8008 - {@2_cb874bcb1f1b5219:anconnect-server-dev107.aarenet.com} Processed request: 0.005sec/0.000sec (0.002sec, 0.000sec) (0.000sec/0.003sec/1) 155B 200 "POST /_matrix/client/r0/user_directory/search HTTP/1.0" "Mozilla/5.0 (Windows NT 10.0; Win64; x64) AppleWebKit/537.36 (KHTML, like Gecko) Chrome/104.0.0.0 Safari/537.36" [0 dbevts]
2022-08-29 14:03:24,908 - synapse.handlers.typing - 100 - DEBUG - typing._handle_timeouts-1528 - Checking for typing timeouts
2022-08-29 14:03:24,908 - synapse.handlers.presence - 901 - DEBUG - handle_presence_timeouts-1522 - Handling presence timeouts
2022-08-29 14:03:24,908 - synapse.util.metrics - 163 - DEBUG - handle_presence_timeouts-1522 - Entering block presence_update_states
2022-08-29 14:03:24,909 - synapse.util.metrics - 176 - DEBUG - handle_presence_timeouts-1522 - Exiting block presence_update_states
2022-08-29 14:03:26,325 - synapse.http.site - 533 - WARNING - sentinel - forwarded request lacks an x-forwarded-proto header: assuming https
2022-08-29 14:03:26,326 - synapse.access.http.8008 - 405 - DEBUG - POST-2349 - 185.150.4.97 - 8008 - Received request: POST /_matrix/client/r0/user_directory/search
2022-08-29 14:03:26,327 - synapse.storage.txn - 691 - DEBUG - POST-2349 - [TXN START] {search_user_dir-17c8}
2022-08-29 14:03:26,327 - synapse.storage.SQL - 409 - DEBUG - POST-2349 - [SQL] {search_user_dir-17c8} SELECT d.user_id AS user_id, display_name, avatar_url FROM user_directory_search as t INNER JOIN user_directory AS d USING (user_id) WHERE user_id != ? AND vector @@ to_tsquery('simple', ?) ORDER BY (CASE WHEN d.user_id IS NOT NULL THEN 4.0 ELSE 1.0 END) * (CASE WHEN display_name IS NOT NULL THEN 1.2 ELSE 1.0 END) * (CASE WHEN avatar_url IS NOT NULL THEN 1.2 ELSE 1.0 END) * ( 3 * ts_rank_cd( '{0.1, 0.1, 0.9, 1.0}', vector, to_tsquery('simple', ?), 8 ) + ts_rank_cd( '{0.1, 0.1, 0.9, 1.0}', vector, to_tsquery('simple', ?), 8 ) ) DESC, display_name IS NULL, avatar_url IS NULL LIMIT ?
2022-08-29 14:03:26,328 - synapse.storage.SQL - 417 - DEBUG - POST-2349 - [SQL values] {search_user_dir-17c8} ('@2_cb874bcb1f1b5219:anconnect-server-dev107.aarenet.com', '(Gia:* | Gia) & (o:* | o)', 'Gia & o', 'Gia:* & o:*', 11)
2022-08-29 14:03:26,329 - synapse.storage.SQL - 438 - DEBUG - POST-2349 - [SQL time] {search_user_dir-17c8} 0.001021 sec
2022-08-29 14:03:26,329 - synapse.storage.txn - 795 - DEBUG - POST-2349 - [TXN END] {search_user_dir-17c8} 0.001580 sec
2022-08-29 14:03:26,330 - synapse.access.http.8008 - 450 - INFO - POST-2349 - 185.150.4.97 - 8008 - {@2_cb874bcb1f1b5219:anconnect-server-dev107.aarenet.com} Processed request: 0.004sec/0.001sec (0.002sec, 0.000sec) (0.001sec/0.002sec/1) 30B 200 "POST /_matrix/client/r0/user_directory/search HTTP/1.0" "Mozilla/5.0 (Windows NT 10.0; Win64; x64) AppleWebKit/537.36 (KHTML, like Gecko) Chrome/104.0.0.0 Safari/537.36" [0 dbevts]
2022-08-29 14:03:28,370 - synapse.storage.txn - 691 - DEBUG - _get_stats_for_federation_staging-254 - [TXN START] {_get_stats_for_federation_staging-17c9}
2022-08-29 14:03:28,370 - synapse.storage.SQL - 409 - DEBUG - _get_stats_for_federation_staging-254 - [SQL] {_get_stats_for_federation_staging-17c9} SELECT count(*) FROM federation_inbound_events_staging
2022-08-29 14:03:28,371 - synapse.storage.SQL - 438 - DEBUG - _get_stats_for_federation_staging-254 - [SQL time] {_get_stats_for_federation_staging-17c9} 0.000456 sec
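
The `[SQL values]` lines above show the search term being reduced to `Gia` and then split into `Gia & o` once the letter after the accent is typed. The following is a minimal sketch of how that can happen, assuming the name was typed with a combining acute accent (U+0301) and that words are extracted with the `re.findall(r"([\w\-]+)", ...)` pattern quoted in the comments further down. The function names `split_words` and `build_tsqueries` are hypothetical and are not Synapse's actual code; only the regular expression and the resulting query strings come from this report.

```python
import re
from typing import List, Tuple


def split_words(search_term: str) -> List[str]:
    # Pattern quoted in the issue comments. A combining mark such as
    # U+0301 is not matched by \w, so it ends the current word.
    return re.findall(r"([\w\-]+)", search_term, re.UNICODE)


def build_tsqueries(search_term: str) -> Tuple[str, str, str]:
    # Hypothetical helper: rebuilds the three to_tsquery arguments
    # in the order they appear in the "[SQL values]" log lines.
    words = split_words(search_term)
    full = " & ".join(f"({w}:* | {w})" for w in words)
    exact = " & ".join(words)
    prefix = " & ".join(f"{w}:*" for w in words)
    return full, exact, prefix


if __name__ == "__main__":
    # "Giáo" typed as G, i, a, COMBINING ACUTE ACCENT, o (an assumption
    # about the input; the log only shows the resulting query strings).
    for typed in ("Gi", "Gia\u0301", "Gia\u0301o"):
        print(repr(typed), build_tsqueries(typed))
```

Under that assumption the last call yields the same three strings as the final search request in the log, which would explain why a display name stored as a single word no longer matches.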

Anything else that would be useful to know?

No response

Issue Analytics

  • State: closed
  • Created a year ago
  • Reactions: 2
  • Comments: 10 (7 by maintainers)

Top GitHub Comments

1 reaction
belakalotay commented, Oct 12, 2022

Line 918 probably needs to accept \p{Mark} code points, except that this syntax isn’t supported by the default re library in Python. The regex package does support it, or re could be given code point ranges to match. There are probably other places in the code where \w is used with the same intent too.

To expand on this: the quick and dirty proposal is to use something like regex.findall(r"\b\w.*?\b", search_term, regex.WORD) instead of re.findall(r"([\w\-]+)", search_term, re.UNICODE) to identify whole words.

This will fix exact matches not working, but will not resolve #1523, where Gao or Gáo (a with acute accent) will not match Gáo (a followed by combining acute accent). The latter may or may not already work depending on what postgres does.

Note that this still performs poorly. There are languages whose words consist of a variable number of \w code points that do not have spaces between them. A much better solution is to integrate libicu’s word boundaries (https://unicode-org.github.io/icu/userguide/boundaryanalysis/#word-boundary), which is what chromium supposedly uses. In the comparison below, it can be seen that only icu does something vaguely reasonable for Japanese.

Output of test code comparing re, regex and icu:

Text: "It's a nice day outside."
    re.findall(r"([\w\-]+)"):    ['It', 's', 'a', 'nice', 'day', 'outside']
    re.findall(r"\b\w.*?\b"):    ['It', 's', 'a', 'nice', 'day', 'outside']
    regex.findall(r"\b\w.*?\b"): ["It's", 'a', 'nice', 'day', 'outside']
    icu:                         ["It's", ' ', 'a', ' ', 'nice', ' ', 'day', ' ', 'outside', '.']
Text: 'Received foo.png!'
    re.findall(r"([\w\-]+)"):    ['Received', 'foo', 'png']
    re.findall(r"\b\w.*?\b"):    ['Received', 'foo', 'png']
    regex.findall(r"\b\w.*?\b"): ['Received', 'foo.png']
    icu:                         ['Received', ' ', 'foo.png', '!']
Text: 'Gáo'
    re.findall(r"([\w\-]+)"):    ['Ga', 'o']
    re.findall(r"\b\w.*?\b"):    ['Ga', 'o']
    regex.findall(r"\b\w.*?\b"): ['Gáo']
    icu:                         ['Gáo']
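
For readers who want to reproduce the 'Gáo' row without installing regex or icu, here is a rough standard-library sketch of the "accept Mark code points" idea from the start of this comment. It assumes a word character is anything alphanumeric, `_`, `-`, or any code point whose Unicode general category starts with `M`. The helper names are hypothetical, and this is neither the regex-based proposal above nor whatever fix Synapse eventually adopted. The second helper shows NFC normalisation, which composes `a` + U+0301 into the single code point `á` when a precomposed form exists; that only helps matching if the indexed names are normalised the same way, and it does nothing for scripts written without spaces.

```python
import re
import unicodedata
from typing import List


def is_word_char(ch: str) -> bool:
    # Alphanumerics, "_" and "-" roughly mirror [\w\-]; additionally
    # accept combining marks (general categories Mn, Mc, Me).
    return ch.isalnum() or ch in "_-" or unicodedata.category(ch).startswith("M")


def split_words_keep_marks(text: str) -> List[str]:
    # Hypothetical tokenizer: groups consecutive word characters and
    # keeps combining marks attached to the word they appear in.
    words: List[str] = []
    current: List[str] = []
    for ch in text:
        if is_word_char(ch):
            current.append(ch)
        elif current:
            words.append("".join(current))
            current = []
    if current:
        words.append("".join(current))
    return words


def split_words_nfc(text: str) -> List[str]:
    # Alternative: compose characters first, then reuse the original pattern.
    composed = unicodedata.normalize("NFC", text)
    return re.findall(r"([\w\-]+)", composed, re.UNICODE)


if __name__ == "__main__":
    decomposed = "Ga\u0301o"  # 'a' followed by COMBINING ACUTE ACCENT
    print(re.findall(r"([\w\-]+)", decomposed, re.UNICODE))
    print(split_words_keep_marks(decomposed))
    print(split_words_nfc(decomposed))
```

Both helpers return a single word for the decomposed input, whereas the plain `re` pattern splits it in two. Like the `re` rows above, the mark-aware tokenizer still splits "It's" and "foo.png", so it only addresses the combining-mark case.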

We have tried this approach, but unfortunately it didn’t help.

1 reaction
DMRobertson commented, Sep 7, 2022

@reivilibre’s view is that it would be best if we can find a way to have postgres or some library handle all this for us.

Or even some external full-text search database. Lucene or something that uses it?
