Query with asymmetric matching for accented characters
Use case
OK - let me admit it before I start… I am a dumb native English speaker and I don’t like letters with accents and don’t even know how to enter most of them on my keyboard. Yep, I know: the rest of the world despises me.
However, for people like me (and, to be fair, some other cases where accented letter use varies even between speakers of the same or very closely related languages, or has differed at various historical times), it would be very nice to have the optional ability to do simplified comparisons when searching in beets.
Today, in beets, if I use `beet ls -a dvořák` I get 5 matches in my database. However, if I use `beet ls -a dvorak` I get no matches. Although I support MusicBrainz’s requirement to use correct spelling, as a non-speaker of the language I would have to search online to look up the correct accents to use. It would be useful if beets allowed (perhaps optionally, based on a config setting) less strict searching. The case is even worse with Noel, which appears in various places in my database in both the forms Noël and Noel - similar problems exist for other partly non-naturalised words which are sometimes spelt in English with accents and sometimes not.
Solution
A popular and powerful approach, standardised by the Unicode Consortium, is called asymmetric matching. In this scheme, users choose a language (e.g. English) and write search terms using the characters of that language: the system then matches accented letters not present in that language against their unaccented forms in the query. However, if the user actually enters an accent in the search term, then only the same accented letter matches. The link above includes some examples of using the term “resume” to match the many words spelt “resume” but with accents on various different letters.
This solution feels very natural and is easy for both speakers and non-speakers of the language to use. Unfortunately, I have not found an existing Python implementation (but I haven’t looked particularly hard).
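To make the idea concrete, here is a minimal sketch of asymmetric matching in Python using only the standard library. This is not an existing beets or Unicode implementation, just an illustration of the rule described above: an unaccented query character matches any accented form, but an accented query character matches only itself.

```python
import unicodedata

def strip_accents(s: str) -> str:
    """Remove combining marks, e.g. 'dvořák' -> 'dvorak'."""
    return "".join(
        ch for ch in unicodedata.normalize("NFD", s)
        if not unicodedata.combining(ch)
    )

def asymmetric_match(query: str, target: str) -> bool:
    """Whole-string asymmetric comparison (case-insensitive).

    Each query character matches its exact target character, or,
    if the query character carries no accent, any accented form of it.
    """
    if len(query) != len(target):
        return False
    for q, t in zip(query, target):
        if q.lower() == t.lower():
            continue  # exact (case-insensitive) match
        # an unaccented query char may match an accented target char
        if q == strip_accents(q) and q.lower() == strip_accents(t).lower():
            continue
        return False
    return True

# 'dvorak' finds 'Dvořák', but 'dvořák' does not find 'dvorak'
print(asymmetric_match("dvorak", "Dvořák"))   # True
print(asymmetric_match("dvořák", "dvorak"))   # False
print(asymmetric_match("Noel", "Noël"))       # True
```

A real query implementation would apply this per-character rule inside substring search rather than whole-string comparison, but the asymmetry is the same.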
Alternatives
Regex searches can provide some of this capability, but they require particular fields to be specified and require the use of the arcane regex format. For example, the Dvořák search would probably need to be entered as `artist::Dvo.a.` and requires me to know which letters are accented. It would also miss cases where the composer is mentioned in the title or albumtitle but not as an artist.
Issue Analytics
- Created: 3 years ago
- Comments: 7 (7 by maintainers)
Top GitHub Comments
The PR #3883 has now been accepted into beets - thanks to all who helped me with that. I think the `bareasc` plugin provides a good-enough solution so I don’t plan to work on a more formal approach. So, I will close this issue now.

Apart from my idiocy last night with using `==` instead of `in` and then writing a test which worked anyway 😃 the bareasc plugin seems to work. And it seems to meet my needs. What do others think?

I did a quick-and-dirty performance test… on my database (currently around 3200 tracks), a simple `beet ls xyzzy` (which does not exist) takes around 270ms; a `beet ls \#xyzzy` takes around 560ms. These are fairly consistent and remain so even for longer strings. It would be interesting to see what others find - for example with significantly larger databases.

If it turns out that doing the comparison in sqlite is actually faster, then I can imagine a way to make that happen: every database record write could add an extra internal column which contains ALL the text, from all fields, just appended together into one string and put through `lower` and `unidecode`. The initial match check could be done in SQL against that column and only if that matches is the proper, field-specific match check done in the python code.
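The extra-column idea above can be sketched with the standard library alone. Note the assumptions: the table layout and column names here are invented for illustration (beets' real schema differs), and since `unidecode` is a third-party package, this uses a `unicodedata`-based stand-in that strips combining accents (it handles ř and ë, though not letters such as ø that `unidecode` would also transliterate).

```python
import sqlite3
import unicodedata

def bare_ascii(s: str) -> str:
    """Lowercase and drop combining marks (stdlib stand-in for unidecode)."""
    return "".join(
        ch for ch in unicodedata.normalize("NFD", s.lower())
        if not unicodedata.combining(ch)
    )

conn = sqlite3.connect(":memory:")
# hypothetical schema: real fields plus one denormalised search column
conn.execute("CREATE TABLE items (artist TEXT, title TEXT, bare_all TEXT)")

def insert(artist: str, title: str) -> None:
    # on every write, append all text fields and normalise them once
    bare = bare_ascii(artist + " " + title)
    conn.execute("INSERT INTO items VALUES (?, ?, ?)", (artist, title, bare))

insert("Antonín Dvořák", "Symphony No. 9")
insert("Noël Coward", "Mad Dogs and Englishmen")

# coarse prefilter in SQL; a real implementation would then re-run the
# proper, field-specific match in Python on the surviving rows
rows = conn.execute(
    "SELECT artist FROM items WHERE bare_all LIKE ?",
    ("%" + bare_ascii("dvorak") + "%",),
).fetchall()
print(rows)  # [('Antonín Dvořák',)]
```

The trade-off is classic denormalisation: the extra column costs storage and a little work on every write, in exchange for letting sqlite discard most non-matching rows before any Python-level comparison runs.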