Query with asymmetric matching for accented characters

Use case

OK - let me admit it before I start… I am a dumb native English speaker and I don’t like letters with accents and don’t even know how to enter most of them on my keyboard. Yep, I know: the rest of the world despises me.

However, for people like me (and, to be fair, some other cases where accented letter use varies even between speakers of the same or very closely related languages, or has differed at various historical times), it would be very nice to have the optional ability to do simplified comparisons when searching in beets.

Today, in beets, if I use beet ls -a dvořák I get 5 matches in my database. However, if I use beet ls -a dvorak I get no matches. Although I support MusicBrainz’s requirement to use correct spelling, as a non-speaker of the language I would have to search online to look up the correct accents to use. It would be useful if beets allowed (perhaps optionally, based on a config setting) less strict searching. The case is even worse with Noel, which appears in various places in my database in both the forms Noël and Noel - similar problems exist for other partly non-naturalised words which are sometimes spelt in English with accents and sometimes not.

Solution

A popular and powerful approach, standardised by the Unicode Consortium, is called asymmetric matching. The user chooses a language (e.g. English) and writes search terms using the characters of that language; the system then lets each unaccented letter in the query match the accented variants of that letter which are not part of that language. However, if the user actually enters an accent in the search term, then only the same accented letter matches. The linked reference includes some examples of using the term “resume” to match the many words spelt “resume” but with accents on various different letters.

This solution feels very natural and is easy for both speakers and non-speakers of the language to use. Unfortunately, I have not found an existing Python implementation (but I haven’t looked particularly hard).
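
The rule described above can be sketched in a few lines of Python. This is only a minimal illustration under stated assumptions, not an implementation of the Unicode specification and not how beets behaves: the function names are hypothetical, an "accent" is approximated as a Unicode combining mark, and the language-specific part (which accented letters count as native to the user's language) is ignored.

```python
import unicodedata


def strip_accents(ch: str) -> str:
    """Drop combining marks from a string (approximation of 'remove accents')."""
    decomposed = unicodedata.normalize("NFD", ch)
    return "".join(c for c in decomposed if not unicodedata.combining(c))


def char_matches(q: str, t: str) -> bool:
    """Asymmetric rule: a bare query letter matches accented variants,
    but an accented query letter only matches itself."""
    if q == t:
        return True
    return strip_accents(q) == q and strip_accents(t) == q


def asymmetric_contains(query: str, text: str) -> bool:
    """Case-insensitive substring search using the asymmetric rule."""
    # NFC keeps each accented letter as a single code point where possible,
    # so query and text can be compared position by position.
    q = unicodedata.normalize("NFC", query).lower()
    t = unicodedata.normalize("NFC", text).lower()
    n = len(q)
    return any(
        all(char_matches(qc, tc) for qc, tc in zip(q, t[i:i + n]))
        for i in range(len(t) - n + 1)
    )
```

With this sketch, `asymmetric_contains("dvorak", "Antonín Dvořák")` is true, while `asymmetric_contains("dvořák", "Dvorak")` is false because the query contains accents that the text lacks. Letters with no combining-mark decomposition (such as "ø") are not folded by this approximation.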

Alternatives

Regex searches can provide some of this capability, but they require particular fields to be specified and require the use of the arcane regex format. For example, the Dvorak search would probably need to be entered as `artist::Dvo..k`, which requires me to know which letters are accented. It would also miss cases where the composer is mentioned in the title or albumtitle but not as an artist.
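
To illustrate how fragile the regex workaround is, here is a small sketch using Python's `re` module directly. It assumes substring-style matching via `re.search` and text stored in composed (NFC) form; both are assumptions for the example, not statements about beets internals.

```python
import re
import unicodedata

# Normalise to NFC so each accented letter is a single character.
name = unicodedata.normalize("NFC", "Antonín Dvořák")

# Wildcards on the two accented letters (ř and á): matches.
right_positions = re.search(r"Dvo..k", name) is not None

# Wildcards on the wrong letters: the literal "a" cannot match "á".
wrong_positions = re.search(r"Dvo.a.", name) is not None

# The wildcard pattern is also looser than intended.
false_positive = re.search(r"Dvo..k", "Dvoxyk") is not None
```

Here `right_positions` is true and `wrong_positions` is false, so the searcher still has to know exactly which letters carry accents, and the pattern also accepts unrelated strings such as "Dvoxyk".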

Issue Analytics

State:
Created 3 years ago
Comments: 7 (7 by maintainers)

Top GitHub Comments

GrahamCobb commented, Mar 17, 2021

The PR #3883 has now been accepted into beets - thanks to all who helped me with that. I think the bareasc plugin provides a good-enough solution so I don’t plan to work on a more formal approach. So, I will close this issue now.

GrahamCobb commented, Mar 15, 2021

Apart from my idiocy last night with using == instead of in and then writing a test which worked anyway 😃 the bareasc plugin seems to work. And it seems to meet my needs. What do others think?

I did a quick-and-dirty performance test… on my database (currently around 3200 tracks), a simple beet ls xyzzy (which does not exist) takes around 270ms; a beet ls \#xyzzy takes around 560ms. These are fairly consistent and remain so even for longer strings. It would be interesting to see what others find - for example with significantly larger databases.

If it turns out that doing the comparison in sqlite is actually faster, then I can imagine a way to make that happen: every database record write could add an extra internal column which contains ALL the text, from all fields, just appended together into one string and put through lower and unidecode. The initial match check could be done in SQL against that column and only if that matches is the proper, field-specific match check done in the python code.
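
The two-stage idea in the previous paragraph could look roughly like the sketch below. It is an illustration under assumptions, not beets code: the `items` table, the `folded_all` column, and the `fold`, `add_item` and `search` helpers are all hypothetical, and `fold` uses the standard library's `unicodedata` as a stand-in for `lower` plus `unidecode`. Fields are joined with a space rather than appended directly.

```python
import sqlite3
import unicodedata

FIELDS = ("artist", "title")


def fold(text: str) -> str:
    """Stand-in for lower() + unidecode: lowercase and drop combining marks."""
    decomposed = unicodedata.normalize("NFKD", text.lower())
    return "".join(c for c in decomposed if not unicodedata.combining(c))


conn = sqlite3.connect(":memory:")
conn.row_factory = sqlite3.Row
conn.execute(
    "CREATE TABLE items ("
    "id INTEGER PRIMARY KEY, artist TEXT, title TEXT, folded_all TEXT)"
)


def add_item(artist: str, title: str) -> None:
    """Every write also stores the folded text of all fields in one column."""
    folded_all = fold(" ".join([artist, title]))
    conn.execute(
        "INSERT INTO items (artist, title, folded_all) VALUES (?, ?, ?)",
        (artist, title, folded_all),
    )


def search(term: str, field=None) -> list:
    """Stage 1: coarse filter in SQL. Stage 2: field-specific check in Python."""
    needle = fold(term)
    rows = conn.execute(
        "SELECT id, artist, title FROM items "
        "WHERE instr(folded_all, ?) > 0 ORDER BY id",
        (needle,),
    ).fetchall()
    fields = (field,) if field else FIELDS
    return [r["id"] for r in rows if any(needle in fold(r[f]) for f in fields)]


add_item("Antonín Dvořák", "Symphony No. 9")
add_item("Various", "Noël Nouvelet")
add_item("Choir", "Dvorak favourites")
```

In this sketch `search("dvorak")` returns the ids of the first and third rows, while `search("dvorak", "artist")` returns only the first, because the second stage rejects the row where the name appears in the title. Whether this is actually faster than the pure-Python comparison would still need to be measured.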
