lyrics: Tolerate more tags when gathering lyrics texts in Google backend
Problem
I finally got beets working again under Linux Mint 20.2 with the recently-released version beets 1.5.0 (thank you to everyone involved with the beets team that helped make that long-awaited update happen!). I am a user of the Lyrics plug-in, and took the opportunity to obtain a Google API key to help improve the lyrics scraping.
I’ve imported a handful of albums so far, and while most of the functionality, including lyrics scraping, is working well, there are times when portions of the lyrics are being missed for certain songs. Though I continue to investigate this, the testing I have been able to do so far (combined with the verbose (-vv) flag) seems to indicate this primarily happens with AZLyrics (www.azlyrics.com).
One album I have been using that consistently reproduces the error is Molly Hatchet’s first album (self-titled “Molly Hatchet”). Verbose mode of the command output shows that Google used AZLyrics for most songs, with SongLyrics (www.songlyrics.com) being used in a few cases. For the lyrics being provided from AZLyrics, a few songs did not have any issues at all:
- Gator Country
- Dreams I’ll Never See
However, quite a few had only a portion of the complete lyrics, as shown in Puddletag. These include:
- Bounty Hunter
- The Creeper
- The Price You Pay
- I’ll Be Running
- Cheatin’ Woman
- Trust Your Old Friend
I haven’t tried searching through the source code on GitHub, but one thing I noticed anecdotally is that the songs having this problem seemed to have identifiers within the lyrics that used brackets (“[” and “]”) to denote a subset of the lyrics, such as “[Chorus:]” or “[LEAD BREAK]”. See The Price You Pay as an example. With that song, beets returned only the lyrics that follow the “[LEAD BREAK]” identifier, starting with:
I shot a man in Macon over a poker game,
I killed another in Atlanta just to build my fame,
...
The 2 songs on the album not encountering the lyrics problem did not have any such bracketed-identifiers in their lyrics. If I can provide any documentation in addition to the configuration file (below), please let me know what might help.
Setup
- OS: Linux Mint 20.2
- Python version: 3.8.10
- beets version: 1.5.0
- Turning off plugins made problem go away (yes/no): (N/A, since lyrics are provided by a plug-in)
My configuration (output of beet config) is:
directory: ~/Music
library: ~/data/musiclibrary.blb
id3v23: yes
plugins: lyrics fetchart zero embedart scrub
import:
    move: yes
    write: yes
    log: /home/dan/Music/import-log.txt
    timid: yes
zero:
    fields: comments day
lyrics:
    force: yes
    google_API_key: <My key here - not included to avoid potential hacking>
    sources: musixmatch google
embedart:
    maxwidth: 300
    remove_art_file: yes
Issue Analytics
- Created 2 years ago
- Comments:16 (6 by maintainers)
Top GitHub Comments
I should probably add some meta-commentary here: the Google backend, unlike the other lyrics backends in this plugin, is necessarily very heuristic. That is, there will never be a consistent set of rules that works on all lyrics pages on the Web; all we can do is our best: iteratively improve the parser when we run across examples that need help and are fixable. But there are an infinite number of such fixes that are possible and perfection, in the end, is unattainable.
So, I imported a song off of the album that doesn’t encounter the problem, and as I think was expected from analysis above, the entire lyrics’ text was contained within a single string. That’s why the “stripped_strings” function was successful, because there was only one text blob in competition (actually, there was a 2nd string identifying the album, song, and AZLyrics.com site, which was quite small in comparison).
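The selection step described in this comment can be illustrated with a minimal Python sketch. It assumes the backend simply keeps the longest of the strings extracted from the page; `pick_lyrics` and the sample strings are hypothetical and are not taken from the beets source.

```python
def pick_lyrics(strings):
    """Return the longest non-empty string, or None if there is none.

    Hypothetical helper: it mimics a "largest text blob wins" heuristic.
    """
    candidates = [s.strip() for s in strings if s.strip()]
    return max(candidates, key=len) if candidates else None


# Made-up strings standing in for what a lyrics page might yield:
# one short credit line and one large block holding all of the lyrics.
credit = "Some Artist - Some Song | example lyrics site"
lyrics = "\n".join(f"placeholder lyric line number {i}" for i in range(1, 9))

chosen = pick_lyrics([credit, lyrics])
print(chosen == lyrics)  # the large block wins over the short credit line
```

The heuristic only returns complete lyrics when the whole text arrives as a single string; it has no way to tell that several medium-sized strings belong together.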
I analyzed lyrics for the song I’ve been using to reproduce the problem, and the lyrics definitely are broken up into 6 different blobs (not counting the album/song/AZLyrics string) before the “stripped_strings” function is invoked. In looking at the lyrics at the azlyrics.com web site, the points at which the lyrics separate into the 6 blobs occur with the bracketed text I mentioned in my original post (for example, where [Chorus:] or [LEAD BREAK] appear within the lyrics’ text).
I think there could be 2 approaches: Either preceding code could be tweaked to try and prevent the lyrics from splitting up, or if that proves difficult, logic could be added to combine the resulting blobs into one string. I think the former option would probably be the preferred one. In this particular case, the bracketed text is surrounded by italics elements (<i> and </i>), so those tags could be the cause rather than the brackets.
I’ve hit my limit for tonight, but plan to next see if I can find where and how the lyrics’ fracturing happens.
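To make the two approaches concrete, here is a rough, standard-library-only sketch. It does not use BeautifulSoup or any beets code: `TextRuns`, `text_runs`, `longest_run`, `unwrap_italics`, `join_runs` and the `PAGE` fragment are all hypothetical names and data, and the parser only imitates the idea that every tag boundary starts a new text blob.

```python
import re
from html.parser import HTMLParser


class TextRuns(HTMLParser):
    """Collect each run of text found between tags, whitespace-stripped."""

    def __init__(self):
        super().__init__()
        self.runs = []

    def handle_data(self, data):
        text = data.strip()
        if text:
            self.runs.append(text)


def text_runs(html):
    """Return the list of text blobs in `html`, split at tag boundaries."""
    parser = TextRuns()
    parser.feed(html)
    parser.close()
    return parser.runs


def longest_run(html):
    """The "largest blob wins" heuristic applied to the text runs."""
    return max(text_runs(html), key=len)


def unwrap_italics(html):
    """Approach 1: drop <i> and </i> so they no longer split the text."""
    return re.sub(r"</?i\b[^>]*>", "", html, flags=re.IGNORECASE)


def join_runs(html):
    """Approach 2: glue the blobs back together after extraction."""
    return "\n".join(text_runs(html))


# Made-up fragment: section markers wrapped in italics, as described above.
PAGE = (
    "<div>first verse line one\nfirst verse line two\n"
    "<i>[Chorus:]</i>\nchorus line\n"
    "<i>[LEAD BREAK]</i>\n"
    "last verse line one\nlast verse line two\nlast verse line three</div>"
)

print(len(text_runs(PAGE)))                  # several separate blobs
print(longest_run(PAGE))                     # only the final verse survives
print(len(text_runs(unwrap_italics(PAGE))))  # one blob once the tags are gone
print(join_runs(PAGE) == longest_run(unwrap_italics(PAGE)))
```

On this toy input both approaches recover the same text. Joining every blob is the riskier of the two on a real page, since it would also pull in unrelated strings such as the album/song credit line, which is one reason removing the offending tags before extraction looks like the cleaner option.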