Skip to content

Conversation

@grillofran
Copy link

@grillofran grillofran commented Dec 23, 2025

Problem

The lyrics plugin has two bugs that corrupt fetched lyrics:

  1. Unicode corruption: Characters like ò, è, à are corrupted to √≤, √®, etc.
  2. Escaped quotes: Quotes appear as \" instead of " in lyrics

Root Causes

Issue 1: MacRoman encoding misdetection

  • Location: RequestHandler.fetch_text() line 220
  • Cause: Setting r.encoding = None forces requests to use apparent_encoding
  • Problem: For Genius.com (and others), requests incorrectly detects MacRoman instead of UTF-8
  • Result: UTF-8 bytes c3 b2 (ò) decoded as MacRoman produces "√≤" (U+221A U+2264)

Issue 2: Incomplete JSON unescape

  • Location: Genius.scrape() line 576
  • Cause: The remove_backslash regex doesn't handle all escape patterns in JSON
  • Problem: Genius embeds lyrics in JSON with patterns like \\" and \\\\"
  • Result: After BeautifulSoup processing, escaped quotes remain in final text

Solution

Fix 1: Trust server encoding, fallback to UTF-8

# OLD: r.encoding = None
# NEW:
if not r.encoding:
    r.encoding = 'utf-8'
  • Respects server's declared encoding (UTF-8 for Genius)
  • Falls back to UTF-8 if no encoding specified (safer than apparent_encoding)
  • Preserves original intent of handling misconfigured servers

Fix 2: Iteratively clean escaped quotes

while '\\"' in lyrics:
    lyrics = lyrics.replace('\\"', '"')
  • Handles variable escape levels (\", \\\", \\\\\")
  • Minimal change - keeps original remove_backslash regex
  • Applied after BeautifulSoup to avoid interfering with HTML parsing

Testing

Tested with:

  • Caparezza - "Argenti Vive" (Italian, many accented characters)
  • WestsideGunn - "Heel Cena" (escaped quotes in lyrics)

Before:

mi si parò davanti
\\"I got big moves\\"

After:

mi si parò davanti
"I got big moves"

Impact

  • Fixes lyrics for all languages with non-ASCII characters
  • Fixes Genius lyrics with quotes
  • No breaking changes - maintains backward compatibility
  • Minimal code changes (14 lines total)

## Problem
The lyrics plugin has two bugs that corrupt fetched lyrics:

1. **Unicode corruption**: Characters like `ò`, `è`, `à` are corrupted to `√≤`, `√®`, etc.
2. **Escaped quotes**: Quotes appear as `\"` instead of `"` in lyrics

## Root Causes

### Issue 1: MacRoman encoding misdetection
- **Location**: `RequestHandler.fetch_text()` line 220
- **Cause**: Setting `r.encoding = None` forces requests to use `apparent_encoding`
- **Problem**: For Genius.com (and others), requests incorrectly detects MacRoman instead of UTF-8
- **Result**: UTF-8 bytes `c3 b2` (ò) decoded as MacRoman produces "√≤" (U+221A U+2264)

### Issue 2: Incomplete JSON unescape
- **Location**: `Genius.scrape()` line 576
- **Cause**: The `remove_backslash` regex doesn't handle all escape patterns in JSON
- **Problem**: Genius embeds lyrics in JSON with patterns like `\\"` and `\\\\"` 
- **Result**: After BeautifulSoup processing, escaped quotes remain in final text

## Solution

### Fix 1: Trust server encoding, fallback to UTF-8
```python
# OLD: r.encoding = None
# NEW:
if not r.encoding:
    r.encoding = 'utf-8'
```
- Respects server's declared encoding (UTF-8 for Genius)
- Falls back to UTF-8 if no encoding specified (safer than apparent_encoding)
- Preserves original intent of handling misconfigured servers

### Fix 2: Iteratively clean escaped quotes
```python
while '\\"' in lyrics:
    lyrics = lyrics.replace('\\"', '"')
```
- Handles variable escape levels (`\"`, `\\\"`, `\\\\\"`)
- Minimal change - keeps original `remove_backslash` regex
- Applied after BeautifulSoup to avoid interfering with HTML parsing

## Testing

Tested with:
- Caparezza - "Argenti Vive" (Italian, many accented characters)
- WestsideGunn - "Heel Cena" (escaped quotes in lyrics)

Before:
```
mi si parò davanti
\\"I got big moves\\"
```

After:
```
mi si parò davanti
"I got big moves"
```

## Impact
- Fixes lyrics for all languages with non-ASCII characters
- Fixes Genius lyrics with quotes
- No breaking changes - maintains backward compatibility
- Minimal code changes (14 lines total)
@grillofran grillofran requested a review from a team as a code owner December 23, 2025 20:31
@github-actions
Copy link

Thank you for the PR! The changelog has not been updated, so here is a friendly reminder to check if you need to add an entry.

Copy link
Contributor

@sourcery-ai sourcery-ai bot left a comment

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

Hey - I've left some high level feedback:

  • The triple-quoted string added after r = self.get(...) is currently just a no-op string expression; consider converting it to a regular # comment or a proper docstring on the method for clarity and to avoid confusion.
  • The while '\"' in lyrics loop could be made more robust and efficient by either limiting the number of iterations or using a regex / unescape utility (e.g., json.loads on a quoted string fragment) to normalize escape sequences in one pass rather than repeatedly scanning and replacing.
Prompt for AI Agents
Please address the comments from this code review:

## Overall Comments
- The triple-quoted string added after `r = self.get(...)` is currently just a no-op string expression; consider converting it to a regular `#` comment or a proper docstring on the method for clarity and to avoid confusion.
- The `while '\"' in lyrics` loop could be made more robust and efficient by either limiting the number of iterations or using a regex / unescape utility (e.g., `json.loads` on a quoted string fragment) to normalize escape sequences in one pass rather than repeatedly scanning and replacing.

Sourcery is free for open source - if you like our reviews please consider sharing them ✨
Help me be more useful! Please click 👍 or 👎 on each comment and I'll use the feedback to improve your reviews.

## Key Improvements (addressing reviewer feedback):
- ✅ Uses regex instead of while loop (more efficient, one pass)
- ✅ No infinite loop risk
- ✅ Handles any number of backslashes before quotes
- ✅ Clear inline comments
  In the fetch_text method (line ~207):
  if not r.encoding:
      r.encoding = "utf-8"  # ← Double quotes!

  In the scrape method, make sure there are 2 blank lines before the next class:
          return None


  class Tekstowo(SearchBackend):  # ← Two blank lines above

  That should pass the formatting check! The repo follows PEP 8 style (double quotes, 2 blank lines between classes).
@codecov
Copy link

codecov bot commented Dec 23, 2025

Codecov Report

❌ Patch coverage is 62.50000% with 3 lines in your changes missing coverage. Please review.
✅ Project coverage is 68.41%. Comparing base (e048909) to head (4cfb1e3).
✅ All tests successful. No failed tests found.

Files with missing lines Patch % Lines
beetsplug/lyrics.py 62.50% 2 Missing and 1 partial ⚠️
Additional details and impacted files
@@            Coverage Diff             @@
##           master    #6233      +/-   ##
==========================================
+ Coverage   68.22%   68.41%   +0.18%     
==========================================
  Files         138      138              
  Lines       18792    18798       +6     
  Branches     3167     3168       +1     
==========================================
+ Hits        12821    12860      +39     
+ Misses       5298     5264      -34     
- Partials      673      674       +1     
Files with missing lines Coverage Δ
beetsplug/lyrics.py 84.69% <62.50%> (+7.00%) ⬆️
🚀 New features to boost your workflow:
  • 📦 JS Bundle Analysis: Save yourself from yourself by tracking and limiting bundle sizes in JS merges.

@grillofran
Copy link
Author

Is the 66% patch coverage requirement strict for bug fixes? The changes are small edge-case improvements. Happy to add tests if required!

Copy link
Contributor

@semohr semohr left a comment

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

Thank you for the PR. Since this seems to be your first contribution to beets, you might want to have a look at our developer guides. Welcome aboard!

Is the 66% patch coverage requirement strict for bug fixes? The changes are small edge-case improvements. Happy to add tests if required!

While it is not a requirement, for new code tests are always highly appreciated. Here for example showing that JSON encoded lyrics can be parsed as expected would be a good addition. Same for the inverse case showing that "normal" html lyrics still work as expected would be good (I guess the test already exists, havent look at our testsuit 😁 )


More of a general question: how come this was never raised as an issue? Seems like it should be a common issue to me. Have you tried to find related open issues yet?

html_text = cls.remove_backslash(m[0]).replace(r"\n", "\n")
return cls.get_soup(html_text).get_text().strip()
lyrics = cls.get_soup(html_text).get_text().strip()
# Clean up any remaining escaped quotes (may need multiple passes)
Copy link
Contributor

@semohr semohr Dec 24, 2025

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

Comments should be minimal and focus on the why rather than the how:

#  Strip HTML and normalize quotes that were left because of JSON...
return re.sub(r'\\+"', '"', lyrics)

It should explain why the escaped quotes exist without repeating how the regex works. You have quite some information on this in your PR, maybe we can condense it into a sentence?

@snejus
Copy link
Member

snejus commented Dec 24, 2025

Tested with:

* Caparezza - "Argenti Vive" (Italian, many accented characters)

* WestsideGunn - "Heel Cena" (escaped quotes in lyrics)

Are you happy to add both cases to test/plugins/lyrics_pages.py?

@grillofran grillofran requested a review from semohr December 26, 2025 20:05
@grillofran
Copy link
Author

grillofran commented Dec 26, 2025

Thanks for the feedback! I've updated the PR to address all your points.

  • Made the encoding fix Genius-only with a force_utf8 parameter (thanks @snejus)
  • Simplified the encoding check to r.encoding = r.encoding or "utf-8" (thanks @semohr)
  • Updated the comment to focus on why the escapes exist
  • Added test cases for Caparezza (Italian accents) and Arctic Monkeys (quoted lyrics), as the other one was... too explicit

Funny enough.... I couldn't find any issues reporting this in the repo's issue tracker, which I thought was strange given how many people must use Genius. Also, this is my first contribution to beets so apologies if I missed anything. Still learning the ropes! Let me know if there's anything else that needs adjusting.

@grillofran
Copy link
Author

grillofran commented Dec 31, 2025

Hey folks (happy holidays)

Ah, got it. I think I had misunderstood. I added the else: r.encoding = None to restore the original behavior for Google/AZLyrics. Should fix the Chinese character issue in the test. Setting r.encoding = None explicitly should fix that while keeping the UTF-8 fix for Genius only. Thoughts?

Thanks!

Comment on lines 576 to +580
html_text = cls.remove_backslash(m[0]).replace(r"\n", "\n")
return cls.get_soup(html_text).get_text().strip()
lyrics = cls.get_soup(html_text).get_text().strip()
# Genius embeds lyrics in JSON; escape sequences remain after parsing
lyrics = re.sub(r'\\+"', '"', lyrics)
return lyrics
Copy link
Member

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

Now that we've sorted the encoding issue, I wonder whether this could possibly be solved by improving the pattern used by cls.remove_backslash?

Sign up for free to join this conversation on GitHub. Already have an account? Sign in to comment

Labels

None yet

Projects

None yet

Development

Successfully merging this pull request may close these issues.

3 participants