Fix lyrics Unicode corruption and escaped quotes in Genius plugin #6233

grillofran · 2025-12-23T20:31:51Z

Problem

The lyrics plugin has two bugs that corrupt fetched lyrics:

Unicode corruption: Characters like ò, è, à are corrupted to √≤, √®, etc.
Escaped quotes: Quotes appear as \" instead of " in lyrics

Root Causes

Issue 1: MacRoman encoding misdetection

Location: RequestHandler.fetch_text() line 220
Cause: Setting r.encoding = None forces requests to use apparent_encoding
Problem: For Genius.com (and others), requests incorrectly detects MacRoman instead of UTF-8
Result: UTF-8 bytes c3 b2 (ò) decoded as MacRoman produces "√≤" (U+221A U+2264)

Issue 2: Incomplete JSON unescape

Location: Genius.scrape() line 576
Cause: The remove_backslash regex doesn't handle all escape patterns in JSON
Problem: Genius embeds lyrics in JSON with patterns like \\" and \\\\"
Result: After BeautifulSoup processing, escaped quotes remain in final text

Solution

Fix 1: Trust server encoding, fallback to UTF-8

# OLD: r.encoding = None
# NEW:
if not r.encoding:
    r.encoding = 'utf-8'

Respects server's declared encoding (UTF-8 for Genius)
Falls back to UTF-8 if no encoding specified (safer than apparent_encoding)
Preserves original intent of handling misconfigured servers

Fix 2: Iteratively clean escaped quotes

while '\\"' in lyrics:
    lyrics = lyrics.replace('\\"', '"')

Handles variable escape levels (\", \\\", \\\\\")
Minimal change - keeps original remove_backslash regex
Applied after BeautifulSoup to avoid interfering with HTML parsing

Testing

Tested with:

Caparezza - "Argenti Vive" (Italian, many accented characters)
WestsideGunn - "Heel Cena" (escaped quotes in lyrics)

Before:

mi si par√≤ davanti
\\"I got big moves\\"

After:

mi si parò davanti
"I got big moves"

Impact

Fixes lyrics for all languages with non-ASCII characters
Fixes Genius lyrics with quotes
No breaking changes - maintains backward compatibility
Minimal code changes (14 lines total)

## Problem The lyrics plugin has two bugs that corrupt fetched lyrics: 1. **Unicode corruption**: Characters like `ò`, `è`, `à` are corrupted to `√≤`, `√®`, etc. 2. **Escaped quotes**: Quotes appear as `\"` instead of `"` in lyrics ## Root Causes ### Issue 1: MacRoman encoding misdetection - **Location**: `RequestHandler.fetch_text()` line 220 - **Cause**: Setting `r.encoding = None` forces requests to use `apparent_encoding` - **Problem**: For Genius.com (and others), requests incorrectly detects MacRoman instead of UTF-8 - **Result**: UTF-8 bytes `c3 b2` (ò) decoded as MacRoman produces "√≤" (U+221A U+2264) ### Issue 2: Incomplete JSON unescape - **Location**: `Genius.scrape()` line 576 - **Cause**: The `remove_backslash` regex doesn't handle all escape patterns in JSON - **Problem**: Genius embeds lyrics in JSON with patterns like `\\"` and `\\\\"` - **Result**: After BeautifulSoup processing, escaped quotes remain in final text ## Solution ### Fix 1: Trust server encoding, fallback to UTF-8 ```python # OLD: r.encoding = None # NEW: if not r.encoding: r.encoding = 'utf-8' ``` - Respects server's declared encoding (UTF-8 for Genius) - Falls back to UTF-8 if no encoding specified (safer than apparent_encoding) - Preserves original intent of handling misconfigured servers ### Fix 2: Iteratively clean escaped quotes ```python while '\\"' in lyrics: lyrics = lyrics.replace('\\"', '"') ``` - Handles variable escape levels (`\"`, `\\\"`, `\\\\\"`) - Minimal change - keeps original `remove_backslash` regex - Applied after BeautifulSoup to avoid interfering with HTML parsing ## Testing Tested with: - Caparezza - "Argenti Vive" (Italian, many accented characters) - WestsideGunn - "Heel Cena" (escaped quotes in lyrics) Before: ``` mi si par√≤ davanti \\"I got big moves\\" ``` After: ``` mi si parò davanti "I got big moves" ``` ## Impact - Fixes lyrics for all languages with non-ASCII characters - Fixes Genius lyrics with quotes - No breaking changes - maintains backward compatibility - Minimal code changes (14 lines total)

github-actions · 2025-12-23T20:32:02Z

Thank you for the PR! The changelog has not been updated, so here is a friendly reminder to check if you need to add an entry.

sourcery-ai

Hey - I've left some high level feedback:

The triple-quoted string added after r = self.get(...) is currently just a no-op string expression; consider converting it to a regular # comment or a proper docstring on the method for clarity and to avoid confusion.
The while '\"' in lyrics loop could be made more robust and efficient by either limiting the number of iterations or using a regex / unescape utility (e.g., json.loads on a quoted string fragment) to normalize escape sequences in one pass rather than repeatedly scanning and replacing.

Prompt for AI Agents

Please address the comments from this code review:

## Overall Comments
- The triple-quoted string added after `r = self.get(...)` is currently just a no-op string expression; consider converting it to a regular `#` comment or a proper docstring on the method for clarity and to avoid confusion.
- The `while '\"' in lyrics` loop could be made more robust and efficient by either limiting the number of iterations or using a regex / unescape utility (e.g., `json.loads` on a quoted string fragment) to normalize escape sequences in one pass rather than repeatedly scanning and replacing.

Sourcery is free for open source - if you like our reviews please consider sharing them ✨

_{Help me be more useful! Please click 👍 or 👎 on each comment and I'll use the feedback to improve your reviews.}

## Key Improvements (addressing reviewer feedback): - ✅ Uses regex instead of while loop (more efficient, one pass) - ✅ No infinite loop risk - ✅ Handles any number of backslashes before quotes - ✅ Clear inline comments

In the fetch_text method (line ~207): if not r.encoding: r.encoding = "utf-8" # ← Double quotes! In the scrape method, make sure there are 2 blank lines before the next class: return None class Tekstowo(SearchBackend): # ← Two blank lines above That should pass the formatting check! The repo follows PEP 8 style (double quotes, 2 blank lines between classes).

codecov · 2025-12-23T20:54:04Z

Codecov Report

❌ Patch coverage is 62.50000% with 3 lines in your changes missing coverage. Please review.
✅ Project coverage is 68.41%. Comparing base (e048909) to head (4cfb1e3).
✅ All tests successful. No failed tests found.

Files with missing lines	Patch %	Lines
beetsplug/lyrics.py	62.50%	2 Missing and 1 partial ⚠️

Additional details and impacted files

@@            Coverage Diff             @@
##           master    #6233      +/-   ##
==========================================
+ Coverage   68.22%   68.41%   +0.18%     
==========================================
  Files         138      138              
  Lines       18792    18798       +6     
  Branches     3167     3168       +1     
==========================================
+ Hits        12821    12860      +39     
+ Misses       5298     5264      -34     
- Partials      673      674       +1

Files with missing lines	Coverage Δ
beetsplug/lyrics.py	`84.69% <62.50%> (+7.00%)`	⬆️

🚀 New features to boost your workflow:

📦 JS Bundle Analysis: Save yourself from yourself by tracking and limiting bundle sizes in JS merges.

grillofran · 2025-12-23T21:02:00Z

Is the 66% patch coverage requirement strict for bug fixes? The changes are small edge-case improvements. Happy to add tests if required!

semohr

Thank you for the PR. Since this seems to be your first contribution to beets, you might want to have a look at our developer guides. Welcome aboard!

Is the 66% patch coverage requirement strict for bug fixes? The changes are small edge-case improvements. Happy to add tests if required!

While it is not a requirement, for new code tests are always highly appreciated. Here for example showing that JSON encoded lyrics can be parsed as expected would be a good addition. Same for the inverse case showing that "normal" html lyrics still work as expected would be good (I guess the test already exists, havent look at our testsuit 😁 )

More of a general question: how come this was never raised as an issue? Seems like it should be a common issue to me. Have you tried to find related open issues yet?

beetsplug/lyrics.py

semohr · 2025-12-24T10:42:18Z

beetsplug/lyrics.py

            html_text = cls.remove_backslash(m[0]).replace(r"\n", "\n")
-            return cls.get_soup(html_text).get_text().strip()
+            lyrics = cls.get_soup(html_text).get_text().strip()
+            # Clean up any remaining escaped quotes (may need multiple passes)


Comments should be minimal and focus on the why rather than the how:

# Strip HTML and normalize quotes that were left because of JSON... return re.sub(r'\\+"', '"', lyrics)

It should explain why the escaped quotes exist without repeating how the regex works. You have quite some information on this in your PR, maybe we can condense it into a sentence?

snejus · 2025-12-24T11:29:21Z

Tested with:

* Caparezza - "Argenti Vive" (Italian, many accented characters)

* WestsideGunn - "Heel Cena" (escaped quotes in lyrics)

Are you happy to add both cases to test/plugins/lyrics_pages.py?

grillofran · 2025-12-26T20:17:32Z

Thanks for the feedback! I've updated the PR to address all your points.

Made the encoding fix Genius-only with a force_utf8 parameter (thanks @snejus)
Simplified the encoding check to r.encoding = r.encoding or "utf-8" (thanks @semohr)
Updated the comment to focus on why the escapes exist
Added test cases for Caparezza (Italian accents) and Arctic Monkeys (quoted lyrics), as the other one was... too explicit

Funny enough.... I couldn't find any issues reporting this in the repo's issue tracker, which I thought was strange given how many people must use Genius. Also, this is my first contribution to beets so apologies if I missed anything. Still learning the ropes! Let me know if there's anything else that needs adjusting.

beetsplug/lyrics.py

grillofran · 2025-12-31T09:07:50Z

Hey folks (happy holidays)

Ah, got it. I think I had misunderstood. I added the else: r.encoding = None to restore the original behavior for Google/AZLyrics. Should fix the Chinese character issue in the test. Setting r.encoding = None explicitly should fix that while keeping the UTF-8 fix for Genius only. Thoughts?

Thanks!

snejus · 2026-01-04T03:04:08Z

beetsplug/lyrics.py

            html_text = cls.remove_backslash(m[0]).replace(r"\n", "\n")
-            return cls.get_soup(html_text).get_text().strip()
+            lyrics = cls.get_soup(html_text).get_text().strip()
+            # Genius embeds lyrics in JSON; escape sequences remain after parsing
+            lyrics = re.sub(r'\\+"', '"', lyrics)
+            return lyrics


Now that we've sorted the encoding issue, I wonder whether this could possibly be solved by improving the pattern used by cls.remove_backslash?

grillofran requested a review from a team as a code owner December 23, 2025 20:31

sourcery-ai bot reviewed Dec 23, 2025

View reviewed changes

grillofran added 3 commits December 23, 2025 22:43

Update lyrics.py

1d49413

## Key Improvements (addressing reviewer feedback): - ✅ Uses regex instead of while loop (more efficient, one pass) - ✅ No infinite loop risk - ✅ Handles any number of backslashes before quotes - ✅ Clear inline comments

Minor linting change

b831191

semohr requested changes Dec 24, 2025

View reviewed changes

grillofran added 2 commits December 26, 2025 22:01

address pr feedback

2c300fa

Merge branch 'master' into patch-1

e5f75e9

grillofran requested a review from semohr December 26, 2025 20:05

grillofran added 2 commits December 26, 2025 22:09

fix formatting and type issues

9941ffd

fix type signature

f234686

snejus requested changes Dec 26, 2025

View reviewed changes

beetsplug/lyrics.py Show resolved Hide resolved

grillofran added 3 commits December 31, 2025 10:33

add azlyrics test case

8088797

Merge branch 'master' into patch-1

9254732

set encoding to None for non-genius backends

4cfb1e3

snejus reviewed Jan 4, 2026

View reviewed changes

Provide feedback

Saved searches

Use saved searches to filter your results more quickly

Uh oh!

Fix lyrics Unicode corruption and escaped quotes in Genius plugin #6233

Fix lyrics Unicode corruption and escaped quotes in Genius plugin #6233

Uh oh!

grillofran commented Dec 23, 2025 •

edited

Loading

Uh oh!

github-actions bot commented Dec 23, 2025

Uh oh!

sourcery-ai bot left a comment

Uh oh!

codecov bot commented Dec 23, 2025 •

edited

Loading

Uh oh!

grillofran commented Dec 23, 2025

Uh oh!

semohr left a comment

Uh oh!

Uh oh!

semohr Dec 24, 2025 •

edited

Loading

Uh oh!

snejus commented Dec 24, 2025

Uh oh!

grillofran commented Dec 26, 2025 •

edited

Loading

Uh oh!

Uh oh!

grillofran commented Dec 31, 2025 •

edited

Loading

Uh oh!

snejus Jan 4, 2026

Uh oh!

Reviewers

Assignees

Labels

Projects

Milestone

Development

Uh oh!

3 participants

Fix lyrics Unicode corruption and escaped quotes in Genius plugin #6233

Are you sure you want to change the base?

Fix lyrics Unicode corruption and escaped quotes in Genius plugin #6233

Uh oh!

Conversation

grillofran commented Dec 23, 2025 • edited Loading Uh oh! There was an error while loading. Please reload this page.

Uh oh!

Problem

Root Causes

Issue 1: MacRoman encoding misdetection

Issue 2: Incomplete JSON unescape

Solution

Fix 1: Trust server encoding, fallback to UTF-8

Fix 2: Iteratively clean escaped quotes

Testing

Impact

Uh oh!

github-actions bot commented Dec 23, 2025

Uh oh!

sourcery-ai bot left a comment

Choose a reason for hiding this comment

Uh oh!

codecov bot commented Dec 23, 2025 • edited Loading Uh oh! There was an error while loading. Please reload this page.

Uh oh!

Codecov Report

Uh oh!

grillofran commented Dec 23, 2025

Uh oh!

semohr left a comment

Choose a reason for hiding this comment

Uh oh!

Uh oh!

semohr Dec 24, 2025 • edited Loading Uh oh! There was an error while loading. Please reload this page.

Uh oh!

Choose a reason for hiding this comment

Uh oh!

snejus commented Dec 24, 2025

Uh oh!

grillofran commented Dec 26, 2025 • edited Loading Uh oh! There was an error while loading. Please reload this page.

Uh oh!

Uh oh!

Uh oh!

grillofran commented Dec 31, 2025 • edited Loading Uh oh! There was an error while loading. Please reload this page.

Uh oh!

Uh oh!

snejus Jan 4, 2026

Choose a reason for hiding this comment

Uh oh!

Reviewers

Assignees

Labels

Projects

Milestone

Development

Uh oh!

3 participants

grillofran commented Dec 23, 2025 •

edited

Loading

codecov bot commented Dec 23, 2025 •

edited

Loading

semohr Dec 24, 2025 •

edited

Loading

grillofran commented Dec 26, 2025 •

edited

Loading

grillofran commented Dec 31, 2025 •

edited

Loading