-
Notifications
You must be signed in to change notification settings - Fork 2k
Fix lyrics Unicode corruption and escaped quotes in Genius plugin #6233
New issue
Have a question about this project? Sign up for a free GitHub account to open an issue and contact its maintainers and the community.
By clicking “Sign up for GitHub”, you agree to our terms of service and privacy statement. We’ll occasionally send you account related emails.
Already on GitHub? Sign in to your account
base: master
Are you sure you want to change the base?
Conversation
## Problem
The lyrics plugin has two bugs that corrupt fetched lyrics:
1. **Unicode corruption**: Characters like `ò`, `è`, `à` are corrupted to `√≤`, `√®`, etc.
2. **Escaped quotes**: Quotes appear as `\"` instead of `"` in lyrics
## Root Causes
### Issue 1: MacRoman encoding misdetection
- **Location**: `RequestHandler.fetch_text()` line 220
- **Cause**: Setting `r.encoding = None` forces requests to use `apparent_encoding`
- **Problem**: For Genius.com (and others), requests incorrectly detects MacRoman instead of UTF-8
- **Result**: UTF-8 bytes `c3 b2` (ò) decoded as MacRoman produces "√≤" (U+221A U+2264)
### Issue 2: Incomplete JSON unescape
- **Location**: `Genius.scrape()` line 576
- **Cause**: The `remove_backslash` regex doesn't handle all escape patterns in JSON
- **Problem**: Genius embeds lyrics in JSON with patterns like `\\"` and `\\\\"`
- **Result**: After BeautifulSoup processing, escaped quotes remain in final text
## Solution
### Fix 1: Trust server encoding, fallback to UTF-8
```python
# OLD: r.encoding = None
# NEW:
if not r.encoding:
r.encoding = 'utf-8'
```
- Respects server's declared encoding (UTF-8 for Genius)
- Falls back to UTF-8 if no encoding specified (safer than apparent_encoding)
- Preserves original intent of handling misconfigured servers
### Fix 2: Iteratively clean escaped quotes
```python
while '\\"' in lyrics:
lyrics = lyrics.replace('\\"', '"')
```
- Handles variable escape levels (`\"`, `\\\"`, `\\\\\"`)
- Minimal change - keeps original `remove_backslash` regex
- Applied after BeautifulSoup to avoid interfering with HTML parsing
## Testing
Tested with:
- Caparezza - "Argenti Vive" (Italian, many accented characters)
- WestsideGunn - "Heel Cena" (escaped quotes in lyrics)
Before:
```
mi si parò davanti
\\"I got big moves\\"
```
After:
```
mi si parò davanti
"I got big moves"
```
## Impact
- Fixes lyrics for all languages with non-ASCII characters
- Fixes Genius lyrics with quotes
- No breaking changes - maintains backward compatibility
- Minimal code changes (14 lines total)
|
Thank you for the PR! The changelog has not been updated, so here is a friendly reminder to check if you need to add an entry. |
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
Hey - I've left some high level feedback:
- The triple-quoted string added after
r = self.get(...)is currently just a no-op string expression; consider converting it to a regular#comment or a proper docstring on the method for clarity and to avoid confusion. - The
while '\"' in lyricsloop could be made more robust and efficient by either limiting the number of iterations or using a regex / unescape utility (e.g.,json.loadson a quoted string fragment) to normalize escape sequences in one pass rather than repeatedly scanning and replacing.
Prompt for AI Agents
Please address the comments from this code review:
## Overall Comments
- The triple-quoted string added after `r = self.get(...)` is currently just a no-op string expression; consider converting it to a regular `#` comment or a proper docstring on the method for clarity and to avoid confusion.
- The `while '\"' in lyrics` loop could be made more robust and efficient by either limiting the number of iterations or using a regex / unescape utility (e.g., `json.loads` on a quoted string fragment) to normalize escape sequences in one pass rather than repeatedly scanning and replacing.Help me be more useful! Please click 👍 or 👎 on each comment and I'll use the feedback to improve your reviews.
## Key Improvements (addressing reviewer feedback): - ✅ Uses regex instead of while loop (more efficient, one pass) - ✅ No infinite loop risk - ✅ Handles any number of backslashes before quotes - ✅ Clear inline comments
In the fetch_text method (line ~207):
if not r.encoding:
r.encoding = "utf-8" # ← Double quotes!
In the scrape method, make sure there are 2 blank lines before the next class:
return None
class Tekstowo(SearchBackend): # ← Two blank lines above
That should pass the formatting check! The repo follows PEP 8 style (double quotes, 2 blank lines between classes).
Codecov Report❌ Patch coverage is
Additional details and impacted files@@ Coverage Diff @@
## master #6233 +/- ##
==========================================
+ Coverage 68.22% 68.41% +0.18%
==========================================
Files 138 138
Lines 18792 18798 +6
Branches 3167 3168 +1
==========================================
+ Hits 12821 12860 +39
+ Misses 5298 5264 -34
- Partials 673 674 +1
🚀 New features to boost your workflow:
|
|
Is the 66% patch coverage requirement strict for bug fixes? The changes are small edge-case improvements. Happy to add tests if required! |
semohr
left a comment
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
Thank you for the PR. Since this seems to be your first contribution to beets, you might want to have a look at our developer guides. Welcome aboard!
Is the 66% patch coverage requirement strict for bug fixes? The changes are small edge-case improvements. Happy to add tests if required!
While it is not a requirement, for new code tests are always highly appreciated. Here for example showing that JSON encoded lyrics can be parsed as expected would be a good addition. Same for the inverse case showing that "normal" html lyrics still work as expected would be good (I guess the test already exists, havent look at our testsuit 😁 )
More of a general question: how come this was never raised as an issue? Seems like it should be a common issue to me. Have you tried to find related open issues yet?
beetsplug/lyrics.py
Outdated
| html_text = cls.remove_backslash(m[0]).replace(r"\n", "\n") | ||
| return cls.get_soup(html_text).get_text().strip() | ||
| lyrics = cls.get_soup(html_text).get_text().strip() | ||
| # Clean up any remaining escaped quotes (may need multiple passes) |
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
Comments should be minimal and focus on the why rather than the how:
# Strip HTML and normalize quotes that were left because of JSON...
return re.sub(r'\\+"', '"', lyrics)
It should explain why the escaped quotes exist without repeating how the regex works. You have quite some information on this in your PR, maybe we can condense it into a sentence?
Are you happy to add both cases to |
|
Thanks for the feedback! I've updated the PR to address all your points.
Funny enough.... I couldn't find any issues reporting this in the repo's issue tracker, which I thought was strange given how many people must use Genius. Also, this is my first contribution to beets so apologies if I missed anything. Still learning the ropes! Let me know if there's anything else that needs adjusting. |
|
Hey folks (happy holidays) Ah, got it. I think I had misunderstood. I added the Thanks! |
| html_text = cls.remove_backslash(m[0]).replace(r"\n", "\n") | ||
| return cls.get_soup(html_text).get_text().strip() | ||
| lyrics = cls.get_soup(html_text).get_text().strip() | ||
| # Genius embeds lyrics in JSON; escape sequences remain after parsing | ||
| lyrics = re.sub(r'\\+"', '"', lyrics) | ||
| return lyrics |
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
Now that we've sorted the encoding issue, I wonder whether this could possibly be solved by improving the pattern used by cls.remove_backslash?
Problem
The lyrics plugin has two bugs that corrupt fetched lyrics:
ò,è,àare corrupted to√≤,√®, etc.\"instead of"in lyricsRoot Causes
Issue 1: MacRoman encoding misdetection
RequestHandler.fetch_text()line 220r.encoding = Noneforces requests to useapparent_encodingc3 b2(ò) decoded as MacRoman produces "√≤" (U+221A U+2264)Issue 2: Incomplete JSON unescape
Genius.scrape()line 576remove_backslashregex doesn't handle all escape patterns in JSON\\"and\\\\"Solution
Fix 1: Trust server encoding, fallback to UTF-8
Fix 2: Iteratively clean escaped quotes
\",\\\",\\\\\")remove_backslashregexTesting
Tested with:
Before:
After:
Impact