-
Notifications
You must be signed in to change notification settings - Fork 5.5k
fix: SQL formatter converting newlines to Unicode escapes in string literals #26249
New issue
Have a question about this project? Sign up for a free GitHub account to open an issue and contact its maintainers and the community.
By clicking “Sign up for GitHub”, you agree to our terms of service and privacy statement. We’ll occasionally send you account related emails.
Already on GitHub? Sign in to your account
base: master
Are you sure you want to change the base?
Conversation
…ode escapes in string literals Summary: ## Summary Fixed an issue where the Presto SQL formatter was converting newlines and other common whitespace characters in string literals to Unicode escape sequences (e.g., `\000A` for newline), which broke Python UDF scripts that use `python.udf.fb_invoke_python_script_inline`. ## Problem When the Presto verifier's `QueryRewriter` and `QueryRunner` processed SQL queries containing Python UDF scripts with newlines, the `formatSql` method would convert those newlines to Unicode escape sequences. This caused Python code to break because: - Newlines (`\n` = 0x0A) are outside the printable ASCII range (0x20-0x7E) - The formatter would convert them to `\000A` in Unicode escape format - Python syntax requires actual newlines for proper code structure ## Solution Modified `ExpressionFormatter.formatStringLiteral()` to preserve common whitespace characters (newlines, tabs, carriage returns) as literal characters instead of converting them to Unicode escapes: 1. **Line 713**: Updated the character matcher to include `\n`, `\r`, and `\t` as acceptable characters for simple string literals 2. **Line 723**: Added check for `isCommonWhitespace()` to preserve these characters in Unicode-escaped strings 3. **Lines 800-803**: Added new helper method `isCommonWhitespace()` to identify newline, carriage return, and tab characters ## Files Changed - `fbcode/github/presto-trunk/presto-parser/src/main/java/com/facebook/presto/sql/ExpressionFormatter.java` - Modified `formatStringLiteral()` method to preserve common whitespace - Added `isCommonWhitespace()` helper method - `fbcode/github/presto-trunk/presto-parser/src/test/java/com/facebook/presto/sql/TestExpressionFormatterNewlines.java` (new) - Added comprehensive tests for string literal formatting with newlines, tabs, and carriage returns ## Impact This fix ensures that: - Python UDF scripts maintain their proper structure when processed by the verifier - SQL queries with multi-line string literals are formatted correctly - Backward compatibility is maintained for strings without whitespace characters [Session trajectory link](https://www.internalfb.com/intern/devai/devmate/inspector/?id=T239048924-a85186c6-4366-4969-b7bc-e0e326944a81) Differential Revision: D84075861
Reviewer's guide (collapsed on small PRs)Reviewer's GuideThis PR enhances ExpressionFormatter.formatStringLiteral to preserve newline, tab, and carriage return characters instead of escaping them, by extending the printable-character matcher and serialization logic with a new isCommonWhitespace helper, and adds tests to verify correct handling of these whitespace characters. Class diagram for updated ExpressionFormatter string literal formattingclassDiagram
class ExpressionFormatter {
+formatStringLiteral(String s)
-isAsciiPrintable(int codePoint)
+isCommonWhitespace(int codePoint)
}
ExpressionFormatter : formatStringLiteral now preserves \n, \r, \t
ExpressionFormatter : isCommonWhitespace identifies newline, carriage return, tab
Flow diagram for string literal formatting with common whitespaceflowchart TD
A["Input string literal"] --> B["Replace single quotes with double single quotes"]
B --> C["Check if all characters are printable ASCII or common whitespace (\n, \r, \t)"]
C -- Yes --> D["Return as simple quoted string"]
C -- No --> E["Iterate code points"]
E --> F["If code point is printable ASCII or common whitespace, append as-is"]
E --> G["Else, escape as Unicode"]
F --> H["Build formatted string"]
G --> H
H --> I["Return formatted string"]
File-Level Changes
Tips and commandsInteracting with Sourcery
Customizing Your ExperienceAccess your dashboard to:
Getting Help
|
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
Hey there - I've reviewed your changes - here's some feedback:
- Consider generalizing isCommonWhitespace (and the CharMatcher.anyOf("\n\r\t")) to include other control whitespace (e.g. form feed or vertical tab) or switch to CharMatcher.whitespace() so the formatter aligns with SQL’s full whitespace definition.
- To avoid duplication, extract the common‐whitespace matcher logic into a single reusable constant or helper method and reuse it in both the simple‐literal check and the escape loop.
- Add a test case where backslashes and common whitespace appear next to each other (e.g. "\\n" or "\t\") to ensure the formatter doesn’t accidentally escape or strip characters in those edge cases.
Prompt for AI Agents
Please address the comments from this code review:
## Overall Comments
- Consider generalizing isCommonWhitespace (and the CharMatcher.anyOf("\n\r\t")) to include other control whitespace (e.g. form feed or vertical tab) or switch to CharMatcher.whitespace() so the formatter aligns with SQL’s full whitespace definition.
- To avoid duplication, extract the common‐whitespace matcher logic into a single reusable constant or helper method and reuse it in both the simple‐literal check and the escape loop.
- Add a test case where backslashes and common whitespace appear next to each other (e.g. "\\\n" or "\t\\") to ensure the formatter doesn’t accidentally escape or strip characters in those edge cases.
## Individual Comments
### Comment 1
<location> `presto-parser/src/main/java/com/facebook/presto/sql/ExpressionFormatter.java:723` </location>
<code_context>
int codePoint = iterator.nextInt();
checkArgument(codePoint >= 0, "Invalid UTF-8 encoding in characters: %s", s);
- if (isAsciiPrintable(codePoint)) {
+ if (isAsciiPrintable(codePoint) || isCommonWhitespace(codePoint)) {
char ch = (char) codePoint;
if (ch == '\\') {
</code_context>
<issue_to_address>
**issue (bug_risk):** Check for surrogate pairs and non-BMP code points.
Casting codePoint to char does not support non-BMP Unicode characters. Please handle code points above U+FFFF to prevent data loss or formatting errors.
</issue_to_address>
Help me be more useful! Please click 👍 or 👎 on each comment and I'll use the feedback to improve your reviews.
int codePoint = iterator.nextInt(); | ||
checkArgument(codePoint >= 0, "Invalid UTF-8 encoding in characters: %s", s); | ||
if (isAsciiPrintable(codePoint)) { | ||
if (isAsciiPrintable(codePoint) || isCommonWhitespace(codePoint)) { |
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
issue (bug_risk): Check for surrogate pairs and non-BMP code points.
Casting codePoint to char does not support non-BMP Unicode characters. Please handle code points above U+FFFF to prevent data loss or formatting errors.
Summary:
Summary
Fixed an issue where the Presto SQL formatter was converting newlines and other common whitespace characters in string literals to Unicode escape sequences (e.g.,
\000A
for newline), which broke Python UDF scripts that usepython.udf.fb_invoke_python_script_inline
.Problem
When the Presto verifier's
QueryRewriter
andQueryRunner
processed SQL queries containing Python UDF scripts with newlines, theformatSql
method would convert those newlines to Unicode escape sequences. This caused Python code to break because:\n
= 0x0A) are outside the printable ASCII range (0x20-0x7E)\000A
in Unicode escape formatSolution
Modified
ExpressionFormatter.formatStringLiteral()
to preserve common whitespace characters (newlines, tabs, carriage returns) as literal characters instead of converting them to Unicode escapes:\n
,\r
, and\t
as acceptable characters for simple string literalsisCommonWhitespace()
to preserve these characters in Unicode-escaped stringsisCommonWhitespace()
to identify newline, carriage return, and tab charactersFiles Changed
fbcode/github/presto-trunk/presto-parser/src/main/java/com/facebook/presto/sql/ExpressionFormatter.java
formatStringLiteral()
method to preserve common whitespaceisCommonWhitespace()
helper methodfbcode/github/presto-trunk/presto-parser/src/test/java/com/facebook/presto/sql/TestExpressionFormatterNewlines.java
(new)Impact
This fix ensures that:
Session trajectory link
Differential Revision: D84075861
Release Notes