Skip to content

Conversation

abhinavmuk04
Copy link
Contributor

@abhinavmuk04 abhinavmuk04 commented Oct 7, 2025

Summary:

Summary

Fixed an issue where the Presto SQL formatter was converting newlines and other common whitespace characters in string literals to Unicode escape sequences (e.g., \000A for newline), which broke Python UDF scripts that use python.udf.fb_invoke_python_script_inline.

Problem

When the Presto verifier's QueryRewriter and QueryRunner processed SQL queries containing Python UDF scripts with newlines, the formatSql method would convert those newlines to Unicode escape sequences. This caused Python code to break because:

  • Newlines (\n = 0x0A) are outside the printable ASCII range (0x20-0x7E)
  • The formatter would convert them to \000A in Unicode escape format
  • Python syntax requires actual newlines for proper code structure

Solution

Modified ExpressionFormatter.formatStringLiteral() to preserve common whitespace characters (newlines, tabs, carriage returns) as literal characters instead of converting them to Unicode escapes:

  1. Line 713: Updated the character matcher to include \n, \r, and \t as acceptable characters for simple string literals
  2. Line 723: Added check for isCommonWhitespace() to preserve these characters in Unicode-escaped strings
  3. Lines 800-803: Added new helper method isCommonWhitespace() to identify newline, carriage return, and tab characters

Files Changed

  • fbcode/github/presto-trunk/presto-parser/src/main/java/com/facebook/presto/sql/ExpressionFormatter.java

    • Modified formatStringLiteral() method to preserve common whitespace
    • Added isCommonWhitespace() helper method
  • fbcode/github/presto-trunk/presto-parser/src/test/java/com/facebook/presto/sql/TestExpressionFormatterNewlines.java (new)

    • Added comprehensive tests for string literal formatting with newlines, tabs, and carriage returns

Impact

This fix ensures that:

  • Python UDF scripts maintain their proper structure when processed by the verifier
  • SQL queries with multi-line string literals are formatted correctly
  • Backward compatibility is maintained for strings without whitespace characters

Session trajectory link

Differential Revision: D84075861

Release Notes

== No RELEASE NOTES ==

…ode escapes in string literals

Summary:
## Summary

Fixed an issue where the Presto SQL formatter was converting newlines and other common whitespace characters in string literals to Unicode escape sequences (e.g., `\000A` for newline), which broke Python UDF scripts that use `python.udf.fb_invoke_python_script_inline`.

## Problem

When the Presto verifier's `QueryRewriter` and `QueryRunner` processed SQL queries containing Python UDF scripts with newlines, the `formatSql` method would convert those newlines to Unicode escape sequences. This caused Python code to break because:
- Newlines (`\n` = 0x0A) are outside the printable ASCII range (0x20-0x7E)
- The formatter would convert them to `\000A` in Unicode escape format
- Python syntax requires actual newlines for proper code structure

## Solution

Modified `ExpressionFormatter.formatStringLiteral()` to preserve common whitespace characters (newlines, tabs, carriage returns) as literal characters instead of converting them to Unicode escapes:

1. **Line 713**: Updated the character matcher to include `\n`, `\r`, and `\t` as acceptable characters for simple string literals
2. **Line 723**: Added check for `isCommonWhitespace()` to preserve these characters in Unicode-escaped strings
3. **Lines 800-803**: Added new helper method `isCommonWhitespace()` to identify newline, carriage return, and tab characters

## Files Changed

- `fbcode/github/presto-trunk/presto-parser/src/main/java/com/facebook/presto/sql/ExpressionFormatter.java`
  - Modified `formatStringLiteral()` method to preserve common whitespace
  - Added `isCommonWhitespace()` helper method

- `fbcode/github/presto-trunk/presto-parser/src/test/java/com/facebook/presto/sql/TestExpressionFormatterNewlines.java` (new)
  - Added comprehensive tests for string literal formatting with newlines, tabs, and carriage returns

## Impact

This fix ensures that:
- Python UDF scripts maintain their proper structure when processed by the verifier
- SQL queries with multi-line string literals are formatted correctly
- Backward compatibility is maintained for strings without whitespace characters

[Session trajectory link](https://www.internalfb.com/intern/devai/devmate/inspector/?id=T239048924-a85186c6-4366-4969-b7bc-e0e326944a81)

Differential Revision: D84075861
@abhinavmuk04 abhinavmuk04 requested a review from a team as a code owner October 7, 2025 19:05
@prestodb-ci prestodb-ci added the from:Meta PR from Meta label Oct 7, 2025
Copy link
Contributor

sourcery-ai bot commented Oct 7, 2025

Reviewer's guide (collapsed on small PRs)

Reviewer's Guide

This PR enhances ExpressionFormatter.formatStringLiteral to preserve newline, tab, and carriage return characters instead of escaping them, by extending the printable-character matcher and serialization logic with a new isCommonWhitespace helper, and adds tests to verify correct handling of these whitespace characters.

Class diagram for updated ExpressionFormatter string literal formatting

classDiagram
class ExpressionFormatter {
  +formatStringLiteral(String s)
  -isAsciiPrintable(int codePoint)
  +isCommonWhitespace(int codePoint)
}
ExpressionFormatter : formatStringLiteral now preserves \n, \r, \t
ExpressionFormatter : isCommonWhitespace identifies newline, carriage return, tab
Loading

Flow diagram for string literal formatting with common whitespace

flowchart TD
    A["Input string literal"] --> B["Replace single quotes with double single quotes"]
    B --> C["Check if all characters are printable ASCII or common whitespace (\n, \r, \t)"]
    C -- Yes --> D["Return as simple quoted string"]
    C -- No --> E["Iterate code points"]
    E --> F["If code point is printable ASCII or common whitespace, append as-is"]
    E --> G["Else, escape as Unicode"]
    F --> H["Build formatted string"]
    G --> H
    H --> I["Return formatted string"]
Loading

File-Level Changes

Change Details Files
Support common whitespace in string literal formatting
  • Extend CharMatcher to include '\n', '\r', and '\t' as printable
  • Adjust serialization loop to treat common whitespace as literal via isCommonWhitespace
  • Implement isCommonWhitespace helper method to identify newline, carriage return, and tab
presto-parser/src/main/java/com/facebook/presto/sql/ExpressionFormatter.java
Add tests for multiline and whitespace-preserving string literals
  • Add assertions for newline, tab, and carriage return preservation in string literals
  • Cover inline Python UDF scripts with embedded whitespace
presto-parser/src/test/java/com/facebook/presto/sql/parser/TestStatementBuilder.java

Tips and commands

Interacting with Sourcery

  • Trigger a new review: Comment @sourcery-ai review on the pull request.
  • Continue discussions: Reply directly to Sourcery's review comments.
  • Generate a GitHub issue from a review comment: Ask Sourcery to create an
    issue from a review comment by replying to it. You can also reply to a
    review comment with @sourcery-ai issue to create an issue from it.
  • Generate a pull request title: Write @sourcery-ai anywhere in the pull
    request title to generate a title at any time. You can also comment
    @sourcery-ai title on the pull request to (re-)generate the title at any time.
  • Generate a pull request summary: Write @sourcery-ai summary anywhere in
    the pull request body to generate a PR summary at any time exactly where you
    want it. You can also comment @sourcery-ai summary on the pull request to
    (re-)generate the summary at any time.
  • Generate reviewer's guide: Comment @sourcery-ai guide on the pull
    request to (re-)generate the reviewer's guide at any time.
  • Resolve all Sourcery comments: Comment @sourcery-ai resolve on the
    pull request to resolve all Sourcery comments. Useful if you've already
    addressed all the comments and don't want to see them anymore.
  • Dismiss all Sourcery reviews: Comment @sourcery-ai dismiss on the pull
    request to dismiss all existing Sourcery reviews. Especially useful if you
    want to start fresh with a new review - don't forget to comment
    @sourcery-ai review to trigger a new review!

Customizing Your Experience

Access your dashboard to:

  • Enable or disable review features such as the Sourcery-generated pull request
    summary, the reviewer's guide, and others.
  • Change the review language.
  • Add, remove or edit custom review instructions.
  • Adjust other review settings.

Getting Help

@abhinavmuk04 abhinavmuk04 changed the title [PythonUdf-Reliability] Fix SQL formatter converting newlines to Unicode escapes in string literals fix: SQL formatter converting newlines to Unicode escapes in string literals Oct 7, 2025
Copy link
Contributor

@sourcery-ai sourcery-ai bot left a comment

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

Hey there - I've reviewed your changes - here's some feedback:

  • Consider generalizing isCommonWhitespace (and the CharMatcher.anyOf("\n\r\t")) to include other control whitespace (e.g. form feed or vertical tab) or switch to CharMatcher.whitespace() so the formatter aligns with SQL’s full whitespace definition.
  • To avoid duplication, extract the common‐whitespace matcher logic into a single reusable constant or helper method and reuse it in both the simple‐literal check and the escape loop.
  • Add a test case where backslashes and common whitespace appear next to each other (e.g. "\\n" or "\t\") to ensure the formatter doesn’t accidentally escape or strip characters in those edge cases.
Prompt for AI Agents
Please address the comments from this code review:

## Overall Comments
- Consider generalizing isCommonWhitespace (and the CharMatcher.anyOf("\n\r\t")) to include other control whitespace (e.g. form feed or vertical tab) or switch to CharMatcher.whitespace() so the formatter aligns with SQL’s full whitespace definition.
- To avoid duplication, extract the common‐whitespace matcher logic into a single reusable constant or helper method and reuse it in both the simple‐literal check and the escape loop.
- Add a test case where backslashes and common whitespace appear next to each other (e.g. "\\\n" or "\t\\") to ensure the formatter doesn’t accidentally escape or strip characters in those edge cases.

## Individual Comments

### Comment 1
<location> `presto-parser/src/main/java/com/facebook/presto/sql/ExpressionFormatter.java:723` </location>
<code_context>
             int codePoint = iterator.nextInt();
             checkArgument(codePoint >= 0, "Invalid UTF-8 encoding in characters: %s", s);
-            if (isAsciiPrintable(codePoint)) {
+            if (isAsciiPrintable(codePoint) || isCommonWhitespace(codePoint)) {
                 char ch = (char) codePoint;
                 if (ch == '\\') {
</code_context>

<issue_to_address>
**issue (bug_risk):** Check for surrogate pairs and non-BMP code points.

Casting codePoint to char does not support non-BMP Unicode characters. Please handle code points above U+FFFF to prevent data loss or formatting errors.
</issue_to_address>

Sourcery is free for open source - if you like our reviews please consider sharing them ✨
Help me be more useful! Please click 👍 or 👎 on each comment and I'll use the feedback to improve your reviews.

int codePoint = iterator.nextInt();
checkArgument(codePoint >= 0, "Invalid UTF-8 encoding in characters: %s", s);
if (isAsciiPrintable(codePoint)) {
if (isAsciiPrintable(codePoint) || isCommonWhitespace(codePoint)) {
Copy link
Contributor

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

issue (bug_risk): Check for surrogate pairs and non-BMP code points.

Casting codePoint to char does not support non-BMP Unicode characters. Please handle code points above U+FFFF to prevent data loss or formatting errors.

Sign up for free to join this conversation on GitHub. Already have an account? Sign in to comment
Labels
from:Meta PR from Meta
Projects
None yet
Development

Successfully merging this pull request may close these issues.

2 participants