
Conversation

@igor-raits

The read() system call on Linux transfers at most 0x7ffff000 bytes
(~2.1GB) in a single call. When asked for more than that, a single
read() returns fewer bytes than requested.

The previous implementation assumed a single read() would return all
requested bytes, causing silent data truncation for large files:
slurping a 3GB file, for example, would return only ~2.1GB of data.

Fix by looping until all bytes are read or EOF is reached, using the
4-argument form of read() to append at the correct buffer offset.

Fixes: #316
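
For context, a minimal sketch of that looping pattern, assuming a filehandle $fh already opened with the :unix layer and a size taken from -s (the variable names are illustrative, not the PR's actual diff):

    # Keep calling 4-argument read() until all bytes are in; each retry
    # appends at OFFSET ($off) instead of overwriting the buffer.
    my $size = -s $fh;
    my $buf  = "";
    my $off  = 0;
    while ( $off < $size ) {
        my $rc = read( $fh, $buf, $size - $off, $off );
        die "read failed: $!" unless defined $rc;  # the real fix throws via the module
        last if $rc == 0;                          # EOF before the expected size
        $off += $rc;
    }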

@xdg
Contributor

xdg commented Dec 18, 2025

Thank you. I'll take a look but it may not be until next week.

@ap
Contributor

ap commented Dec 18, 2025

For the second hunk I’m somewhat uncomfortable with sometimes switching to syswrite and sometimes not, but – ya learn something new every day, even after 30 years – I went looking and found that not only (as I knew) is write not the complement to read (like sysread and syswrite are), but that apparently there isn’t any: no other I/O write function returns the number of bytes written. It astounds me that I never noticed this before. So there is no way around syswrite after all, I guess.

Separately from that, though, I also see a real issue with the use of join, which will always make a full copy of the data before writing it – unlike the code it replaces, which just passes the scalars to print directly. (Even if the individual scalars do get copied, perl uses copy-on-write to avoid copying their actual data buffers. But the join function cannot do that, it must always copy the data. Though I don’t know if it short-circuits to a copy-on-write scalar if it is only given a single scalar to join.)

And a more minor quibble is that the conditional is kinda convoluted. I think that hunk as a whole should instead be something like this:

    if ( defined($binmode) and $binmode eq ":unix" ) {
        # Use syswrite in a loop to handle write() syscall size limit (~2GB)
        for my $data ( map { ref eq 'ARRAY' ? @$_ : $_ } @data ) {
            my $total_left = length $data;
            my $total_written = 0;
            my $rc = 0;
            while ( $total_left and ( $rc = syswrite $fh, $data, $total_left, $total_written ) ) {
                $total_left -= $rc;
                $total_written += $rc;
            }
            $self->_throw('syswrite', $temp->[PATH]) unless defined $rc;
        }
    }

@igor-raits
Author

@ap thanks for the review & feedback. I have incorporated it and force-pushed.

@ap
Contributor

ap commented Dec 18, 2025

Thanks.

More thoughts, this time about the slurp hunk, which I couldn’t include in the first comment because I had to research the details a bit to put a fix together.

There is a subtle TOCTOU race condition in the patch: when the read is retried, the file may have been truncated or extended by someone else. To avoid this, the loop must always run until read returns 0, regardless of what -s said about how much data there is. But -s is still necessary, because read allocates the buffer up front and so has to be passed a reasonable value for its LENGTH parameter – which basically means -s has to be re-checked within the loop. Overall that leads to this:

    if ( defined($binmode) and $binmode eq ":unix"
        and my $total_left = -s $fh )
    {
        # Read in a loop to handle read() syscall size limit (~2GB)
        my $buf = "";
        my $total_read = 0;
        my $rc = 0;
        while ( $rc = read $fh, $buf, $total_left, $total_read ) {
            $total_read += $rc;
            # Ensure we will keep read()ing until we get 0 or undef
            # even if someone else changed the file length from under us
            $total_left = ( -s $fh ) - $total_read;
            $total_left = 1 if $total_left < 1;
        }
        $self->_throw('read') unless defined $rc;
        return $buf;
    }

The read() and write() system calls on Linux transfer at most
0x7ffff000 bytes (~2.1GB) in a single call. When using the
:unix PerlIO layer (which bypasses buffering), this limit caused
silent data truncation for large files.

Affected methods:
- slurp_raw / slurp with binmode => ":unix"
- spew_raw / spew_utf8 (when Unicode::UTF8 is available)

For example, reading or writing a 3GB file would silently truncate
to ~2.1GB.

Fix by using loops that continue reading/writing until all data is
processed:
- slurp: loop with 4-argument read() to append at correct offset
- spew: loop with 4-argument syswrite() over each data element,
  avoiding unnecessary data copying

The buffered PerlIO path (regular slurp/spew without :unix) was not
affected as PerlIO handles chunking internally.
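
For reference, the affected call sites look roughly like this (file names and sizes are illustrative, and the comments simply restate the commit message above):

    use Path::Tiny;

    # These take the :unix fast path and previously truncated at ~2.1GB:
    my $data = path("huge.bin")->slurp_raw;   # like slurp with binmode => ":unix"
    path("copy.bin")->spew_raw($data);

    # The default buffered path was never affected; PerlIO chunks internally:
    my $same = path("huge.bin")->slurp;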

@igor-raits
Author

@ap thanks again, adjusted!

@Leont

Leont commented Dec 19, 2025

The read part looks more or less the same as what I'm doing in File::Slurper. I never got around to implementing the fast writer path because it is a PITA.

@xdg
Contributor

xdg commented Dec 19, 2025

FYI: I am not ignoring this, but given some of the complexity, I'm going to move slowly on this and haven't had a chance to do a close read. Thank you, everyone, for your feedback so far.


Development

Successfully merging this pull request may close these issues.

Linux: slurp file bigger than 2GiB reads only part of it
