Fix slurp_raw for files larger than 2GB #317
Conversation
Thank you. I'll take a look, but it may not be until next week.
For the second hunk I'm somewhat uncomfortable with sometimes switching to:

```perl
if ( defined($binmode) and $binmode eq ":unix" ) {
    # Use syswrite in a loop to handle the write() syscall size limit (~2GB)
    for my $data ( map { ref eq 'ARRAY' ? @$_ : $_ } @data ) {
        my $total_left    = length $data;
        my $total_written = 0;
        my $rc            = 0;
        while ( $total_left and ( $rc = syswrite $fh, $data, $total_left, $total_written ) ) {
            $total_left    -= $rc;
            $total_written += $rc;
        }
        $self->_throw( 'syswrite', $temp->[PATH] ) unless defined $rc;
    }
}
```
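The full-write pattern in the Perl hunk above can be sketched in Python for comparison (an illustrative stand-in, not Path::Tiny's code): keep calling the write syscall on the unwritten remainder, advancing by its return value, because a single call may transfer fewer bytes than requested.

```python
import os
import tempfile

def write_all(fd, data, chunk=1 << 20):
    """Write all of `data` to fd, tolerating partial writes.

    `chunk` caps each syscall. Real code would use a cap just under
    SSIZE_MAX (~2GB on Linux); a small cap here keeps the demo cheap
    while exercising the same loop logic.
    """
    view = memoryview(data)
    written = 0
    while written < len(view):
        # os.write may write fewer bytes than asked; advance by its return
        n = os.write(fd, view[written:written + chunk])
        written += n
    return written

if __name__ == "__main__":
    # Demo: round-trip a payload larger than the chunk cap
    payload = os.urandom(3 * 1024 * 1024)
    fd, path = tempfile.mkstemp()
    try:
        assert write_all(fd, payload, chunk=1 << 16) == len(payload)
        os.close(fd)
        with open(path, "rb") as fh:
            assert fh.read() == payload
    finally:
        os.unlink(path)
    print("ok")
```

The loop, not the cap size, is what guarantees no truncation: correctness does not depend on any single call transferring everything.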
Force-pushed 48a9d7e to 161da60
@ap thanks for the review & feedback. I have incorporated it and force-pushed.
Thanks. More thoughts, this time about the:

```perl
if (    defined($binmode)
    and $binmode eq ":unix"
    and my $total_left = -s $fh )
{
    # Read in a loop to handle the read() syscall size limit (~2GB)
    my $buf        = "";
    my $total_read = 0;
    my $rc         = 0;
    while ( $rc = read $fh, $buf, $total_left, $total_read ) {
        $total_read += $rc;
        # Ensure we keep read()ing until we get 0 or undef,
        # even if someone else changed the file length from under us
        $total_left = ( -s $fh ) - $total_read;
        $total_left = 1 if $total_left < 1;
    }
    $self->_throw('read') unless defined $rc;
    return $buf;
}
```
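For comparison, here is a minimal Python sketch of the same read-until-EOF idea (a hypothetical helper, not the module's code): issue capped reads and append each returned chunk until the OS reports end of file, rather than trusting one oversized read() to return everything.

```python
import os
import tempfile

def read_all(path, chunk=1 << 20):
    """Read a whole file by looping over capped read() calls.

    A small `chunk` stands in for the ~SSIZE_MAX per-call limit on
    Linux; the loop guarantees the full file is read regardless of
    how many bytes any single call returns.
    """
    parts = []
    fd = os.open(path, os.O_RDONLY)
    try:
        while True:
            buf = os.read(fd, chunk)  # returns b"" at EOF
            if not buf:
                break
            parts.append(buf)
    finally:
        os.close(fd)
    return b"".join(parts)

if __name__ == "__main__":
    # Demo: a file much larger than the chunk cap is still read in full
    payload = os.urandom(5 * 1024 * 1024)
    fd, path = tempfile.mkstemp()
    try:
        os.write(fd, payload)
        os.close(fd)
        assert read_all(path, chunk=1 << 16) == payload
    finally:
        os.unlink(path)
    print("ok")
```

Note this sketch accumulates chunks and joins them; the Perl hunk instead uses 4-argument read() with an offset to append in place, which avoids the extra copy.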
The read() and write() system calls on Linux have a maximum single-operation limit of approximately SSIZE_MAX (~2.1GB). When using the :unix PerlIO layer (which bypasses buffering), this limit caused silent data truncation for large files.

Affected methods:
- slurp_raw / slurp with binmode => ":unix"
- spew_raw / spew_utf8 (when Unicode::UTF8 is available)

For example, reading or writing a 3GB file would silently truncate to ~2.1GB.

Fix by using loops that continue reading/writing until all data is processed:
- slurp: loop with 4-argument read() to append at the correct offset
- spew: loop with 4-argument syswrite() over each data element, avoiding unnecessary data copying

The buffered PerlIO path (regular slurp/spew without :unix) was not affected, as PerlIO handles chunking internally.
Force-pushed 161da60 to f08d5d6
@ap thanks again, adjusted!
The read part looks more or less the same as what I'm doing in File::Slurper. I never got around to implementing the fast writer path because it is a PITA.
FYI: I am not ignoring this, but given some of the complexity, I'm going to move slowly on this and haven't had a chance to do a close read. Thank you, everyone, for your feedback so far.
The read() system call on Linux has a maximum single-read limit of
approximately SSIZE_MAX (~2.1GB). For files larger than this limit,
a single read() call returns fewer bytes than requested.
The previous implementation assumed a single read() would return all
requested bytes, causing silent data truncation for large files.
For example, a 3GB file would only return ~2.1GB of data.
Fix by looping until all bytes are read or EOF is reached, using the
4-argument form of read() to append at the correct buffer offset.
Fixes: #316