
Very slow linux download speed, excessive disk and cpu usage #83

@ghost

Description

https://github.com/spring/pr-downloader/blob/master/src/Downloader/Rapid/Sdp.cpp#L197-L205

This closes each downloaded file, then reopens and rereads it to check the MD5:

bool CFileSystem::fileIsValid(const FileData* mod, const std::string& filename) const
{
    HashMD5 md5hash;
    int bytes;
    unsigned char data[IO_BUF_SIZE];
    FILE* f = propen(filename.c_str(), "rb");
    gzFile inFile = gzdopen(fileno(f), "rb");

Since the curl write calls are sequential:

curl_easy_setopt(curlw->GetHandle(), CURLOPT_WRITEFUNCTION, write_streamed_data);

... this blocks the download loop while each file is opened, written, closed, reopened, wrapped in a gzip file object, MD5-checked and closed again.

There are much better and faster ways to check the files in memory, chunk by chunk, during the streamed download, using gzip, MD5 and CRC objects. CRC might be the fastest, and the sdp provides that data. A Python example (early, untested code that's really unstable):

https://github.com/serg9/rapid_nonet/blob/master/receive_loop.py#L118-L146

There is only one file close and no rereading of the file from disk. It uses zlib.decompressobj and binascii.crc32, which both work incrementally on a stream as it arrives; there have to be C++ ways to do it too.
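
To make the idea concrete, here is a minimal sketch of chunk-wise verification. It is not the code from the linked receive_loop.py and not anything in pr-downloader: StreamVerifier and receive are hypothetical names, and it assumes each downloaded file is a gzip stream and that the expected CRC32 refers to the decompressed data.

```python
import binascii
import gzip
import hashlib
import io
import zlib


class StreamVerifier:
    """Decompress gzip data chunk by chunk while updating CRC32 and MD5."""

    def __init__(self):
        # 16 + MAX_WBITS tells zlib to expect a gzip header and trailer.
        self._decomp = zlib.decompressobj(16 + zlib.MAX_WBITS)
        self._md5 = hashlib.md5()
        self.crc = 0
        self.size = 0

    def _update(self, data):
        # Both checksums accept partial input, so nothing is buffered.
        self.crc = binascii.crc32(data, self.crc)
        self._md5.update(data)
        self.size += len(data)

    def feed(self, chunk):
        """Process one compressed chunk as it arrives from the network."""
        self._update(self._decomp.decompress(chunk))

    def finish(self):
        """Flush the decompressor and return (crc32, md5 hex, size)."""
        self._update(self._decomp.flush())
        return self.crc & 0xFFFFFFFF, self._md5.hexdigest(), self.size


def receive(chunks, out, expected_crc):
    """Write each compressed chunk once and verify it in the same pass.

    `out` is any object with a write() method (hypothetical parameter);
    the file is never reopened or reread for the check.
    """
    verifier = StreamVerifier()
    for chunk in chunks:
        out.write(chunk)
        verifier.feed(chunk)
    crc, _md5_hex, _size = verifier.finish()
    return crc == expected_crc


# Usage with an in-memory stand-in for one downloaded file.
payload = b"example file contents " * 1000
compressed = gzip.compress(payload)
chunks = [compressed[i:i + 512] for i in range(0, len(compressed), 512)]
sink = io.BytesIO()
ok = receive(chunks, sink, binascii.crc32(payload))
```

The same shape should map onto C++ with zlib's inflate and crc32 functions called from inside the curl write callback, though I haven't tried that.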

Memory usage is not a problem since each received chunk fits comfortably in RAM. I don't know if I have some disk issues or if it's my current fstab setup, but the current version of pr-downloader keeps my disk at 100% utilization and the download speed is ~100-500kbps max (since it waits for the disk). In-RAM decompression plus a CRC check is substantially faster on my machine.

For me, this makes it unusable under Linux. It might also be related to how disk caching is handled, but I didn't look into that because Python more or less handles that on its own.

I can't reproduce the crash I told you about right now, but I'll keep trying.

If not many Linux users run into this, it should be marked low priority (I have special mounts).

strace: http://pastebin.com/f0HEJ1X6
strace with time of day: http://pastebin.com/dV6070GX (this makes the sluggishness obvious).
