Description
This new interface method is needed:
Create chunks from an input file, compress them, write a header.
Or in short: Create a zchunk file.
The same rules as in zchunk/zchunk#4 should apply.
I'd suggest a lower bound of 64 bytes and an upper bound of 1 Mbyte [for uncompressed chunks].
There are options ZCK_CHUNK_MIN and ZCK_CHUNK_MAX that can be set, but the defaults are 1 byte for the min and 10 MB for the max.
Things which need to be implemented:
Basic design goals
- Manual chunking
- Split string (which goes to the NEXT chunk)
- Split string (which goes to the CURRENT chunk, not the next one)
- Calculate optimal chunk size (combine / split created chunks).
Maybe we need a 2-pass algorithm, which would first try to chunk the input with the given string. Then, if a chunk is > avg * 4 || > ZCK_CHUNK_MAX, split it. If a chunk is < avg / 4 || < ZCK_CHUNK_MIN, merge it with the next one.
- Automatic splitting using buzhash
- Calculate optimal chunk size.
- If it is text, try chunking at line breaks with ZCK_CHUNK_MIN in mind.
- For all other files, create a 2-pass algorithm similar to the one above.
- Some intelligent separators for common file formats.
- If ZCK_CHUNK_MIN and/or ZCK_CHUNK_MAX are not set, choose from a standard config.
- What is a good default separator? We should make even binary formats more likely to share identical chunks across versions.
- Calculate optimal chunk size.
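The manual chunking idea from the list above (split string plus a 2-pass size correction) could be sketched roughly as follows. This is only an illustration in Python, not the proposed interface: the names `split_on` and `rebalance`, the `sep_to_next` flag, and the default bounds are assumptions made for this sketch. How to split an oversized chunk and what to do with a short trailing chunk are not specified above, so the sketch cuts at fixed offsets and leaves the tail as it is.

```python
from typing import List

# Assumed defaults for this sketch only (64 bytes / 1 MiB as suggested above).
ZCK_CHUNK_MIN = 64
ZCK_CHUNK_MAX = 1024 * 1024


def split_on(data: bytes, sep: bytes, sep_to_next: bool = True) -> List[bytes]:
    """Cut data at every occurrence of sep.

    sep_to_next=True puts the separator at the start of the NEXT chunk,
    sep_to_next=False keeps it at the end of the CURRENT chunk.
    """
    if not sep:
        return [data] if data else []
    chunks, start = [], 0
    pos = data.find(sep)
    while pos != -1:
        cut = pos if sep_to_next else pos + len(sep)
        if cut > start:  # never emit an empty chunk
            chunks.append(data[start:cut])
            start = cut
        pos = data.find(sep, pos + len(sep))
    if start < len(data):
        chunks.append(data[start:])
    return chunks


def rebalance(chunks: List[bytes],
              cmin: int = ZCK_CHUNK_MIN,
              cmax: int = ZCK_CHUNK_MAX) -> List[bytes]:
    """Split chunks that are too large, then merge chunks that are too small."""
    if not chunks:
        return []
    avg = sum(len(c) for c in chunks) / len(chunks)
    upper = min(avg * 4, cmax)   # "> avg * 4 || > ZCK_CHUNK_MAX"
    lower = max(avg / 4, cmin)   # "< avg / 4 || < ZCK_CHUNK_MIN"

    # Pass 1: split oversized chunks at fixed offsets (an assumption).
    step = max(1, int(upper))
    sized: List[bytes] = []
    for c in chunks:
        if len(c) > upper:
            sized.extend(c[i:i + step] for i in range(0, len(c), step))
        else:
            sized.append(c)

    # Pass 2: merge undersized chunks with the following one.
    # A merged chunk may exceed `upper` by less than `lower` bytes.
    merged: List[bytes] = []
    pending = b""
    for c in sized:
        pending += c
        if len(pending) >= lower:
            merged.append(pending)
            pending = b""
    if pending:  # short tail has no next chunk; keep it as it is
        merged.append(pending)
    return merged
```

For example, `rebalance(split_on(data, b"\n", sep_to_next=False), cmin=8, cmax=100)` chunks a text at line breaks and then evens out the sizes; concatenating the result always yields the original input again.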
About ZCK_CHUNK_MIN / ZCK_CHUNK_MAX
About ZCK_CHUNK_MIN: The HTTP(/2) headers alone take some bytes. Connection overhead + request headers + response headers can easily add up to 1 KiB (1024 bytes). That said, I'd set ZCK_CHUNK_MIN to at least 5 KiB, better 10 KiB, because it does not make sense to send 1 KiB of connection overhead to receive 1 byte of usable data.
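To put numbers on this, here is a small back-of-the-envelope calculation. It assumes a flat 1 KiB of overhead per chunk and one request per chunk, which is a simplification (several chunks may be fetched in a single range request); the function name `payload_efficiency` is made up for this sketch.

```python
def payload_efficiency(chunk_bytes: int, overhead_bytes: int = 1024) -> float:
    """Fraction of transferred bytes that is usable chunk data."""
    return chunk_bytes / (chunk_bytes + overhead_bytes)


if __name__ == "__main__":
    # Compare a few candidate values for ZCK_CHUNK_MIN.
    for size in (1, 64, 1024, 5 * 1024, 10 * 1024):
        print(f"{size:>6} B chunk: {payload_efficiency(size):.1%} payload")
```

Under these assumptions a 1-byte chunk is almost entirely overhead, a 5 KiB chunk is about 83% payload, and a 10 KiB chunk is about 91% payload.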
I think ZCK_CHUNK_MAX depends on the file format and the target audience. If you are downloading ISOs (let's say you have your ubuntu-18.04.iso and want to download ubuntu-18.04.2, which are both about 1.9 GB), you might want a different ZCK_CHUNK_MAX than for a 6 to 40 MiB repository metadata file. But: The bigger the chunk, the less likely it is to get an identical hash, especially for binary formats.
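The effect of chunk size on reuse can be illustrated with a toy experiment. This sketch uses fixed-size chunks, SHA-256, and synthetic data with four single-byte changes, all of which are assumptions chosen to keep the example small. Real files also contain insertions that shift everything after them, which is why content-based boundaries (split strings, buzhash) matter.

```python
import hashlib


def chunk_hashes(data: bytes, size: int) -> list:
    """SHA-256 digest of every fixed-size chunk of data."""
    return [hashlib.sha256(data[i:i + size]).digest()
            for i in range(0, len(data), size)]


def reused_fraction(old: bytes, new: bytes, size: int) -> float:
    """Fraction of the new file's bytes whose chunk already exists in old."""
    known = set(chunk_hashes(old, size))
    reused = 0
    for i in range(0, len(new), size):
        chunk = new[i:i + size]
        if hashlib.sha256(chunk).digest() in known:
            reused += len(chunk)
    return reused / len(new)


# Synthetic 64 KiB "old version" and a "new version" with four changed bytes.
old = bytes(i % 251 for i in range(64 * 1024))
changed = bytearray(old)
for pos in (1000, 20000, 40000, 60000):
    changed[pos] ^= 0xFF
new = bytes(changed)

if __name__ == "__main__":
    for size in (1024, 16 * 1024):
        print(f"chunk size {size:>6}: {reused_fraction(old, new, size):.1%} reusable")
```

With 1 KiB chunks only the four touched chunks have to be downloaded again, while with 16 KiB chunks every chunk contains a change and nothing can be reused.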
Default properties file / fileformats.chunking.yml
We could create a file for predefined file formats. Something like this: https://gist.github.com/bmhm/18b57655e0c0c8a5a38d6cdf487866e4
This file could easily be extended for other file formats.