Feature/zip support #29
base: main
Conversation
Co-authored-by: Copilot <175728472+Copilot@users.noreply.github.com>
…cs/sum-buddy into feature/zip-support-new
I really like the approach in `__main__.py` to avoid unneeded disk I/O by feeding in-memory file-like objects to the hasher ... but we still have disk I/O in `archive.py` just to construct the paths! It would be good to move that virtual file logic from `__main__.py` into `archive.py`.

A clean approach could be to have the package process the on-disk files as it does now, while `mapper.py` is extended to return an additional list of the archive files whose contents can be processed separately with streaming like you have here. In `archive.py`, each streaming function could be named `stream_<archive-type>`, with the appropriate support library loaded in the context of the function (`import zipfile` for `def stream_zip`, `import tarfile` for `def stream_tar`, etc.), which seems like a reasonable way to make it modular and extensible. A sketch of this layout follows below.

Besides, as Copilot pointed out, the current temp directory approach risks orphaned temp files that never get cleaned up. An alternative that applies the streaming approach you use in `__main__.py` from a dedicated location inside `archive.py` would make the most sense (even if it differs from the proposal above).
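A minimal sketch of what those per-format streaming functions could look like, assuming each yields `(member_name, file_like)` pairs for the hasher to consume (the names and yielded shape are illustrative, not the PR's actual API):

```python
def stream_zip(zip_path):
    """Yield (member_name, file_like) pairs for each file inside a ZIP archive."""
    import zipfile  # support library loaded in the context of the function
    with zipfile.ZipFile(zip_path, "r") as zf:
        for info in zf.infolist():
            if info.is_dir():
                continue  # skip directory entries; they have no content to hash
            with zf.open(info) as member:
                yield info.filename, member


def stream_tar(tar_path):
    """Yield (member_name, file_like) pairs for each file inside a tar archive."""
    import tarfile
    with tarfile.open(tar_path, "r:*") as tf:
        for member in tf.getmembers():
            if member.isfile():
                fileobj = tf.extractfile(member)
                if fileobj is not None:
                    yield member.name, fileobj
```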
…updated tests and docs
…streaming ZIP logic
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
Pull Request Overview
This PR adds ZIP file support to sum-buddy, allowing the application to generate checksums for both ZIP files themselves and their individual contents. The implementation processes ZIP files by calculating the checksum of the archive file itself and streaming through each contained file to generate checksums without extracting to disk.
- Modified the Mapper class to return separate lists for regular files and ZIP archives
- Enhanced the Hasher class to accept both file paths and file-like objects for streaming ZIP content processing
- Updated the main get_checksums function to handle dual processing of ZIP files and their contents
Reviewed Changes
Copilot reviewed 8 out of 9 changed files in this pull request and generated 4 comments.
File | Description
---|---
tests/test_mapper.py | Updated test assertions to handle the new tuple return format from gather_file_paths
tests/test_getChecksums.py | Updated mock return values to match the new tuple format from gather_file_paths
tests/test_archive.py | Added comprehensive test coverage for the new ZIP functionality
src/sumbuddy/mapper.py | Modified gather_file_paths to return separate lists for regular files and ZIP archives
src/sumbuddy/hasher.py | Enhanced checksum_file to handle both file paths and file-like objects
src/sumbuddy/archive.py | New ArchiveHandler class for ZIP file processing with streaming support
src/sumbuddy/main.py | Updated get_checksums to process both regular files and ZIP archives
README.md | Added documentation for the ZIP support feature
Comments suppressed due to low confidence (1)
tests/test_mapper.py:71
- The test_gather_file_paths_empty method needs to be updated to handle the new tuple return format from gather_file_paths. The test should verify both regular_files and zip_archives are returned as empty lists.
def test_gather_file_paths_empty(self):
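A sketch of the update the suppressed comment is asking for, assuming the test class has an empty-directory fixture (`self.empty_dir` is a hypothetical name) and that `gather_file_paths` is imported at module top:

```python
def test_gather_file_paths_empty(self):
    # gather_file_paths now returns a (regular_files, zip_archives) tuple;
    # both lists should be empty for an empty directory
    regular_files, zip_archives = gather_file_paths(self.empty_dir)
    self.assertEqual(regular_files, [])
    self.assertEqual(zip_archives, [])
```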
Generate list of file paths in the input directory based on ignore pattern rules.

Parameters:
------------
input_directory - String. Directory to traverse for files.
ignore_file - String [optional]. Filepath for the ignore patterns file.
include_hidden - Boolean [optional]. Whether to include hidden files.

Returns:
---------
file_paths - List. Files in input_directory that are not ignored.
Returns a tuple: (regular_files, zip_archives)
The docstring should be updated to include the complete parameter and return documentation that was removed. The current docstring is incomplete and missing parameter descriptions and detailed return value information.
Yes, let's put the inputs back in the format they were in previously.
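For concreteness, the restored docstring might read as follows (a sketch that merges the original parameter descriptions with the new tuple return; the exact signature is an assumption):

```python
def gather_file_paths(input_directory, ignore_file=None, include_hidden=False):
    """
    Generate lists of file paths in the input directory based on ignore pattern rules.

    Parameters:
    ------------
    input_directory - String. Directory to traverse for files.
    ignore_file - String [optional]. Filepath for the ignore patterns file.
    include_hidden - Boolean [optional]. Whether to include hidden files.

    Returns:
    ---------
    (regular_files, zip_archives) - Tuple of two lists. Regular files and ZIP
        archives in input_directory that are not ignored.
    """
```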
    for chunk in iter(lambda: f.read(4096), b""):
        hash_func.update(chunk)
else:
    # Assume it's a file-like object
The file-like object handling should include error handling for cases where the object doesn't support the read() method. Consider adding a try-except block or type checking to provide a clearer error message.
Suggested change (replacing the `# Assume it's a file-like object` comment):

# Validate that the object supports the 'read' method
if not hasattr(file_path_or_obj, 'read'):
    raise TypeError("The provided object is not a valid file-like object. It must support the 'read()' method.")
I'm having a hard time concocting a scenario where `not hasattr(file_path_or_obj, 'read')` would be true. `gather_file_paths` and `ArchiveHandler.stream_zip()` should always return something that can be read. The only thing I can think of is someone manually doing something like `checksum_file(321)`, which would already give a pretty clear error: `AttributeError: 'int' object has no attribute 'read'`. I think we can leave this one alone for now.
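For context, a minimal sketch of the dual-input `checksum_file` being discussed, shown here as a standalone function for brevity (assumptions: the real code in `src/sumbuddy/hasher.py` may structure this differently, and `length` is only meaningful for variable-length digests such as `shake_128`):

```python
import hashlib

def checksum_file(file_path_or_obj, algorithm="md5", length=None):
    hash_func = hashlib.new(algorithm)
    if isinstance(file_path_or_obj, str):
        # A string is treated as an on-disk path
        with open(file_path_or_obj, "rb") as f:
            for chunk in iter(lambda: f.read(4096), b""):
                hash_func.update(chunk)
    else:
        # Assume it's a file-like object; a bad input such as checksum_file(321)
        # fails loudly with AttributeError: 'int' object has no attribute 'read'
        for chunk in iter(lambda: file_path_or_obj.read(4096), b""):
            hash_func.update(chunk)
    # length only applies to variable-length algorithms like shake_128
    return hash_func.hexdigest(length) if length else hash_func.hexdigest()
```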
def __init__(self):
    self.temp_dir = None

def process_zip(self, zip_path, root_dir):
The process_zip method creates temporary directories but the cleanup responsibility is unclear. Consider using a context manager or ensuring cleanup is called consistently to prevent temporary directory accumulation.
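For reference, the context-manager pattern Copilot alludes to would look something like this (a hypothetical sketch only; as the reply below notes, `process_zip` ends up being removed in favor of `stream_zip`):

```python
import os
import tempfile
import zipfile

def process_zip(zip_path):
    """Extract a ZIP into a self-cleaning temp directory and yield member paths."""
    with tempfile.TemporaryDirectory() as temp_dir:
        with zipfile.ZipFile(zip_path, "r") as zf:
            zf.extractall(temp_dir)
        for dirpath, _, filenames in os.walk(temp_dir):
            for name in filenames:
                # Paths are only valid while the generator is being consumed
                yield os.path.join(dirpath, name)
    # temp_dir and everything in it is deleted here, even if an exception occurred
```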
Actually, with `stream_zip` in place, `process_zip` can be removed.
checksum = hasher.checksum_file(file_path, algorithm=algorithm, length=length)
writer.writerow([file_path, os.path.basename(file_path), checksum])

total_files = len(regular_files) + sum(1 for z in zip_archives for _ in ArchiveHandler.stream_zip(z)) + len(zip_archives)
The total_files calculation iterates through all ZIP archives twice - once to count files for the progress bar and again during processing. This is inefficient for large ZIP files. Consider caching the file counts or restructuring the progress tracking.
Suggested change:

total_files = len(regular_files) + len(zip_archives)  # Start with regular files and ZIP archives themselves
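One way to get an exact count without streaming every archive member twice (a sketch, assuming `regular_files` and `zip_archives` come from `gather_file_paths`; `count_zip_members` is a hypothetical helper):

```python
import zipfile

def count_zip_members(zip_path):
    # Reading the central directory is cheap; no member contents are decompressed
    with zipfile.ZipFile(zip_path, "r") as zf:
        return sum(1 for info in zf.infolist() if not info.is_dir())

# regular files + members inside each archive + the archive files themselves
total_files = (
    len(regular_files)
    + sum(count_zip_members(z) for z in zip_archives)
    + len(zip_archives)
)
```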
Non-blocking for now. We can add a new issue to detect archives by file content headers, and perhaps use more generic "Archive" language in `__main__.py` rather than referring to zip, once a new archive file type is supported.
@coolnipunj The zip streaming looks great!
Add Zip File Support

Description
This PR adds support for processing zip files in sum-buddy, allowing users to generate checksums for both zip files and their contents. This feature enhances the package's functionality by enabling comprehensive checksum generation for archived files.

Implementation Details
- Added `ArchiveHandler` class in `archive.py` to manage zip file processing
- Modified the `Mapper` class to detect and process zip files
- Enhanced the `Hasher` class to handle both file paths and file-like objects
- Updated the `get_checksums` function to process both zip files and their contents
- Added `tests/test_archive.py` with a variety of tests for zip support
- Added `tests/test_archive.zip` for automated testing

Testing
The implementation has been tested with:
- All existing tests (`python -m pytest -v`) pass, including the new archive tests

Design Decisions
- Updated the `Hasher` class to handle both file paths and file-like objects for better flexibility (the overall flow is sketched below)
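For illustration, a minimal sketch of how the pieces described above could fit together; the names come from this PR's discussion, but the exact signatures are assumptions, not the merged API:

```python
from sumbuddy.archive import ArchiveHandler
from sumbuddy.hasher import Hasher
from sumbuddy.mapper import gather_file_paths

def get_checksums(input_directory, algorithm="md5"):
    hasher = Hasher()
    # Mapper now returns regular files and ZIP archives as separate lists
    regular_files, zip_archives = gather_file_paths(input_directory)
    for path in regular_files:
        print(path, hasher.checksum_file(path, algorithm=algorithm))
    for zip_path in zip_archives:
        # Checksum the archive file itself...
        print(zip_path, hasher.checksum_file(zip_path, algorithm=algorithm))
        # ...then stream each member's contents without extracting to disk
        for name, fileobj in ArchiveHandler.stream_zip(zip_path):
            print(f"{zip_path}/{name}", hasher.checksum_file(fileobj, algorithm=algorithm))
```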