Conversation

coolnipunj (Collaborator)
Add Zip File Support

Description

This PR adds support for processing zip files in sum-buddy, allowing users to generate checksums for both zip files and their contents. This feature enhances the package's functionality by enabling comprehensive checksum generation for archived files.

Implementation Details

  • Created a new ArchiveHandler class in archive.py to manage zip file processing
  • Modified Mapper class to detect and process zip files
  • Updated Hasher class to handle both file paths and file-like objects
  • Enhanced get_checksums function to process both zip files and their contents
  • Added comprehensive test coverage for zip file handling:
    • Introduced tests/test_archive.py with a variety of tests for zip support
    • Added a sample archive file tests/test_archive.zip for automated testing
  • Updated the README to document the new test file and zip support

Testing

The implementation has been tested with:

  • Regular files
  • Zip files containing multiple files
  • Nested directory structures within zip files
  • Various file types and sizes
  • Automated tests (python -m pytest -v) all pass, including the new archive tests
  • Linter (ruff) passes with no errors

Design Decisions

  • Temporary Directory Approach: Using a temporary directory for zip extraction ensures clean handling of zip contents without leaving residual files
  • Dual Processing: Processing both the zip file itself and its contents provides complete checksum coverage
  • File-like Object Support: Enhanced Hasher class to handle both file paths and file-like objects for better flexibility
  • Clean Integration: Maintained the existing API while adding zip support, ensuring backward compatibility
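The dual path/file-like design described above can be sketched roughly as follows. checksum_file is the method name used in this PR, but the body here is an illustrative assumption, not the merged code:

```python
import hashlib
import os

def checksum_file(file_path_or_obj, algorithm="md5", chunk_size=4096):
    # Hypothetical sketch of the dual-input idea: accept either a
    # filesystem path or an already-open binary file-like object
    # (e.g. a member streamed out of a zip archive).
    hash_func = hashlib.new(algorithm)
    if isinstance(file_path_or_obj, (str, bytes, os.PathLike)):
        # A path: open it ourselves and stream from disk in chunks.
        with open(file_path_or_obj, "rb") as f:
            for chunk in iter(lambda: f.read(chunk_size), b""):
                hash_func.update(chunk)
    else:
        # Otherwise assume a readable binary stream; read it the same way.
        for chunk in iter(lambda: file_path_or_obj.read(chunk_size), b""):
            hash_func.update(chunk)
    return hash_func.hexdigest()
```

With this shape, a path and an io.BytesIO holding the same bytes produce the same digest, which is what lets zip members be hashed without extraction.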

coolnipunj and others added 7 commits June 13, 2025 14:59
Co-authored-by: Copilot <175728472+Copilot@users.noreply.github.com>
…cs/sum-buddy into feature/zip-support-new

@coolnipunj coolnipunj requested a review from thompsonmj June 18, 2025 21:13
@hlapp hlapp requested a review from Copilot June 18, 2025 21:18
@thompsonmj (Collaborator) left a comment

I really like the approach in __main__.py of avoiding unneeded disk I/O by feeding in-memory file-like objects to the hasher ... but we still have disk I/O in archive.py just to construct the paths! It would be good to move that virtual-file logic from __main__.py into archive.py.

A clean approach could be to have the package process the on-disk files as it does now, while mapper.py is extended to also return a list of the archive files whose contents can be processed separately with streaming, like you have here. In archive.py, each streaming function could be named stream_<archive-type>, with the appropriate support library imported in the context of the function (import zipfile for stream_zip, import tarfile for stream_tar, etc.), which seems like a reasonable way to keep it modular and extensible.

Besides, as Copilot pointed out, the current temp-directory approach also risks orphaned temporary files that never get cleaned up.

An alternative that applies the streaming approach from __main__.py in a dedicated location inside archive.py would make the most sense (even if it differs from the proposal above).

@thompsonmj thompsonmj requested a review from Copilot July 21, 2025 12:18
@Copilot Copilot AI left a comment
Pull Request Overview

This PR adds ZIP file support to sum-buddy, allowing the application to generate checksums for both ZIP files themselves and their individual contents. The implementation processes ZIP files by calculating the checksum of the archive file itself and streaming through each contained file to generate checksums without extracting to disk.

  • Modified the Mapper class to return separate lists for regular files and ZIP archives
  • Enhanced the Hasher class to accept both file paths and file-like objects for streaming ZIP content processing
  • Updated the main get_checksums function to handle dual processing of ZIP files and their contents

Reviewed Changes

Copilot reviewed 8 out of 9 changed files in this pull request and generated 4 comments.

Summary per file:

  • tests/test_mapper.py: Updated test assertions to handle the new tuple return format from gather_file_paths
  • tests/test_getChecksums.py: Updated mock return values to match the new tuple format from gather_file_paths
  • tests/test_archive.py: Added comprehensive test coverage for the new ZIP functionality
  • src/sumbuddy/mapper.py: Modified gather_file_paths to return separate lists for regular files and ZIP archives
  • src/sumbuddy/hasher.py: Enhanced checksum_file to handle both file paths and file-like objects
  • src/sumbuddy/archive.py: New ArchiveHandler class for ZIP file processing with streaming support
  • src/sumbuddy/main.py: Updated get_checksums to process both regular files and ZIP archives
  • README.md: Added documentation for the ZIP support feature
Comments suppressed due to low confidence (1)

tests/test_mapper.py:71

  • The test_gather_file_paths_empty method needs to be updated to handle the new tuple return format from gather_file_paths. The test should verify both regular_files and zip_archives are returned as empty lists.
    def test_gather_file_paths_empty(self):

Comment on lines 33 to +34

```python
        Generate list of file paths in the input directory based on ignore pattern rules.

        Parameters:
        ------------
        input_directory - String. Directory to traverse for files.
        ignore_file - String [optional]. Filepath for the ignore patterns file.
        include_hidden - Boolean [optional]. Whether to include hidden files.

        Returns:
        ---------
        file_paths - List. Files in input_directory that are not ignored.
        Returns a tuple: (regular_files, zip_archives)
```
Copilot AI (Jul 21, 2025)

The docstring should be updated to include the complete parameter and return documentation that was removed. The current docstring is incomplete and missing parameter descriptions and detailed return value information.


Collaborator

Yes, let's put the inputs back in the format they were in previously

```python
            for chunk in iter(lambda: f.read(4096), b""):
                hash_func.update(chunk)
        else:
            # Assume it's a file-like object
```
Copilot AI (Jul 21, 2025)

The file-like object handling should include error handling for cases where the object doesn't support the read() method. Consider adding a try-except block or type checking to provide a clearer error message.

Suggested change:

```diff
-        # Assume it's a file-like object
+        # Validate that the object supports the 'read' method
+        if not hasattr(file_path_or_obj, 'read'):
+            raise TypeError("The provided object is not a valid file-like object. It must support the 'read()' method.")
```

Collaborator

I'm having a hard time concocting a scenario where not hasattr(file_path_or_obj, 'read') would be true. The gather_file_paths and ArchiveHandler.stream_zip() functions should always return something that can be read. The only case I can think of is someone manually calling something like checksum_file(321), which would already give a pretty clear error: AttributeError: 'int' object has no attribute 'read'. I think we can leave this one alone for now.

```python
    def __init__(self):
        self.temp_dir = None

    def process_zip(self, zip_path, root_dir):
```
Copilot AI (Jul 21, 2025)

The process_zip method creates temporary directories but the cleanup responsibility is unclear. Consider using a context manager or ensuring cleanup is called consistently to prevent temporary directory accumulation.


Collaborator

Actually, with stream_zip in place, process_zip can be removed.

```python
            checksum = hasher.checksum_file(file_path, algorithm=algorithm, length=length)
            writer.writerow([file_path, os.path.basename(file_path), checksum])

    total_files = len(regular_files) + sum(1 for z in zip_archives for _ in ArchiveHandler.stream_zip(z)) + len(zip_archives)
```
Copilot AI (Jul 21, 2025)

The total_files calculation iterates through all ZIP archives twice - once to count files for the progress bar and again during processing. This is inefficient for large ZIP files. Consider caching the file counts or restructuring the progress tracking.

Suggested change:

```diff
-    total_files = len(regular_files) + sum(1 for z in zip_archives for _ in ArchiveHandler.stream_zip(z)) + len(zip_archives)
+    total_files = len(regular_files) + len(zip_archives)  # Start with regular files and ZIP archives themselves
```


Collaborator

Non-blocking for now. We can add a new issue to detect archives via file content headers, and perhaps use more generic "archive" language in __main__.py rather than referring to zip, once a new archive file type is supported.

@thompsonmj (Collaborator)

@coolnipunj The zip streaming looks great!
Final change request:

  • Add the docstrings back to gather_file_paths in mapper.py.
  • Remove the now-unused process_zip function from archive.py.
