Conversation

coolnipunj (Collaborator)
Add Zip File Support

Description

This PR adds support for processing zip files in sum-buddy, allowing users to generate checksums for both zip files and their contents. This feature enhances the package's functionality by enabling comprehensive checksum generation for archived files.

Implementation Details

  • Created a new ArchiveHandler class in archive.py to manage zip file processing
  • Modified Mapper class to detect and process zip files
  • Updated Hasher class to handle both file paths and file-like objects
  • Enhanced get_checksums function to process both zip files and their contents
  • Added comprehensive test coverage for zip file handling:
    • Introduced tests/test_archive.py with a variety of tests for zip support
    • Added a sample archive file tests/test_archive.zip for automated testing
  • Updated the README to document the new test file and zip support

Testing

The implementation has been tested with:

  • Regular files
  • Zip files containing multiple files
  • Nested directory structures within zip files
  • Various file types and sizes
  • Automated tests (python -m pytest -v) all pass, including the new archive tests
  • Linter (ruff) passes with no errors

Design Decisions

  • Temporary Directory Approach: Using a temporary directory for zip extraction ensures clean handling of zip contents without leaving residual files
  • Dual Processing: Processing both the zip file itself and its contents provides complete checksum coverage
  • File-like Object Support: Enhanced Hasher class to handle both file paths and file-like objects for better flexibility
  • Clean Integration: Maintained the existing API while adding zip support, ensuring backward compatibility
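The dual path/file-like design described above can be sketched roughly as follows. checksum_file is the method name used in this PR, but the body here is an illustrative assumption, not the merged code:

```python
import hashlib
import os

def checksum_file(file_path_or_obj, algorithm="md5", chunk_size=4096):
    # Hypothetical sketch of the dual-input idea: accept either a
    # filesystem path or an already-open binary file-like object
    # (e.g. a member streamed out of a zip archive).
    hash_func = hashlib.new(algorithm)
    if isinstance(file_path_or_obj, (str, bytes, os.PathLike)):
        # A path: open it ourselves and stream from disk in chunks.
        with open(file_path_or_obj, "rb") as f:
            for chunk in iter(lambda: f.read(chunk_size), b""):
                hash_func.update(chunk)
    else:
        # Otherwise assume a readable binary stream; read it the same way.
        for chunk in iter(lambda: file_path_or_obj.read(chunk_size), b""):
            hash_func.update(chunk)
    return hash_func.hexdigest()
```

With this shape, a path and an io.BytesIO holding the same bytes produce the same digest, which is what lets zip members be hashed without extraction.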

coolnipunj and others added 7 commits June 13, 2025 14:59
Co-authored-by: Copilot <175728472+Copilot@users.noreply.github.com>
…cs/sum-buddy into feature/zip-support-new

@coolnipunj coolnipunj requested a review from thompsonmj June 18, 2025 21:13
@hlapp hlapp requested a review from Copilot June 18, 2025 21:18
@thompsonmj (Collaborator) left a comment

I really like the approach in __main__.py of avoiding unneeded disk I/O by feeding in-memory file-like objects to the hasher ... but we still have disk I/O in archive.py just to construct the paths! It would be good to move that virtual-file logic from __main__.py into archive.py.

A clean approach could be to have the package process the on-disk files as it does now, while mapper.py is extended to also return a list of the archive files whose contents can be processed separately with streaming, like you have here. In archive.py, each streaming function could be named stream_<archive-type>, with the appropriate support library imported in the context of the function (import zipfile for stream_zip, import tarfile for stream_tar, etc.), which seems like a reasonable way to keep it modular and extensible.

Besides, as Copilot pointed out, the current temp-directory approach also risks orphaned temporary files that never get cleaned up.

An alternative that applies the streaming approach from __main__.py in a dedicated location inside archive.py would make the most sense (even if it differs from the proposal above).

@thompsonmj thompsonmj requested a review from Copilot July 21, 2025 12:18
@Copilot Copilot AI left a comment
Pull Request Overview

This PR adds ZIP file support to sum-buddy, allowing the application to generate checksums for both ZIP files themselves and their individual contents. The implementation processes ZIP files by calculating the checksum of the archive file itself and streaming through each contained file to generate checksums without extracting to disk.

  • Modified the Mapper class to return separate lists for regular files and ZIP archives
  • Enhanced the Hasher class to accept both file paths and file-like objects for streaming ZIP content processing
  • Updated the main get_checksums function to handle dual processing of ZIP files and their contents

Reviewed Changes

Copilot reviewed 8 out of 9 changed files in this pull request and generated 4 comments.

Summary per file:

  • tests/test_mapper.py: Updated test assertions to handle the new tuple return format from gather_file_paths
  • tests/test_getChecksums.py: Updated mock return values to match the new tuple format from gather_file_paths
  • tests/test_archive.py: Added comprehensive test coverage for the new ZIP functionality
  • src/sumbuddy/mapper.py: Modified gather_file_paths to return separate lists for regular files and ZIP archives
  • src/sumbuddy/hasher.py: Enhanced checksum_file to handle both file paths and file-like objects
  • src/sumbuddy/archive.py: New ArchiveHandler class for ZIP file processing with streaming support
  • src/sumbuddy/main.py: Updated get_checksums to process both regular files and ZIP archives
  • README.md: Added documentation for the ZIP support feature
Comments suppressed due to low confidence (1)

tests/test_mapper.py:71

  • The test_gather_file_paths_empty method needs to be updated to handle the new tuple return format from gather_file_paths. The test should verify both regular_files and zip_archives are returned as empty lists.
    def test_gather_file_paths_empty(self):

Comment on lines 33 to +34

```python
        Generate list of file paths in the input directory based on ignore pattern rules.

        Parameters:
        ------------
        input_directory - String. Directory to traverse for files.
        ignore_file - String [optional]. Filepath for the ignore patterns file.
        include_hidden - Boolean [optional]. Whether to include hidden files.

        Returns:
        ---------
        file_paths - List. Files in input_directory that are not ignored.
        Returns a tuple: (regular_files, zip_archives)
```
Copilot AI (Jul 21, 2025)

The docstring should be updated to include the complete parameter and return documentation that was removed. The current docstring is incomplete and missing parameter descriptions and detailed return value information.


Collaborator

Yes, let's put the inputs back in the format they were in previously

```python
            for chunk in iter(lambda: f.read(4096), b""):
                hash_func.update(chunk)
        else:
            # Assume it's a file-like object
```
Copilot AI (Jul 21, 2025)

The file-like object handling should include error handling for cases where the object doesn't support the read() method. Consider adding a try-except block or type checking to provide a clearer error message.

Suggested change:

```diff
-        # Assume it's a file-like object
+        # Validate that the object supports the 'read' method
+        if not hasattr(file_path_or_obj, 'read'):
+            raise TypeError("The provided object is not a valid file-like object. It must support the 'read()' method.")
```

Collaborator

I'm having a hard time concocting a scenario where not hasattr(file_path_or_obj, 'read') would be true. The gather_file_paths and ArchiveHandler.stream_zip() functions should always return something that can be read. The only case I can think of is someone manually calling something like checksum_file(321), which would already give a pretty clear error: AttributeError: 'int' object has no attribute 'read'. I think we can leave this one alone for now.

```python
    def __init__(self):
        self.temp_dir = None

    def process_zip(self, zip_path, root_dir):
```
Copilot AI (Jul 21, 2025)

The process_zip method creates temporary directories but the cleanup responsibility is unclear. Consider using a context manager or ensuring cleanup is called consistently to prevent temporary directory accumulation.


Collaborator

Actually, with stream_zip in place, process_zip can be removed.

```python
            checksum = hasher.checksum_file(file_path, algorithm=algorithm, length=length)
            writer.writerow([file_path, os.path.basename(file_path), checksum])

    total_files = len(regular_files) + sum(1 for z in zip_archives for _ in ArchiveHandler.stream_zip(z)) + len(zip_archives)
```
Copilot AI (Jul 21, 2025)

The total_files calculation iterates through all ZIP archives twice - once to count files for the progress bar and again during processing. This is inefficient for large ZIP files. Consider caching the file counts or restructuring the progress tracking.

Suggested change:

```diff
-    total_files = len(regular_files) + sum(1 for z in zip_archives for _ in ArchiveHandler.stream_zip(z)) + len(zip_archives)
+    total_files = len(regular_files) + len(zip_archives)  # Start with regular files and ZIP archives themselves
```


Collaborator

Non-blocking for now. We can add a new issue to detect archives via file content headers, and perhaps use more generic "archive" language in __main__.py rather than referring to zip, once a new archive file type is supported.

@thompsonmj (Collaborator)

@coolnipunj The zip streaming looks great!
Final change request:

  • Add the docstrings back to gather_file_paths in mapper.py.
  • Remove the now-unused process_zip function from archive.py.
