Use file content heuristics to decide file reader. #1962
Open
Dimi1010 wants to merge 65 commits into seladb:dev from Dimi1010:feature/heuristic-file-selection
Commits
02de760 Added heuristics file content detector that determines the content ba… (Dimi1010)
d2b6339 Merge remote-tracking branch 'upstream/dev' into feature/heuristic-fi… (Dimi1010)
685dd9f Moved stream checkpoint outside format detector as it is not directly… (Dimi1010)
40dee69 Added a new factory function `createReader` that uses the new heurist… (Dimi1010)
f1e3e18 Add <algorithm> include. (Dimi1010)
8da1790 Added unit tests. (Dimi1010)
3ad51e2 Deprecated old factory function. (Dimi1010)
15c2000 Add byte-swapped zstd magic number. (Dimi1010)
17af8d4 Lint (Dimi1010)
46418ec Move enum closer to first usage. (Dimi1010)
3d713ab Added unit tests for file reader device factory. (Dimi1010)
a2391ec Revert indentation. (Dimi1010)
ea328d7 Fixed StreamCheckpoint to restore original stream state. (Dimi1010)
db86c3e Merge branch 'dev' into feature/heuristic-file-selection (Dimi1010)
4aed9bd Merge branch 'dev' into feature/heuristic-file-selection (Dimi1010)
a83ae2b Moved isStreamSeekable helper to inside `CaptureFileFormatDetector`. (Dimi1010)
916e872 Added pcap magic number for Alexey Kuznetzov's modified pcap format. (Dimi1010)
022529f Merge remote-tracking branch 'upstream/dev' into feature/heuristic-fi… (Dimi1010)
169fcd2 Split the unit test into multiple smaller tests. (Dimi1010)
db8c848 Merge branch 'dev' into feature/heuristic-file-selection (Dimi1010)
3e74912 Merge remote-tracking branch 'upstream/dev' into feature/heuristic-fi… (Dimi1010)
f1613c4 Added helper to indicate if ZstSupport is enabled for PcapNg devices. (Dimi1010)
bc2bacd Split pcap microsecond and nanosecond file heuristics tests. (Dimi1010)
58ac45d Skipping Zst test case if zst is not supported. (Dimi1010)
3b4b5ad Due to file heuristics returning PcapNG format on Zstd archive, if Zs… (Dimi1010)
18379b4 Lint (Dimi1010)
8a4f6f8 Added invalid device factory to pcap tag. (Dimi1010)
7776e0e Updated static zst archives to be actual archives. (Dimi1010)
4f52f59 Centralized PTF test name width under a macro. (Dimi1010)
88ebfff Add Pcap++Test header files to test sources for IDE tooling. (Dimi1010)
41fe188 Fixed test output formatting. (Dimi1010)
c8ae4f8 Lint (Dimi1010)
c7cab2b Typo fix. (Dimi1010)
6d55077 Merge remote-tracking branch 'upstream/dev' into feature/heuristic-fi… (Dimi1010)
682eeac Shortened test names. (Dimi1010)
07804da Simplified invalid file test. (Dimi1010)
9c4fc08 Simplified ZST tests. (Dimi1010)
d975157 Added snoop test. (Dimi1010)
40530df Expanded granularity of file format detection. (Dimi1010)
96a61b2 Marked `checkSupport` functions as constexpr to enable compile time o… (Dimi1010)
55a6b7a Exclude json from pre-commit cppcheck as it is slow due to many defin… (Dimi1010)
3ab14e7 Lint (Dimi1010)
5dd9a30 Fix runtime side effects inside constexpr function. (Dimi1010)
45ad769 Added a secondary factory function to separate mixed error handling m… (Dimi1010)
d24a9ad Revert deprecation message, as doxygen is unhappy. (Dimi1010)
f5ff879 Update tests. (Dimi1010)
2c1b2c4 Update deprecation warning to point to the function closer to the sig… (Dimi1010)
8d1ed1d Catch general exception instead of runtime error. (Dimi1010)
0ea2da9 Shortened deprecation message due to pre-commit warnings when its is … (Dimi1010)
c209c90 Fix braces. (Dimi1010)
8d77aa0 Simplfy test. (Dimi1010)
af12d2f Added tests for createReader failures. (Dimi1010)
b357087 Merge branch 'dev' into feature/heuristic-file-selection (Dimi1010)
443c883 Simplified pcap detection to not require to read the entire pcap header. (Dimi1010)
202d5cc Added const qualifiers to detector methods. (Dimi1010)
e6b2aa9 Added dedicated unit tests for CaptureFileFormatDetector. (Dimi1010)
b8fb635 Added more tests for `createReader`. (Dimi1010)
c6c7720 Add static assert for array indice checks. (Dimi1010)
181a8b4 Updated detectPcap selection. (Dimi1010)
76aa850 Merge branch 'dev' into feature/heuristic-file-selection (Dimi1010)
d8f7419 Extracted capture format detector to remove it from publicly availabl… (Dimi1010)
91a7a0a Fix includes. (Dimi1010)
e275950 Removed duplicate files from tracking. (Dimi1010)
e7a42b5 Lint (Dimi1010)
54f7bae Trimmed pcapng sample. (Dimi1010)
If I'm not mistaken, this used to be in the `.cpp` file, right? Is the reason we moved it to the `.h` file to make it easier to test?

If yes, I think we can test it using `createReader()`: create a temporary fake file with the data we want to test, and delete it when the test is done.

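The fake-file approach suggested above could look roughly like the following. This is only a sketch under stated assumptions: `ScopedFakeFile`, `readPrefix`, and the file name are hypothetical helpers invented for illustration, and `readPrefix` merely stands in for the call into the reader factory, whose signature is not shown in this thread.

```cpp
#include <cassert>
#include <cstddef>
#include <cstdint>
#include <cstdio>
#include <fstream>
#include <string>
#include <utility>
#include <vector>

// Hypothetical test helper: writes `bytes` to `path` and removes the file on destruction.
class ScopedFakeFile
{
public:
    ScopedFakeFile(std::string path, const std::vector<std::uint8_t>& bytes) : m_path(std::move(path))
    {
        std::ofstream out(m_path, std::ios::binary | std::ios::trunc);
        out.write(reinterpret_cast<const char*>(bytes.data()), static_cast<std::streamsize>(bytes.size()));
    }  // `out` is flushed and closed here

    ~ScopedFakeFile() { std::remove(m_path.c_str()); }

    ScopedFakeFile(const ScopedFakeFile&) = delete;
    ScopedFakeFile& operator=(const ScopedFakeFile&) = delete;

    const std::string& path() const { return m_path; }

private:
    std::string m_path;
};

// Reads back up to `count` bytes; a stand-in for handing the path to the factory under test.
std::vector<std::uint8_t> readPrefix(const std::string& path, std::size_t count)
{
    std::vector<std::uint8_t> buffer(count);
    std::ifstream in(path, std::ios::binary);
    in.read(reinterpret_cast<char*>(buffer.data()), static_cast<std::streamsize>(count));
    buffer.resize(static_cast<std::size_t>(in.gcount()));
    return buffer;
}

// Example: a file that contains only the little-endian pcap magic number.
void spoofedPcapExample()
{
    const std::vector<std::uint8_t> magic = { 0xd4, 0xc3, 0xb2, 0xa1 };
    {
        ScopedFakeFile fake("spoofed_magic.tmp", magic);
        assert(readPrefix(fake.path(), 4) == magic);
    }
    // The file is gone once the helper leaves scope.
    assert(readPrefix("spoofed_magic.tmp", 4).empty());
}
```

In a real test the `readPrefix` call would be replaced by the factory call, and the assertion would check the type of the returned device.
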
I tried that suggestion initially, but it would have been an extremely fragile unit test. The "pass" conditions would have been checked indirectly.

Also, `createReader` has multiple return paths for Nano / Zst file formats, which would have caused complications: the format test would have needed to care about the environment it runs in, which it doesn't have to as a standalone test.

Any additional changes to `createReader` could also break the test, which they really shouldn't. For example, I am thinking of adding logic for Zst archives to check whether the compressed data is actually a pcapng and not a random file. This would be a nightmare to make compatible with the "spoofed files" test, because the test assumes that `createReader` doesn't do anything more complicated than check the initial magic number.

So, in the end, you end up with a more complicated unit test to read through that:

- […] `createReader` factory, too.
- […] `createReader` as it uses its behavior to test `detectFormat`.

I understand it's better to test `CaptureFileFormatDetector` as a standalone class, but that requires exposing it in the `.h` file, which is not great (even though it's in the `internal` namespace). Testing `createReader` is a bit more fragile, but I don't think the difference is that big. Of course, if we add logic to detect more file types or update the existing detection logic, some tests might break, but we can easily fix them as needed.

I usually try to avoid the `internal` namespace where possible because it's still in the `.h` file and is exposed to users, and we'd like to keep our API as clean as possible.

It is a big difference, and it's not always an easy fix. I plan to add the aforementioned Zst checks in another PR after this one, and that would make Zst spoofing in `createReader` impossible, because the Zst format would automatically be checked for PcapNg or Unknown contents. Therefore you can't rely on the return value of `createReader` to find out what `detectFormat` returned, because `nullptr` can result from several `detectFormat` return values (Unknown, Nano + unsupported, Zst + unsupported). We have already had issues with tests being silently broken (#1977 comes to mind), so I would prefer to avoid fragile tests if we can.

Fair, it is exposed, but that is the entire reason for having the `internal` namespace. It is a common convention that external users shouldn't really touch it. If you want to keep the primary public header files clean, there are a couple of options:

- […] `internal` / `detail` in their public include folder, where they keep all their internal code headers that need to be exposed. That keeps the "internal" code separate from the "public" code if users want to read through the headers. This is a common convention used in Boost libraries. "Public" headers that depend on internal headers include them from the `internal` subfolder.
- […] `CaptureFileFormatDetector` is only needed in the `cpp` part and not in the header part, we can extract it to a fully internal header, kept with the source files. This would prevent it from being exposed in the public API, but the Test project can be manually set to search for headers in "Pcap++/src" too, to allow it to link in the tests.

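The ambiguity described in the comment above (several detected formats collapsing into the same `nullptr` result) can be illustrated with a small sketch. Everything here is a hypothetical stand-in and not the PR's code: the `CaptureFormat` enum, the `Reader` types, the `BuildSupport` flags, and `createReaderFor` are assumptions made only for illustration.

```cpp
#include <cassert>
#include <memory>
#include <string>

// Hypothetical detector result; the real enum in the PR may differ.
enum class CaptureFormat { Unknown, Pcap, PcapNano, PcapNg, Zstd };

struct Reader
{
    virtual ~Reader() = default;
    virtual std::string name() const = 0;
};
struct PcapReader : Reader
{
    std::string name() const override { return "pcap"; }
};
struct PcapNgReader : Reader
{
    std::string name() const override { return "pcapng"; }
};

// Hypothetical build-time capabilities of the environment the test runs in.
struct BuildSupport
{
    bool nanoPrecision;
    bool zstd;
};

// Hypothetical factory: three different inputs end in the same nullptr result.
std::unique_ptr<Reader> createReaderFor(CaptureFormat format, BuildSupport support)
{
    switch (format)
    {
    case CaptureFormat::Pcap:
        return std::make_unique<PcapReader>();
    case CaptureFormat::PcapNano:
        if (!support.nanoPrecision)
            return nullptr;  // Nano + unsupported
        return std::make_unique<PcapReader>();
    case CaptureFormat::PcapNg:
        return std::make_unique<PcapNgReader>();
    case CaptureFormat::Zstd:
        if (!support.zstd)
            return nullptr;  // Zst + unsupported
        return std::make_unique<PcapNgReader>();
    case CaptureFormat::Unknown:
        break;
    }
    return nullptr;  // Unknown
}

// A nullptr result alone does not reveal which format the detector reported.
void nullptrIsAmbiguous()
{
    assert(createReaderFor(CaptureFormat::Unknown, { true, true }) == nullptr);
    assert(createReaderFor(CaptureFormat::PcapNano, { false, true }) == nullptr);
    assert(createReaderFor(CaptureFormat::Zstd, { true, false }) == nullptr);
}
```

A test that only observes the factory's return value would see the same outcome for all three inputs, and the outcome would also change with the build configuration.
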
I'm not sure I understand... if we create fake files we know which type to expect, so all the test needs to do is verify the created file device is of the expected type 🤔

I guess we can do that, but I still don't understand why we can't test it with `createReader` or `tryCreateReader`.

I agree that the positive scenarios are the more important ones. The way I see it, tho, the positive scenarios are covered by the combination of the two unit tests, aren't they?

Magic number variations for the file types are covered by their standalone unit test.

The file unit tests use a fully valid sample file to validate the output of `createReader` for a certain file type. The precise variation of the Pcap magic number doesn't matter in this context, as it is covered by the standalone test. It only matters that it detects the Pcap format and that the factory creates a `PcapLiveDevice`.

I suppose the way I see it, for `createReader` the file format detection is an external implementation detail that should be assumed to work correctly in that context.

In summary, my viewpoint is that:

- `createReader` tests should focus on validating the behavior of the function on a sample of a type: (Pcap, PcapNG, Snoop, Unknown, etc.). Essentially, every case of the `switch` statement should be covered once.
- `detectFormat` tests should focus on correctly detecting every possible way a sample can be considered "valid" or "invalid" (e.g., magic number variations, etc).

Neither test should concern itself with the responsibilities of the other, as that leads to unnecessary duplication.

PS: Tbh, I consider every "valid" fake file a bug. :)

In other languages like Python or JavaScript it's easier to test code that is considered private or protected (mostly because these languages don't have such a concept 🙂). However, for languages that enforce stricter access control, like C++ and Java, we sometimes need to make compromises in tests.

What guides me is that if I need to make significant changes just to allow testing a module, this is usually not a great idea, especially if I can find a workaround. In our case, especially with what we want to test, I don't think it's worth moving `CaptureFileFormatDetector` to a different file just for tests.

Tbh, I think that unit tests should mostly validate public API for an object. Not `private`, as that is complicated.

For me, what guides me is the Single Responsibility Principle:

- The `createReader` factory has the sole responsibility of creating the correct reader for a given file format.
- `CaptureFileFormatDetector::detectFormat` has the sole responsibility of determining the file format of a given byte stream.

The file format detector might not be public to the library's users, but it is external to the `createReader` factory. The factory outsources the detection logic to the format detector and depends on it, but it should not be tightly integrated with it, and as such, testing the format detector's correctness should not fall to the reader factory's unit tests.

Moving the format detector to a different file for tests is a relatively minor change, IMO. It is the same as moving the `ObjectPool` to a different file from `Logger` even though it is only used there, no? The only difference is that I initially made it local to `PcapFileDevice.cpp`, as it didn't have tests then.

Having it as a non-public header does make it a bit harder to include in the tests project, but it is a minor workaround, as it would only be included once for its tests.

These 2 statements sometimes contradict, and this is a good example: on one hand, `createReader` is the public API which you claim we should test, while `CaptureFileFormatDetector` is considered private and shouldn't be tested; on the other hand, the Single Responsibility Principle dictates testing this "private" code. However, since C++ doesn't allow accessing private/protected methods, I usually examine it case by case. In this case I think it's ok to only test `createReader` because it covers `CaptureFileFormatDetector` well.

`ObjectPool` is different because it's not specific to the `Logger` and can be reused elsewhere. `CaptureFileFormatDetector` is very specific to this file, so extracting it to its own `.h` and `.cpp` files just for the sake of testing seems redundant to me.

Sometimes they can, yes, but the key word here is "for an object". What is public depends on the context you are looking at:

[…]

Generally, unit tests should validate code behaviour in the most isolated case that can be achieved (that isn't a trivial case), to keep them simple and so you can easily locate the error. That means cases should be isolated enough that a failure shows the smallest reasonable region where the error is, not just that the error exists (example below). As such, the context most looked at should be option 2, as that generally keeps the individual tests more isolated and on the smaller side. The format detector's logic is complex enough to justify treating it as a separate unit.

Under option 2, `CaptureFileFormatDetector` is not considered "private" but a separate object that must be validated independently, as it is not trivial.

In this case, we have no such issues, as the format detector is a separate reusable object and not a private method of `IFileReaderDevice`. The only issue is that of linkage, as the tests should be able to link to the format detector. C++ allows that naturally by giving it external linkage, but you don't want it exposed in the public API (as it is rightfully not needed by users), therefore a separate header that is considered "internal", not "private".

In the current case, "I think that unit tests should mostly validate public API for an object" means that the format detector should only have a unit test for `detectFormat` and not for all the implementations of the private methods `detectPcap`, `isPcapNG`, etc.

Here is an example:
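The snippet that originally followed is not preserved in this copy of the thread. A minimal sketch consistent with the surrounding description might look like this; the class bodies, the return values, and the test function names are assumptions, and only the names `Foo`, `Bar`, `doFoo`, `bar` and the two valid ranges come from the discussion.

```cpp
#include <cassert>
#include <stdexcept>

// Unit 1 (assumed body): valid for any input in [0, IntMax].
class Bar
{
public:
    int bar(int a) const
    {
        if (a < 0)
            throw std::invalid_argument("bar: a must be >= 0");
        return a / 2;
    }
};

// Unit 2 (assumed body): valid for [5, IntMax]; delegates part of its work to Bar.
class Foo
{
public:
    int doFoo(int a) const
    {
        if (a < 5)
            throw std::invalid_argument("doFoo: a must be >= 5");
        return m_bar.bar(a) + 1;
    }

private:
    Bar m_bar;
};

// Bar's own test can reach inputs 0..4, which doFoo never forwards to bar().
void testBar()
{
    assert(Bar().bar(0) == 0);
    assert(Bar().bar(4) == 2);
}

// Foo's test checks only the externally observable behaviour of doFoo().
void testFoo()
{
    assert(Foo().doFoo(5) == 3);
    bool rejected = false;
    try
    {
        Foo().doFoo(4);
    }
    catch (const std::invalid_argument&)
    {
        rejected = true;
    }
    assert(rejected);
}
```

With both tests in place, a failure in `testFoo` while `testBar` passes points at `Foo` itself.
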
In the above case, you have 2 logically independent reusable units: Bar and Foo.

Foo's unit tests should concern themselves only with validating the input / output behaviour of `doFoo` and other public methods of `Foo`. It should not concern itself with validating `bar()`'s behaviour, because:

- […] `bar()` is testing the precise internal implementation of `doFoo()`, instead of just validating externally observable behaviour.
- `doFoo(...)`'s valid range is [5, IntMax], while `bar(...)`'s valid range is [0, IntMax]. How do you test the other values?

For the aforementioned statement:

If `Foo` tests fail but `Bar` tests pass, that means that the error is in Foo. If only `Foo` tests existed, then we would need to also go look through `bar(...)`, since we wouldn't know that it runs fine.

In summary:

- `Foo`'s unit tests should treat everything under `Foo::doFoo(int a)` as a black box, and only validate the externally observable behaviour.
- `Foo` should treat `bar(...)` as a black box that does what it says on the tin can.
- […] `bar(int a)` (so it actually does what it says it does) and treat `bar`'s implementation of how it achieves that as a black box.

Now replace `Foo::doFoo` with `createReader` and `bar` with `detectFormat`.

It does not cover it well, considering we would need to resort to workarounds such as creating nominally invalid fake files that should realistically be discarded by the factory validation, as they can't produce a valid reader device.

The format detector might work with them, because its current implementation only scans the first 4 to 6 bytes, but that does not mean that the reader factory should.

Realistically, after the format is determined and a device is created, the factory should attempt to open the file and read it, to validate that it can, before returning the reader (and optionally closing it). It would not be able to do that if it has to accept those fake files for the purposes of testing something that should be considered a black box that "just works" from its perspective. Allowing those "fake" files locks us out of any other potential validation we might be able to do on the files, when there are cleaner solutions that don't lock us out of it. That, or we now need to modify the fake files again, which is also more work that does not happen with separate unit tests.

I agree that it was specifically made for the factory's needs, but not that it is specific to it.

The format detector is perfectly viable to be viewed as a black box from the factory's perspective. File contents go in, format enum value goes out. That same black box can be reused elsewhere if needed, making it independent from the factory's implementation.
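
The "file contents go in, format enum value goes out" view can be sketched as a standalone function over a byte stream. This is not the PR's `CaptureFileFormatDetector`: the enum, the function names, and the choice to read 8 bytes (enough for the full snoop identification pattern) are assumptions, and the sketch omits cases such as the modified pcap magic mentioned in the commit list. The magic byte sequences themselves are the well-known ones for pcap, nanosecond pcap, pcapng, snoop, and Zstandard frames.

```cpp
#include <array>
#include <cassert>
#include <cstddef>
#include <cstdint>
#include <cstring>
#include <istream>
#include <sstream>
#include <string>

// Hypothetical result type for this sketch.
enum class CaptureFormat { Unknown, Pcap, PcapNano, PcapNg, Snoop, Zstd };

// True if the first N of the `available` bytes in `data` equal `magic`.
template <std::size_t N>
bool startsWith(const std::array<std::uint8_t, 8>& data, std::size_t available, const std::uint8_t (&magic)[N])
{
    return available >= N && std::memcmp(data.data(), magic, N) == 0;
}

// Peeks at the first bytes of the stream and restores the read position afterwards.
CaptureFormat detectFormat(std::istream& in)
{
    static const std::uint8_t pcapLe[] = { 0xd4, 0xc3, 0xb2, 0xa1 };
    static const std::uint8_t pcapBe[] = { 0xa1, 0xb2, 0xc3, 0xd4 };
    static const std::uint8_t nanoLe[] = { 0x4d, 0x3c, 0xb2, 0xa1 };
    static const std::uint8_t nanoBe[] = { 0xa1, 0xb2, 0x3c, 0x4d };
    static const std::uint8_t pcapNg[] = { 0x0a, 0x0d, 0x0d, 0x0a };
    static const std::uint8_t zstd[] = { 0x28, 0xb5, 0x2f, 0xfd };
    static const std::uint8_t snoop[] = { 0x73, 0x6e, 0x6f, 0x6f, 0x70, 0x00, 0x00, 0x00 };

    const std::istream::pos_type start = in.tellg();
    std::array<std::uint8_t, 8> head{};
    in.read(reinterpret_cast<char*>(head.data()), static_cast<std::streamsize>(head.size()));
    const std::size_t got = static_cast<std::size_t>(in.gcount());
    in.clear();  // a short read sets eofbit/failbit; reset before seeking back
    if (start != std::istream::pos_type(-1))
        in.seekg(start);

    if (startsWith(head, got, snoop))
        return CaptureFormat::Snoop;
    if (startsWith(head, got, pcapLe) || startsWith(head, got, pcapBe))
        return CaptureFormat::Pcap;
    if (startsWith(head, got, nanoLe) || startsWith(head, got, nanoBe))
        return CaptureFormat::PcapNano;
    if (startsWith(head, got, pcapNg))
        return CaptureFormat::PcapNg;
    if (startsWith(head, got, zstd))
        return CaptureFormat::Zstd;
    return CaptureFormat::Unknown;
}

// The detector can be exercised on in-memory streams, with no files or factory involved.
void detectorExamples()
{
    std::istringstream pcap(std::string("\xd4\xc3\xb2\xa1", 4));
    assert(detectFormat(pcap) == CaptureFormat::Pcap);
    std::istringstream junk("hello");
    assert(detectFormat(junk) == CaptureFormat::Unknown);
}
```

Because the function depends only on `std::istream`, each magic number variation can be tested in isolation, which is the separation of responsibilities argued for in the thread.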