-
Notifications
You must be signed in to change notification settings - Fork 721
Use file content heuristics to decide file reader. #1962
New issue
Have a question about this project? Sign up for a free GitHub account to open an issue and contact its maintainers and the community.
By clicking “Sign up for GitHub”, you agree to our terms of service and privacy statement. We’ll occasionally send you account related emails.
Already on GitHub? Sign in to your account
base: dev
Are you sure you want to change the base?
Conversation
…sed on the magic number.
…ics detection method.
Codecov Report❌ Patch coverage is Additional details and impacted files@@ Coverage Diff @@
## dev #1962 +/- ##
==========================================
+ Coverage 83.41% 83.45% +0.03%
==========================================
Files 311 312 +1
Lines 55026 55288 +262
Branches 11777 12149 +372
==========================================
+ Hits 45900 46139 +239
- Misses 7868 7883 +15
- Partials 1258 1266 +8
Flags with carried forward coverage won't be shown. Click here to find out more. ☔ View full report in Codecov by Sentry. 🚀 New features to boost your workflow:
|
Tests/Pcap++Test/Tests/FileTests.cpp
Outdated
PTF_ASSERT_NOT_NULL(dynamic_cast<pcpp::PcapNgFileReaderDevice*>(genericReader)); | ||
PTF_ASSERT_TRUE(genericReader->open()); | ||
// ------- IFileReaderDevice::createReader() Factory | ||
// TODO: Move to a separate unit test. |
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
We should add the following to get more coverage:
- Open a snoop file
- Open a file that is not any of the options
- Open pcap files with different magic numbers
- Assuming we add a version check for snoop and pcap file: create temp files with bogus data that has the magic number but wrong versions
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
3d713ab adds the following tests:
- Pcap, PcapNG, Zst file with correct content + extension
- Pcap, PcanNG file with correct content + wrong extension
- Bogus content file with correct extension (pcap, pcapng, zst)
- Bogus content file with wrong extension (txt)
Haven't found a snoop file to add. Do we have any?
Open pcap files with different magic numbers
Do you mean Pcap content that has just its magic number changed? Because IMO it is reasonable to consider that invalid format and fail as regular bogus data.
Assuming we add a version check for snoop and pcap file: create temp files with bogus data that has the magic number but wrong versions
Pending on #1962 (comment) .
Move it out if it needs to be reused somewhere.
Libpcap supports reading this format since 0.9.1. The heuristics detection will identify such magic number as pcap and leave final support decision to the pcap backend infrastructure.
@Dimi1010 some CI tests fail... |
…on 1 line and doxygen errors when its in 2 lines.
Pcap++/header/PcapFileDevice.h
Outdated
enum class CaptureFileFormat | ||
{ | ||
Unknown, | ||
Pcap, // regular pcap with microsecond precision | ||
PcapNano, // regular pcap with nanosecond precision | ||
PcapNG, // uncompressed pcapng | ||
PcapNGZstd, // zstd compressed pcapng | ||
Snoop, // solaris snoop | ||
}; | ||
|
||
/// @brief Heuristic file format detector that scans the magic number of the file format header. | ||
class CaptureFileFormatDetector |
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
If I'm not mistaken, this used to be in the .cpp
file, right? Is the reason we moved it to the .h
file is to make it easier to test?
If yes, I think we can test it using createReader()
- create a temporary fake file with the data we want to test, and delete it when the test is done
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
I tried that suggestion initially, but it would have been an extremely fragile unit test. The "pass" conditions would have been checked indirectly.
Also, createReader
has multiple return paths for Nano / Zst file formats, which would have caused complications since the format test would have needed to care about the environment it runs at, which it doesn't have to as a standalone.
Any additional changes to createReader
could also break the test, which they really shouldn't. For example, I am thinking of maybe adding additional logic for Zst archive to check if the compressed data is actually a pcapng, and not a random file. This would be a nightmare to make compatible with the "spoofed files" test due to assumptions on the test that createReader
doesn't do anything more complicated than check the initial magic number.
So, in the end, you end up with a more compilcated unit test to read through that:
- depends on the environment it runs on.
- can be broken not just by changes to the format detector but also changes to the
createReader
factory, too. - induces requirements on
createReader
as it uses its behavior to testdetectFormat
.
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
I understand it's better to test CaptureFileFormatDetector
as a standalone class, but it requires exposing it in the .h
file which is not great (even though it's in the internal
namespace). Testing createReader
is a bit more fragile, but I don't think the difference is that big. Of course, if we add logic to detect more file types or update the existing detection logic some tests might break, but we easily fix them as needed.
I usually try to avoid the internal
namespace where possible because it's still in the .h
file and is exposed to users, and we'd like to keep our API as clean as possible
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
Testing createReader is a bit more fragile, but I don't think the difference is that big. Of course, if we add logic to detect more file types or update the existing detection logic some tests might break, but we easily fix them as needed.
It is a big difference and it's not always an easy fix. I plan to add the aforementioned Zst checks in another PR after this one, and that would make zst spoofing in createReader
impossible, due to zst format automatically being checked for PcapNg or Unknown contents. Therefor you can't rely on the return of createReader
to find out what the return of detectFormat
was, because nullptr
can be returned from several paths from detectFormat
return value (Unknown, Nano + unsupported, Zst + unsupported). We have already had issues with tests being silently broken (#1977 comes to mind), so I would prefer to avoid fragile tests if we can.
I usually try to avoid the internal namespace where possible because it's still in the .h file and is exposed to users, and we'd like to keep our API as clean as possible
Fair, it is exposed, but the that is the entire reason of having the internal
namespace. It is a common convention that external users shouldn't really touch it. If you want to keep the primary public header files clean there are a couple options:
- I have seen many libraries have a subfolder
internal
/detail
in their public include folder, where they keep all their internal code headers that need to be exposed. That keeps the "internal" code separate from the "public" code, if users want to read through the headers. This is a common convention used in Boost libraries. "public" headers that depend on internal headers include them from theinternal
subfolder. - In the current case, we have another option. Since the
CaptureFileFormatDetector
is only needed in thecpp
part and not in the header part, we can extract it to a fully internal header, kept with the source files. This would prevent it from being exposed in the public API, but the Test project can be manually set to search for headers from "Pcap++/src" too, to allow it to link in the tests.
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
It is a big difference and it's not always an easy fix. I plan to add the aforementioned Zst checks in another PR after this one, and that would make zst spoofing in
createReader
impossible, due to zst format automatically being checked for PcapNg or Unknown contents. Therefor you can't rely on the return ofcreateReader
to find out what the return ofdetectFormat
was, becausenullptr
can be returned from several paths fromdetectFormat
return value (Unknown, Nano + unsupported, Zst + unsupported). We have already had issues with tests being silently broken (#1977 comes to mind), so I would prefer to avoid fragile tests if we can.
I'm not sure I understand... if we create fake files we know which type to expect, so all the test needs to do is verify the created file device is of the expected type 🤔
- In the current case, we have another option. Since the
CaptureFileFormatDetector
is only needed in thecpp
part and not in the header part, we can extract it to a fully internal header, kept with the source files. This would prevent it from being exposed in the public API, but the Test project can be manually set to search for headers from "Pcap++/src" too, to allow it to link in the tests.
I guess we can do that, but I still don't understand why we can't test it with createReader
or tryCreateReader
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
I'm not sure I understand... if we create fake files we know which type to expect, so all the test needs to do is verify the created file device is of the expected type 🤔
The issue is that the solution relies on testing intermediate state based on end results, and the end results do not guarantee the intermediate state.
In a standalone case you have a much simpler flow:
sequenceDiagram
participant UnitTest
participant CaptureFormatDetector
Note over UnitTest: Test setup
UnitTest ->> CaptureFormatDetector: detectFormat()
activate CaptureFormatDetector
CaptureFormatDetector-->>UnitTest : Format Enum
deactivate CaptureFormatDetector
alt Expected value
Note over UnitTest: Test success
else Unexpected value
Note over UnitTest: Test failure
end
In a situation where a spoofed file is used to test through createReader
you have a much more complicated flow:
sequenceDiagram
participant UnitTest
participant ReaderFactory
participant CaptureFormatDetector
participant Runtime
Note over UnitTest: Test setup
UnitTest ->> ReaderFactory: createReader()
activate ReaderFactory
ReaderFactory ->> ReaderFactory: open file
break File open error
ReaderFactory -->> UnitTest: nullptr
end
ReaderFactory ->> CaptureFormatDetector: detectFormat()
activate CaptureFormatDetector
CaptureFormatDetector-->>ReaderFactory: Format Enum
deactivate CaptureFormatDetector
alt Pcap
ReaderFactory -->> UnitTest: PcapDevice
else PcapNano
ReaderFactory ->> Runtime: checkNanoSupport()
activate Runtime
Runtime -->> ReaderFactory: Runtime ZST support boolean.
deactivate Runtime
alt Supported
ReaderFactory -->> UnitTest: PcapDevice
else Unsupported
ReaderFactory -->> UnitTest: nullptr
end
else PcapNG
ReaderFactory -->> UnitTest: PcapNGDevice
else PcapNG Zst
ReaderFactory ->> Runtime: checkZstdSupport()
activate Runtime
Runtime -->> ReaderFactory: Runtime ZST support boolean.
deactivate Runtime
alt Supported
opt Additional Zst checks
Note right of ReaderFactory: Potential additional validation for ZST to drop random archives.<br/>This is to avoid creating devices from archives which can't be read by LightPcapNG.
ReaderFactory ->> ReaderFactory: unpack first Zst segment
break Zst unpack failed
ReaderFactory -->> UnitTest: nullptr
end
ReaderFactory -->> CaptureFormatDetector: detectFormat() on unpacked segment
activate CaptureFormatDetector
CaptureFormatDetector-->>ReaderFactory: Inner content format Enum
deactivate CaptureFormatDetector
break Format is not PcapNG
ReaderFactory -->> UnitTest: nullptr
end
end
ReaderFactory -->> UnitTest: PcapNGDevice
else Unsupported
ReaderFactory -->> UnitTest: nullptr
end
else Snoop
ReaderFactory -->> UnitTest: SnoopDevice
else Unknown
ReaderFactory -->> UnitTest: nullptr
end
deactivate ReaderFactory
Note over UnitTest: TestValidation?
Unit testing the createReader
overall is all well and good, but unit testing specifically the intermediate result of CaptureFormatDetector::detectFormat
is a lot more brittle this way. Additionally, a test failure can be generated even with valid detectFormat
behaviour, due to external factors, like code changes completely unrelated to the format detection infrastructure that is supposed to be unit tested. There are several branches that return nullptr
that are unrelated to the detectFormat
result.
Some examples:
PcapNano
format unit test is now suddenly dependent on runtime support, even though thedetectFormat
is not, and it should work on all runtimes.ZstArchive
format unit test would also be dependent on runtime support, even though it should be detected on all runtimes.Additional Zst checks
would make unit testing a spoofed ZST archive format impossible, as it would fail unpacking.
It overall makes the unit test a lot more complicated to reason about with a lot of corner cases you need to cover. Ultimately you are left, not with a unit test for detectFormat
, but with a bad unit test for createReader
. General best practices are to keep unit tests simple and isolated to the function you are testing.
Hope that helps clear it up. 🙂
PS: TIL Github issues support UML diagrams.
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
d8f7419 Updated to separate header / source under Pcap++/src
} | ||
}; | ||
|
||
PTF_TEST_CASE(TestFileFormatDetector) |
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
Please see my previous comment. Maybe we can create a temp fake file with the expected data and run createReader()
The PR adds heuristics based on the file content that is more robust than deciding based on the file extension.
The new decision model scans the start of the file for its magic number signature. It then compares it to the signatures of supported file types [1] and constructs a reader instance based on the result.
A new function
createReader
andtryCreateReader
has been added due to changes in the public API of the factory.The functions differ in the error handling scheme, as
createReader
throws andtryCreateReader
returnsnullptr
on error.Method behaviour changes during erroneous scenarios:
getReader
createReader
tryCreateReader
nullptr
PcapFileDeviceReader
nullptr