[ntuple] Account for aliased pages in on-disk size calculation #19460
Conversation
locatorOffset = page.GetLocator().GetType() == ROOT::RNTupleLocator::ELocatorType::kTypeDAOS
                   ? page.GetLocator().GetPosition<RNTupleLocatorObject64>().GetLocation()
                   : page.GetLocator().GetPosition<std::uint64_t>();
if (locatorOffset == prevLocatorOffset)
I'm pretty sure that pageRange.GetPageInfos() is not guaranteed to return the pages in sorted order, so you might miss some duplicates here if they are not adjacent. Usually they will be, but it's not a guarantee (and I observed this in the RNTuple viewer).
Or rather, the entire set of pages returned by all the calls to GetPageInfos() is not necessarily sorted. That said, I'm not sure this has implications for detecting duplicate pages...
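A set keyed on the locator offset would be robust to ordering and would catch non-adjacent duplicates too. A minimal sketch of the idea, reusing the names from the snippet above (the set and the loop shape are assumptions, not the PR's actual code):

// Sketch: deduplicate by locator offset regardless of page ordering,
// instead of comparing against only the previous page's offset.
// Requires #include <unordered_set>.
std::unordered_set<std::uint64_t> seenPages;
for (const auto &page : pageRange.GetPageInfos()) {
   std::uint64_t locatorOffset = page.GetLocator().GetType() == ROOT::RNTupleLocator::ELocatorType::kTypeDAOS
                                    ? page.GetLocator().GetPosition<RNTupleLocatorObject64>().GetLocation()
                                    : page.GetLocator().GetPosition<std::uint64_t>();
   if (!seenPages.insert(locatorOffset).second)
      continue; // aliased page: same on-disk bytes as a page already counted
   // ... accumulate on-storage size ...
}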
I think it has: we are also deduplicating pages between page ranges of different columns. This might even be the primary case, for example two index columns of two collections that always have the same number of elements.
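For illustration, a hedged sketch of that primary case, assuming current master's API; the field names, file name, and the exact spelling of RNTupleWriteOptions::SetEnableSamePageMerging are assumptions:

#include <ROOT/RNTupleModel.hxx>
#include <ROOT/RNTupleWriteOptions.hxx>
#include <ROOT/RNTupleWriter.hxx>
#include <vector>

void writeAliasedPages()
{
   auto model = ROOT::RNTupleModel::Create();
   auto fldA = model->MakeField<std::vector<float>>("a");
   auto fldB = model->MakeField<std::vector<int>>("b");
   ROOT::RNTupleWriteOptions opts;
   opts.SetEnableSamePageMerging(true); // assumption: exact option name
   auto writer = ROOT::RNTupleWriter::Recreate(std::move(model), "ntpl", "data.root", opts);
   for (int i = 0; i < 1000; ++i) {
      fldA->assign(3, 1.f); // both collections always hold exactly three
      fldB->assign(3, i);   // elements per entry, so their offset (index)
      writer->Fill();       // columns are byte-identical and the pages of
   }                        // one can alias the pages of the other
}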
                   ? page.GetLocator().GetPosition<RNTupleLocatorObject64>().GetLocation()
                   : page.GetLocator().GetPosition<std::uint64_t>();
if (locatorOffset == prevLocatorOffset)
   continue;
nBytesOnStorage += page.GetLocator().GetNBytesOnStorage();
nBytesInMemory += page.GetNElements() * elementSize;
Aliased pages should probably be counted towards nBytesInMemory, because (at least for the moment) we actually load and decompress them multiple times.
Here nBytesInMemory is only used to calculate and print the compression factor, so I think in this specific place it should actually not be counted. However, this is a good point to take into account for the size counting taking place in the inspector; I will change it there.
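For context, the compression factor at this call site is just the ratio of the two counters, so excluding aliased pages from both keeps the ratio consistent. A sketch of the arithmetic, using the variable names from the snippet above:

// Sketch: if aliased pages were counted in nBytesInMemory but not in
// nBytesOnStorage, the reported compression factor would be inflated;
// at this call site both counters therefore exclude them.
double compressionFactor = static_cast<double>(nBytesInMemory) / static_cast<double>(nBytesOnStorage);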
Test Results: 21 files, 21 suites, 3d 11h 24m 54s ⏱️ For more details on these failures, see this check. Results for commit 7a2ced9. ♻️ This comment has been updated with latest results.
Force-pushed 93677a4 to c35859e
For RNTuples that have been written with same-page merging enabled, the pages that were aliased (i.e., repeated from the previous identical page) should not be considered in the calculation of the total RNTuple size.
Force-pushed cdbbd5a to ec64d29
Force-pushed ec64d29 to 7a2ced9
Looks generally good to me; some suggestions for the accounting details.
@@ -60,6 +60,7 @@ void ROOT::Experimental::RNTupleInspector::CollectColumnInfo()
    // to report the size _in memory_ of column elements.
    std::uint32_t elemSize = RColumnElementBase::Generate(colDesc.GetType())->GetSize();
    std::uint64_t nElems = 0;
+   std::unordered_set<std::uint64_t> seenPages{};
I think you can move the set down into the first level of the for loop, because we don't deduplicate pages across clusters. This will also limit the memory used for the set.
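A sketch of that restructuring, assuming a per-cluster outer loop in CollectColumnInfo(); the iteration calls and variable names are assumptions about the surrounding code, not the PR's actual diff:

for (const auto &clusterDesc : desc.GetClusterIterable()) {
   // One set per cluster: pages are only aliased within a cluster, and the
   // set's memory is released as soon as the cluster goes out of scope.
   std::unordered_set<std::uint64_t> seenPages;
   for (const auto &page : clusterDesc.GetPageRange(colDesc.GetPhysicalId()).GetPageInfos()) {
      // ... compute locatorOffset, then seenPages.emplace(locatorOffset) ...
   }
}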
auto [_, pageAdded] = seenPages.emplace(locatorOffset);
if (pageAdded) {
   nBytesOnStorage += page.GetLocator().GetNBytesOnStorage();
   nBytesInMemory += page.GetNElements() * elementSize;
I think for the in-memory summary, we should not take into account the page deduplication (because we don't deduplicate in memory).
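In code, that would mean decoupling the two accumulations; a sketch along those lines, under the same naming assumptions as above:

auto [_, pageAdded] = seenPages.emplace(locatorOffset);
// In-memory size: always counted, because aliased pages are still loaded
// and decompressed separately at read time (no in-memory deduplication).
nBytesInMemory += page.GetNElements() * elementSize;
// On-storage size: counted only once per distinct on-disk page.
if (pageAdded)
   nBytesOnStorage += page.GetLocator().GetNBytesOnStorage();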
   nBytesOnStorage += page.GetLocator().GetNBytesOnStorage();
   nBytesInMemory += page.GetNElements() * elementSize;
   clusters[idx].fNBytesOnStorage += page.GetLocator().GetNBytesOnStorage();
   clusters[idx].fNBytesInMemory += page.GetNElements() * elementSize;
Same here; I think I'd always sum up the in-memory size.
   clusters[idx].fNBytesOnStorage += page.GetLocator().GetNBytesOnStorage();
   clusters[idx].fNBytesInMemory += page.GetNElements() * elementSize;
   ++clusters[idx].fNPages;
   info.fNBytesOnStorage += page.GetLocator().GetNBytesOnStorage();
Here it gets tricky. As is, deduplicated pages count for none of the columns. I'd rather err the other way and account for all of the pages (aliased or not) in the per-column statistics. If we want to do it correctly, I think we need to keep a seenPages set per column and then count deduplicated pages only once. The meaning of this would be "if I only stored this column and nothing else on disk, this is what it takes".
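A sketch of that per-column bookkeeping; the map keyed by physical column id (and the ROOT::DescriptorId_t spelling) is an assumption about one possible implementation:

// One deduplication set per physical column, so a page shared between two
// columns counts once towards *each* column's statistics: "if I stored only
// this column and nothing else on disk, this is what it takes".
// Requires #include <unordered_map> and <unordered_set>.
std::unordered_map<ROOT::DescriptorId_t, std::unordered_set<std::uint64_t>> seenPagesPerColumn;
// ... then, inside the per-page loop:
auto &seenPages = seenPagesPerColumn[colDesc.GetPhysicalId()];
if (seenPages.emplace(locatorOffset).second)
   info.fNBytesOnStorage += page.GetLocator().GetNBytesOnStorage();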