Skip to content

Conversation

enirolf
Copy link
Contributor

@enirolf enirolf commented Jul 25, 2025

For RNTuples that have been written with same-page merging enabled, the pages that were aliased (i.e., repeated from a the previous identical page) should not be considered in the calculation of the total RNTuple size.

@enirolf enirolf requested review from hahnjo and silverweed July 25, 2025 08:23
@enirolf enirolf self-assigned this Jul 25, 2025
@enirolf enirolf requested a review from jblomer as a code owner July 25, 2025 08:23
locatorOffset = page.GetLocator().GetType() == ROOT::RNTupleLocator::ELocatorType::kTypeDAOS
? page.GetLocator().GetPosition<RNTupleLocatorObject64>().GetLocation()
: page.GetLocator().GetPosition<std::uint64_t>();
if (locatorOffset == prevLocatorOffset)
Copy link
Contributor

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

I'm pretty sure that pageRange.GetPageInfos() is not guaranteed to return the pages in sorted order, so you might miss some duplicates here if they are not adjacent. Usually they will be but it's not a guarantee (and I observed this is the rntviewer)

Copy link
Contributor

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

Or rather, the entire set of pages returned by all the calls to GetPageInfos() is not necessarily sorted. That said, I'm not sure this has implication for detecting duplicate pages...

Copy link
Member

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

I think it has, we are also deduplicating pages between page ranges of different columns. This might even be the primary case, for example for two index columns of two collections that always have the same number of elements.

locatorOffset = page.GetLocator().GetType() == ROOT::RNTupleLocator::ELocatorType::kTypeDAOS
? page.GetLocator().GetPosition<RNTupleLocatorObject64>().GetLocation()
: page.GetLocator().GetPosition<std::uint64_t>();
if (locatorOffset == prevLocatorOffset)
Copy link
Member

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

I think it has, we are also deduplicating pages between page ranges of different columns. This might even be the primary case, for example for two index columns of two collections that always have the same number of elements.

? page.GetLocator().GetPosition<RNTupleLocatorObject64>().GetLocation()
: page.GetLocator().GetPosition<std::uint64_t>();
if (locatorOffset == prevLocatorOffset)
continue;
nBytesOnStorage += page.GetLocator().GetNBytesOnStorage();
nBytesInMemory += page.GetNElements() * elementSize;
Copy link
Member

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

Aliased pages should probably be counted towards nBytesInMemory because (at least for the moment) we actually load and decompress them multiple times

Copy link
Contributor Author

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

Here nBytesInMemory is only used to calculate and print the compression factor, so I think in this specific place it should not be counted actually. However, this is a good point to take into account for the size counting taking place in the inspector, I will change it there.

Copy link

github-actions bot commented Jul 25, 2025

Test Results

    21 files      21 suites   3d 11h 24m 54s ⏱️
 3 218 tests  3 217 ✅ 0 💤 1 ❌
65 855 runs  65 852 ✅ 0 💤 3 ❌

For more details on these failures, see this check.

Results for commit 7a2ced9.

♻️ This comment has been updated with latest results.

@enirolf enirolf force-pushed the ntuple-inspector-size branch from 93677a4 to c35859e Compare July 25, 2025 12:40
enirolf added 2 commits July 25, 2025 14:46
For RNTuples that have been written with same-page merging enabled, the
pages that were aliased (i.e., repeated from a the previous identical
page) should not be considered in the calculation of the total on-disk
RNTuple size.
For RNTuples that have been written with same-page merging enabled, the
pages that were aliased (i.e., repeated from a the previous identical
page) should not be considered in the calculation of the total RNTuple
size.
@enirolf enirolf force-pushed the ntuple-inspector-size branch 2 times, most recently from cdbbd5a to ec64d29 Compare July 25, 2025 13:16
@enirolf enirolf requested review from hahnjo and silverweed July 25, 2025 13:16
@enirolf enirolf force-pushed the ntuple-inspector-size branch from ec64d29 to 7a2ced9 Compare July 25, 2025 14:14
Copy link
Contributor

@jblomer jblomer left a comment

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

Looks generally good to me, some suggestions for the accounting details.

@@ -60,6 +60,7 @@ void ROOT::Experimental::RNTupleInspector::CollectColumnInfo()
// to report the size _in memory_ of column elements.
std::uint32_t elemSize = RColumnElementBase::Generate(colDesc.GetType())->GetSize();
std::uint64_t nElems = 0;
std::unordered_set<std::uint64_t> seenPages{};
Copy link
Contributor

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

I think you can move the map down into the first level of the for loop because we don't deduplicate pages across clusters. This will also limit the memory used for the map.

auto [_, pageAdded] = seenPages.emplace(locatorOffset);
if (pageAdded) {
nBytesOnStorage += page.GetLocator().GetNBytesOnStorage();
nBytesInMemory += page.GetNElements() * elementSize;
Copy link
Contributor

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

I think for the in-memory summary, we should not take into account the page deduplication (because we don't deduplicate in memory).

nBytesOnStorage += page.GetLocator().GetNBytesOnStorage();
nBytesInMemory += page.GetNElements() * elementSize;
clusters[idx].fNBytesOnStorage += page.GetLocator().GetNBytesOnStorage();
clusters[idx].fNBytesInMemory += page.GetNElements() * elementSize;
Copy link
Contributor

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

Same here, I think I'd always sum up the in-memory size

clusters[idx].fNBytesOnStorage += page.GetLocator().GetNBytesOnStorage();
clusters[idx].fNBytesInMemory += page.GetNElements() * elementSize;
++clusters[idx].fNPages;
info.fNBytesOnStorage += page.GetLocator().GetNBytesOnStorage();
Copy link
Contributor

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

Here it gets tricky. As is, deduplicated pages count for none of the columns. I'd rather error the other way and account for all of the pages (aliased or not) for the per column statistics. If we want to do it correctly, I think we need to keep a seenPages map per column and then count deduplicated pages only once. The meaning of this would be "if I only stored this column and nothing else on disk, this is what it takes".

Sign up for free to join this conversation on GitHub. Already have an account? Sign in to comment
Projects
None yet
Development

Successfully merging this pull request may close these issues.

4 participants