412 changes: 196 additions & 216 deletions Cargo.lock

Large diffs are not rendered by default.

3 changes: 2 additions & 1 deletion Cargo.toml
@@ -15,6 +15,7 @@ categories = ["algorithms", "data-structures"]
enum_dispatch = "0.3.13"
serde = { version = "1.0", optional = true }
wyhash = "0.5.0"
mem_dbg = {version = "0.2.4", optional = true}

[dev-dependencies]
amadeus-streaming = "0.4.3"
@@ -34,7 +35,7 @@ name = "cardinality_estimator"
harness = false

[features]
default = []
default = ["mem_dbg"]
Contributor

Is it really necessary to enable the mem_dbg feature by default?

Author

Only if, by the end of this PR, we were to replace size_of with mem_dbg: the two methods are by definition very overlapping in scope. Possibly, reading the other comments, you intended size_of to include only the size of the stored object, while mem_dbg has the goal of computing the size of the object in memory.

Contributor

If mem_dbg aims to show real heap memory usage similarly to size_of, then it could be a good replacement for, or alternative to, the current size_of implementation.

with_serde = ["serde"]

[profile.release]
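Regarding the size_of / mem_dbg overlap discussed above, here is a minimal sketch (not part of this PR) of how the two accounting styles compare; the `Sketch` type and its field are hypothetical, made up for illustration:

```rust
use mem_dbg::{MemSize, SizeFlags};

// Hypothetical stand-in for an estimator representation that owns its registers.
#[derive(MemSize)]
struct Sketch {
    registers: Vec<u32>,
}

fn main() {
    let s = Sketch { registers: vec![0u32; 1024] };

    // Hand-rolled accounting in the spirit of the crate's `size_of`:
    // the struct's own footprint plus the heap buffer behind the Vec.
    let manual = std::mem::size_of::<Sketch>()
        + s.registers.len() * std::mem::size_of::<u32>();

    // Derived accounting. The PR additionally passes SizeFlags::FOLLOW_REFS because
    // the real representations hold `&mut [u32]` references rather than owned Vecs.
    let derived = s.mem_size(SizeFlags::default());

    println!("manual = {manual} bytes, mem_dbg = {derived} bytes");
}
```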
1 change: 1 addition & 0 deletions benches/cardinality_estimator.rs
@@ -186,6 +186,7 @@ fn measure_error<E: CardinalityEstimatorTrait<usize>>(cardinality: usize) -> Str
}

#[derive(Tabled)]
#[cfg_attr(feature = "mem_dbg", derive(mem_dbg::MemDbg, mem_dbg::MemSize))]
struct StatRecord {
cardinality: usize,
cardinality_estimator: String,
14 changes: 13 additions & 1 deletion src/array.rs
@@ -25,6 +25,7 @@ const LEN_OFFSET: usize = 56;
/// Mask used for accessing heap allocated data stored at the pointer in `data` field.
const PTR_MASK: usize = ((1 << LEN_OFFSET) - 1) & !3;

#[cfg_attr(feature = "mem_dbg", derive(mem_dbg::MemDbg, mem_dbg::MemSize))]
/// Array representation container
pub(crate) struct Array<'a, const P: usize, const W: usize> {
/// Number of items stored in the array
@@ -110,7 +111,18 @@ impl<'a, const P: usize, const W: usize> RepresentationTrait for Array<'a, P, W>
/// Return memory size of `Array` representation
#[inline]
fn size_of(&self) -> usize {
size_of::<usize>() + size_of_val(self.arr)
// The size of the len attribute, which is a usize
let len_size = size_of::<usize>();
// The size of the cap attribute, which is a usize
let cap_size = size_of::<usize>();
// The size of the arr attribute, which is a slice
// and therefore composed of a pointer (usize) and a length (usize)
let arr_size = size_of::<usize>() * 2;
Contributor

Not sure I understand the "multiply by 2" aspect of this logic. However, I don't think we should measure memory usage of Array and other structs except CardinalityEstimator, since these are stack allocated and not actually stored or exposed beyond pub(crate) visibility.

// The size of the values in the arr attribute, which are
// u32 values
let arr_size_values = size_of_val(self.arr);

len_size + cap_size + arr_size + arr_size_values
}

/// Free memory occupied by the `Array` representation
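On the "multiply by 2" question in the thread above: a `&mut [u32]` is a fat pointer, i.e. a data pointer plus a length, so its own footprint is two `usize`s. A standalone check (not part of the PR):

```rust
fn main() {
    // A slice reference carries a data pointer and a length.
    assert_eq!(
        std::mem::size_of::<&mut [u32]>(),
        2 * std::mem::size_of::<usize>()
    );
    // Prints 16 on a 64-bit target, 8 on a 32-bit one.
    println!("{}", std::mem::size_of::<&mut [u32]>());
}
```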
32 changes: 30 additions & 2 deletions src/estimator.rs
@@ -125,7 +125,7 @@ where

/// Return memory size of `CardinalityEstimator`
pub fn size_of(&self) -> usize {
self.representation().size_of()
self.representation().size_of() + std::mem::size_of::<Self>()
}
}

@@ -181,10 +181,38 @@ where
H: Hasher + Default,
{
fn fmt(&self, f: &mut Formatter<'_>) -> std::fmt::Result {
write!(f, "{:?}", self.representation())

#[cfg(not(feature = "mem_dbg"))]
return write!(f, "{:?}", self.representation());

#[cfg(feature = "mem_dbg")]
{
use mem_dbg::MemSize;
return write!(f, "size_of = {:?} mem_dbg = {:?}", self.size_of(), <Self as MemSize>::mem_size(self, mem_dbg::SizeFlags::default() | mem_dbg::SizeFlags::FOLLOW_REFS));
}
}
}

#[cfg(feature = "mem_dbg")]
impl<T, H, const P: usize, const W: usize> mem_dbg::MemSize for CardinalityEstimator<T, H, P, W>
where
T: Hash + ?Sized,
H: Hasher + Default,
{
fn mem_size(&self, flags: mem_dbg::SizeFlags) -> usize {
core::mem::size_of::<Self>() + <Representation<P, W> as mem_dbg::MemSize>::mem_size(&self.representation(), flags)
}
}

#[cfg(feature = "mem_dbg")]
impl<T, H, const P: usize, const W: usize> mem_dbg::MemDbgImpl for CardinalityEstimator<T, H, P, W>
where
T: Hash + ?Sized,
H: Hasher + Default,
{

}

#[cfg(test)]
pub mod tests {
use super::*;
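With the `mem_dbg` feature enabled, the `MemSize` impl added above could be exercised roughly as follows. This is a sketch that assumes the crate's default generic parameters and its public `insert` and `size_of` methods; exact signatures may differ:

```rust
use cardinality_estimator::CardinalityEstimator;
use mem_dbg::{MemSize, SizeFlags};

fn main() {
    // Assumes the default hasher and default P/W parameters.
    let mut estimator = CardinalityEstimator::<str>::new();
    for i in 0..10_000 {
        estimator.insert(&format!("item-{i}"));
    }

    // Hand-rolled accounting exposed by the crate.
    println!("size_of()  = {} bytes", estimator.size_of());

    // mem_dbg accounting via the impl added in this PR; FOLLOW_REFS walks the
    // `&mut [u32]` held by the Array / HyperLogLog representations.
    let flags = SizeFlags::default() | SizeFlags::FOLLOW_REFS;
    println!("mem_size() = {} bytes", estimator.mem_size(flags));
}
```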
10 changes: 8 additions & 2 deletions src/hyperloglog.rs
@@ -23,6 +23,7 @@ use crate::representation::RepresentationTrait;
const PTR_MASK: usize = !3;

#[derive(PartialEq)]
#[cfg_attr(feature = "mem_dbg", derive(mem_dbg::MemDbg, mem_dbg::MemSize))]
pub(crate) struct HyperLogLog<'a, const P: usize = 12, const W: usize = 6> {
pub(crate) data: &'a mut [u32],
}
@@ -151,7 +152,12 @@ impl<'a, const P: usize, const W: usize> RepresentationTrait for HyperLogLog<'a,
/// Return memory size of `HyperLogLog`
#[inline]
fn size_of(&self) -> usize {
size_of::<usize>() + size_of_val(self.data)
// The size of the slice reference, which is composed of a pointer (usize) and a length (usize)
let slice_size = size_of::<&mut [u32]>();
// The size of the values in the slice, which are u32 values
let slice_size_values = size_of_val(self.data);

slice_size + slice_size_values
}

/// Free memory occupied by the `HyperLogLog` representation
@@ -318,7 +324,7 @@ const BETA: [[f64; 8]; 15] = [
-1.65687801845180e-02,
-7.95829341087617e-02,
4.71830602102918e-02,
-7.81372902346934e03,
-7.81372902346934e-03,
5.84268708489995e-04,
],
// p = 12
64 changes: 61 additions & 3 deletions src/representation.rs
@@ -15,8 +15,9 @@ const REPRESENTATION_ARRAY: usize = 0x0000_0000_0000_0001;
const REPRESENTATION_HLL: usize = 0x0000_0000_0000_0003;

/// Representation types supported by `CardinalityEstimator`
#[repr(u8)]
#[derive(Debug, PartialEq)]
// #[repr(u8)]
#[derive(PartialEq)]
#[cfg_attr(feature = "mem_dbg", derive(mem_dbg::MemDbg, mem_dbg::MemSize))]
#[enum_dispatch]
pub(crate) enum Representation<'a, const P: usize, const W: usize> {
Small(Small<P, W>),
@@ -25,7 +26,6 @@ pub(crate) enum Representation<'a, const P: usize, const W: usize> {
}

/// Representation trait which must be implemented by all representations.
#[enum_dispatch(Representation<P, W>)]
pub(crate) trait RepresentationTrait {
fn insert_encoded_hash(&mut self, h: u32) -> usize;
fn estimate(&self) -> usize;
@@ -37,6 +37,64 @@ pub(crate) trait RepresentationTrait {
}
}

impl<'a, const P: usize, const W: usize> core::fmt::Debug for Representation<'a, P, W> {
fn fmt(&self, f: &mut core::fmt::Formatter<'_>) -> std::fmt::Result {
write!(f, "{}", self.to_string())
}
}

impl<'a, const P: usize, const W: usize> RepresentationTrait for Representation<'a, P, W> {
#[inline]
fn insert_encoded_hash(&mut self, h: u32) -> usize {
match self {
Representation::Small(s) => s.insert_encoded_hash(h) as usize,
Representation::Array(a) => a.insert_encoded_hash(h) as usize,
Representation::Hll(hll) => hll.insert_encoded_hash(h) as usize,
}
}

#[inline]
fn estimate(&self) -> usize {
match self {
Representation::Small(s) => s.estimate(),
Representation::Array(a) => a.estimate(),
Representation::Hll(hll) => hll.estimate(),
}
}

#[inline]
fn size_of(&self) -> usize {
// A priori, we do not know which of the child structs is the largest and therefore
// accounts for the bulk of the enum's memory usage. Therefore, we subtract each
// variant's own stack size from its size_of, and then add the size of the enum itself.
Contributor

Usually the Representation enum is not stored directly and has pub(crate) visibility. It's only created when accessing CardinalityEstimator methods and is usually optimized away by the compiler anyway. What's stored is the CardinalityEstimator struct and its data.

Author

Ok, I see, so basically what I am currently measuring would be the 'runtime peak size' in place of the storage size. I suppose the question then is which one we are more interested in measuring. Which one would you rather go for?

Contributor

We're primarily interested in measuring the heap-allocated memory usage of the overall CardinalityEstimator, not the temporarily used individual sub-structs Small, Array and HyperLogLog, so that we can reliably and accurately measure how much memory a Vec or HashMap of 1M or 100M CardinalityEstimators uses.

Contributor

The Representation enum and its sub-structs are never actually stored on the heap; these are stack-allocated and quickly disposed structs. I think we should only report how much memory CardinalityEstimator uses.

let size_of_variant = match self {
Representation::Small(s) => s.size_of() - size_of::<Small<P, W>>(),
Representation::Array(a) => a.size_of() - size_of::<Array<'a, P, W>>(),
Representation::Hll(hll) => hll.size_of() - size_of::<HyperLogLog<'a, P, W>>(),
};

size_of_variant + size_of::<Self>()
}

#[inline]
unsafe fn drop(&mut self) {
match self {
Representation::Small(s) => s.drop(),
Representation::Array(a) => a.drop(),
Representation::Hll(hll) => hll.drop(),
}
}

#[inline]
fn to_data(&self) -> usize {
match self {
Representation::Small(s) => s.to_data(),
Representation::Array(a) => a.to_data(),
Representation::Hll(hll) => hll.to_data(),
}
}
}

/// Representation error
#[derive(Debug)]
pub enum RepresentationError {
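As a side note on the size_of arithmetic above: the enum's stack footprint is fixed (roughly the largest variant plus its tag), so only the heap bytes reachable from the active variant differ. A standalone illustration with a hypothetical two-variant enum, not the crate's actual types:

```rust
// Hypothetical stand-in for the Small / Array / Hll variants.
enum Repr<'a> {
    Small(usize),
    Slice(&'a mut [u32]),
}

fn heap_bytes(r: &Repr) -> usize {
    match r {
        // The inline payload is already covered by size_of::<Repr>().
        Repr::Small(_) => 0,
        // Bytes behind the slice reference.
        Repr::Slice(s) => s.len() * std::mem::size_of::<u32>(),
    }
}

fn total_bytes(r: &Repr) -> usize {
    std::mem::size_of::<Repr>() + heap_bytes(r)
}

fn main() {
    let mut buf = vec![0u32; 256];
    let small = Repr::Small(42);
    let slice = Repr::Slice(&mut buf[..]);
    println!("small: {} bytes", total_bytes(&small)); // just the enum itself
    println!("slice: {} bytes", total_bytes(&slice)); // enum + 1024 heap bytes
}
```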
1 change: 1 addition & 0 deletions src/small.rs
@@ -16,6 +16,7 @@ const SMALL_MASK: usize = 0x0000_0000_7fff_ffff;

/// Small representation container
#[derive(PartialEq)]
#[cfg_attr(feature = "mem_dbg", derive(mem_dbg::MemDbg, mem_dbg::MemSize))]
pub(crate) struct Small<const P: usize, const W: usize>(usize);

impl<const P: usize, const W: usize> Small<P, W> {