Skip to content

Conversation

@sidd-27
Copy link
Contributor

@sidd-27 sidd-27 commented Dec 31, 2025

This fixes the error in the capacity calculation math, where more than 6 times more space was being allocated than needed, because the capacity math used permutations instead of combination

@sidd-27
Copy link
Contributor Author

sidd-27 commented Dec 31, 2025

I'm a maintainer at kornia-rs, which has its own native implementation of apriltag, and I made a few observations

The current structure for the table slots is -

struct quick_decode_entry
{
    uint64_t rcode;   // the queried code
    uint16_t id;      // the tag ID (a small integer)
    uint8_t hamming;  // how many errors corrected?
    uint8_t rotation; // number of rotations [0, 3]
};

which, ends up being 16 bytes long due to alignment, thus we lose 4 bytes to padding. To remedy this, we could decouple the rcode and the rest of the metadata to achieve something like this

struct QuickDecodeEntry {
// [ ID (16) | Hamming (6) | Rot (2) ] stored in 3 bytes
uint8_t data[3];
};

struct QuickDecode {
vector<uint64_t > rcodes;
vector entries;
};

storing exactly the amount of data we need in just 11 bytes per entry, resulting in a 31% improvement in memory usage. Another benefit is more optimal cache utilisation, as during matching, we no longer need to load in the metadata, which polluted the cache line

i did a simple implementation of this in Rust, and these were the results I got -

Creation – Original: ~14.7 ms
Creation – Optimized: ~14.1 ms

Lookup – Original: ~872 µs
Lookup – Optimized: ~610 µs, ~30% faster

Would you be interested in these optimizations?

@sidd-27
Copy link
Contributor Author

sidd-27 commented Dec 31, 2025

Additionally, another minor point: Is the rotation field in the quickdecode table effectively useless? It's never set anywhere and is immediately overwritten in the quick_decode_codeword function. pretty sure we can safely remove it; it would not save space in the current implementation due to padding, but still a good thing to clean up, in a more optimized implementation, we could only store the "cannonical" form of the code, ie the one which is the minimal in value amongst all 4 rotations, and also store how many rotations it took to achieve that, and while lookups, instead of 4 lookups one for each rotation, we could just lookup the minimum rotation, and calculate how much offset it took, thereby reducing the number of random accesses from 4 to 1, which may be very good for performance if the tables are larger and don't fit in CPU caches

this with the one above beings the space needed from 6.3 gbs to 2.2 Gbs, while having a 3 times faster quick decode function

@christian-rauch
Copy link
Collaborator

Would you be interested in these optimizations?

Yes, sure. If you improve the runtime performance without affecting the detection performance, why not.

@christian-rauch
Copy link
Collaborator

Additionally, another minor point: Is the rotation field in the quickdecode table effectively useless? It's never set anywhere and is immediately overwritten in the quick_decode_codeword function.

The rotation is set in quick_decode_codeword and used in quad_decode_task.

@sidd-27
Copy link
Contributor Author

sidd-27 commented Dec 31, 2025

I meant the space in the table to store the rotation is not used efficiently, since the value is computed inside the function, it's possible to remove it entirely from the table and have the rotation value returned some other way

@christian-rauch
Copy link
Collaborator

I meant the space in the table to store the rotation is not used efficiently, since the value is computed inside the function, it's possible to remove it entirely from the table and have the rotation value returned some other way

I see. You mean because struct quick_decode_entry entry; is created set and read inside quad_decode_task, the rotation value could be passed differently. You are right. I guess, quad_decode could set the id, hamming, and H, in det directly instead of storing it in quick_decode_entry before.

@christian-rauch
Copy link
Collaborator

Fixes #413.

@christian-rauch
Copy link
Collaborator

I can confirm that this roughly halves the memory requirements for the tagStandard52h13 family.

@christian-rauch christian-rauch merged commit d63f817 into AprilRobotics:master Jan 3, 2026
51 of 66 checks passed
Sign up for free to join this conversation on GitHub. Already have an account? Sign in to comment

Labels

None yet

Projects

None yet

Development

Successfully merging this pull request may close these issues.

2 participants