@junngo junngo commented Dec 3, 2025

Hi there :)
There was a DB overload issue in #9085, so I opened a new one with an improved query.
Normally, when the PerformanceDatumReplicate table is queried, the optimizer chooses a full (sequential) table scan, which puts heavy load on the database.
I checked the query in the GCP Insights menu: the previous version took about 2–3 minutes per call, leading to overload.
It now runs in 0–2 seconds, which resolves the overload.

Documenting this here for future reference:

  1. Simple IN lookup
SELECT * FROM performance_datum_replicate WHERE performance_datum_id IN (1, 2, 3, ... 10,000)

The optimizer chose a full (sequential) scan even though performance_datum_id is an indexed column. It overestimates the number of matching rows, assumes a large result set, and opts for the full scan regardless of whether matching data actually exists.
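To see what the planner decides for a given query, you can inspect the plan directly. A minimal sketch follows, using Python's stdlib `sqlite3` and a made-up stand-in for `performance_datum_replicate` (table, index name, and rows are all invented here); SQLite's `EXPLAIN QUERY PLAN` is only an analogue of PostgreSQL's `EXPLAIN (ANALYZE)`, and SQLite's planner will not reproduce the Postgres misestimate described above — the point is just how to check which access path was chosen.

```python
import sqlite3

# In-memory stand-in for performance_datum_replicate (hypothetical data).
conn = sqlite3.connect(":memory:")
conn.execute(
    "CREATE TABLE performance_datum_replicate ("
    "performance_datum_id INTEGER, value REAL)"
)
conn.execute(
    "CREATE INDEX idx_replicate_datum_id "
    "ON performance_datum_replicate (performance_datum_id)"
)
conn.executemany(
    "INSERT INTO performance_datum_replicate VALUES (?, ?)",
    [(i, float(i)) for i in range(100)],
)

# EXPLAIN QUERY PLAN is SQLite's rough analogue of PostgreSQL's EXPLAIN.
plan = conn.execute(
    "EXPLAIN QUERY PLAN "
    "SELECT * FROM performance_datum_replicate "
    "WHERE performance_datum_id IN (1, 2, 3)"
).fetchall()
for row in plan:
    print(row[-1])  # the human-readable plan step, e.g. SEARCH/SCAN details
```

On PostgreSQL the equivalent check is `EXPLAIN (ANALYZE) SELECT ...`, which is how the full-scan choice above was diagnosed.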

  2. Join (small result → index, larger result → full scan)
SELECT "performance_datum_replicate"."performance_datum_id", "performance_datum_replicate"."value"
FROM "performance_datum"
INNER JOIN "performance_datum_replicate" ON ("performance_datum_replicate"."performance_datum_id" = "performance_datum"."id")
WHERE (
    "performance_datum"."push_timestamp" >= '2025-11-18'
    AND "performance_datum"."repository_id" = 77
    AND "performance_datum"."signature_id" = 5419513
)

When the performance_datum result set is small, performance_datum_replicate is scanned via the index.
But as soon as the performance_datum result set grows even slightly, the optimizer falls back to a full table scan.
The join itself is fine, but the size of the driving table affects the optimizer's decision and can trigger the full-scan fallback.
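The semantics of the join form can be sanity-checked in isolation. The sketch below runs the same query shape on an in-memory SQLite database with invented rows (schemas reduced to the columns the query touches); it verifies which replicates the filters select, not the Postgres planner behavior discussed above.

```python
import sqlite3

# Reduced stand-ins for the two tables; all rows are made up.
conn = sqlite3.connect(":memory:")
conn.executescript("""
CREATE TABLE performance_datum (
    id INTEGER PRIMARY KEY,
    push_timestamp TEXT,
    repository_id INTEGER,
    signature_id INTEGER
);
CREATE TABLE performance_datum_replicate (
    performance_datum_id INTEGER,
    value REAL
);
""")
conn.executemany(
    "INSERT INTO performance_datum VALUES (?, ?, ?, ?)",
    [
        (1, "2025-11-20", 77, 5419513),  # matches all filters
        (2, "2025-11-21", 77, 5419513),  # matches all filters
        (3, "2025-11-01", 77, 5419513),  # too old
        (4, "2025-11-20", 42, 5419513),  # wrong repository
    ],
)
conn.executemany(
    "INSERT INTO performance_datum_replicate VALUES (?, ?)",
    [(1, 1.5), (1, 1.7), (2, 2.5), (3, 3.5)],
)

# Same join shape as above: filter performance_datum, then join replicates.
rows = conn.execute("""
SELECT r.performance_datum_id, r.value
FROM performance_datum d
INNER JOIN performance_datum_replicate r
    ON r.performance_datum_id = d.id
WHERE d.push_timestamp >= '2025-11-18'
  AND d.repository_id = 77
  AND d.signature_id = 5419513
ORDER BY r.performance_datum_id, r.value
""").fetchall()
print(rows)  # only replicates for datums 1 and 2 survive the filters
```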

  3. Subquery + EXISTS (index scan)
SELECT "performance_datum_replicate"."performance_datum_id", "performance_datum_replicate"."value" 
FROM "performance_datum_replicate" 
WHERE "performance_datum_replicate"."performance_datum_id" IN (
    SELECT V0."id"
    FROM "performance_datum" V0
    WHERE (
        V0."push_timestamp" >= '2025-11-19T05:26:58.794520'::timestamp 
        AND V0."repository_id" = 77 
        AND V0."signature_id" = 4859357
        AND EXISTS(
            SELECT 1 AS "a" 
            FROM "performance_datum_replicate" U0 
            WHERE U0."performance_datum_id" = (V0."id") 
            LIMIT 1
        )
    )
)

I implemented the ORM code using the query above. The EXISTS clause prunes candidates efficiently, and the subquery guides PostgreSQL to fetch replicates with index lookups instead of full scans.
When I tested this in Redash, queries usually completed in about 0–2 seconds. (If the query is not cached, or depending on DB state, it may take a bit longer, but the ORM query is designed to encourage index scans.)
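For reference, the EXISTS-guarded form can also be checked for correctness on its own. The sketch below (stdlib `sqlite3`, invented rows) runs the nested query shape and confirms that the EXISTS predicate drops datum ids with no replicates before the outer IN lookup; the index-scan benefit itself is a PostgreSQL planner behavior that SQLite won't reproduce. In Django this shape corresponds to `Exists`/`Subquery` expressions, though the exact ORM code is in the PR diff, not shown here.

```python
import sqlite3

# Reduced stand-ins for the two tables; all rows are hypothetical.
conn = sqlite3.connect(":memory:")
conn.executescript("""
CREATE TABLE performance_datum (
    id INTEGER PRIMARY KEY,
    push_timestamp TEXT,
    repository_id INTEGER,
    signature_id INTEGER
);
CREATE TABLE performance_datum_replicate (
    performance_datum_id INTEGER,
    value REAL
);
CREATE INDEX idx_rep ON performance_datum_replicate (performance_datum_id);
""")
conn.executemany(
    "INSERT INTO performance_datum VALUES (?, ?, ?, ?)",
    [
        (1, "2025-11-20", 77, 4859357),  # matches filters, has replicates
        (2, "2025-11-21", 77, 4859357),  # matches filters, no replicates
        (3, "2025-11-01", 77, 4859357),  # too old
    ],
)
conn.executemany(
    "INSERT INTO performance_datum_replicate VALUES (?, ?)",
    [(1, 1.5), (1, 1.7), (3, 3.5)],
)

# Same shape as the query above: EXISTS prunes datum ids with no
# replicates, then the outer IN fetches the replicate rows.
rows = conn.execute("""
SELECT r.performance_datum_id, r.value
FROM performance_datum_replicate r
WHERE r.performance_datum_id IN (
    SELECT v0.id
    FROM performance_datum v0
    WHERE v0.push_timestamp >= '2025-11-19'
      AND v0.repository_id = 77
      AND v0.signature_id = 4859357
      AND EXISTS (
          SELECT 1 FROM performance_datum_replicate u0
          WHERE u0.performance_datum_id = v0.id
          LIMIT 1
      )
)
ORDER BY r.value
""").fetchall()
print(rows)  # only datum 1 passes both the filters and the EXISTS check
```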

Performance Test in Redash

Frequently Used (Large) Signatures

  • When a signature contains 10,000+ PerformanceDatum rows, execution generally benefits from cache locality.
  • Cold execution: 3–8 seconds
    • The first runs may incur disk I/O while loading the relevant performance_datum_replicate blocks into shared buffers and the page cache.
  • Warm execution: ~1 second
    • Subsequent executions for the same signature use cached blocks only, avoiding additional disk fetches.

Infrequently Used (Small) Signatures

  • These signatures contain a small number of datum rows, leading to fast execution regardless of cache state.
  • Execution time: ~1 second

Note

Cache effectiveness scales with signature size and task frequency.
Large signatures → frequently queried → highly likely to be warm-cached.
Small signatures → lightly queried → inherently fast due to low data volume.

@junngo junngo marked this pull request as ready for review December 3, 2025 13:36