Description
Jira Link: DB-18379
BackfillChunk errors are classified as either retryable or nonretryable. A nonretryable error aborts the entire backfill and prevents the index creation from completing. The classification is largely implemented as a whitelist of retryable errors; everything else is treated as nonretryable.
We have seen cases where backfill fails due to rare, unexpected errors such as "PerformRPC timed out". Many transient errors like this are retryable. It is probably better to switch from a whitelist of retryable errors to a blacklist of nonretryable errors, such as a duplicate key error on a unique index or a failure to establish a connection to PG. Such a list may be difficult to construct. Even so, a backfill that fails because of a flaky issue is more frustrating than one that succeeds slowly after many retries.
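As a rough illustration of the blacklist idea, here is a minimal Python sketch. It is not the actual BackfillChunk code: the names `ErrorClass`, `NONRETRYABLE_MARKERS`, and `classify_denylist` are hypothetical, and matching on message substrings is an assumption made only to keep the example short.

```python
from enum import Enum


class ErrorClass(Enum):
    RETRYABLE = "retryable"
    NONRETRYABLE = "nonretryable"


# Hypothetical markers for errors that retrying cannot fix.
# The real error messages and matching mechanism may differ.
NONRETRYABLE_MARKERS = (
    "duplicate key",            # unique index violation
    "failed to connect to pg",  # cannot establish a PG connection
)


def classify_denylist(message: str) -> ErrorClass:
    """Treat an error as retryable unless it matches a known nonretryable marker."""
    lowered = message.lower()
    if any(marker in lowered for marker in NONRETRYABLE_MARKERS):
        return ErrorClass.NONRETRYABLE
    # Anything unknown, including rare transient errors, defaults to retryable.
    return ErrorClass.RETRYABLE
```

With this default, an unexpected error such as "PerformRPC timed out" is retried without anyone having to list it in advance; the cost is that a genuinely fatal error missing from the list is retried until the retry limit is reached.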
Also, the BackfillChunk max retries is set to 150, which is too high. It should be lowered to, say, 15 for retryable errors and 5 for nonretryable errors. Allowing retries for nonretryable errors sounds counterintuitive, but this assumes the current approach, where a whitelist defines the retryable errors and everything else is nonretryable. In that case, "nonretryable" is effectively synonymous with "unclassified".
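The two retry budgets could look roughly like the Python sketch below. It keeps the whitelist approach and only gives unclassified errors a small budget of their own. All names here (`BackfillChunkError`, `RETRYABLE_MARKERS`, `run_chunk_with_retries`) are hypothetical, the limits 15 and 5 are the values proposed above, and counting each class of failure separately is one possible design, not necessarily how the existing retry counter works.

```python
from typing import Callable

MAX_RETRIES_RETRYABLE = 15    # proposed limit for whitelisted errors
MAX_RETRIES_UNCLASSIFIED = 5  # proposed limit for everything else

# Hypothetical whitelist; the real one is not reproduced here.
RETRYABLE_MARKERS = ("timed out", "try again")


class BackfillChunkError(Exception):
    """Stand-in for whatever error a chunk backfill attempt reports."""


def is_whitelisted(message: str) -> bool:
    lowered = message.lower()
    return any(marker in lowered for marker in RETRYABLE_MARKERS)


def run_chunk_with_retries(backfill_chunk: Callable[[], None]) -> int:
    """Run one chunk, retrying within per-class budgets.

    Returns the number of retries used. Re-raises the last error once
    the budget for that error's class is exhausted.
    """
    retryable_failures = 0
    unclassified_failures = 0
    while True:
        try:
            backfill_chunk()
            return retryable_failures + unclassified_failures
        except BackfillChunkError as err:
            if is_whitelisted(str(err)):
                retryable_failures += 1
                if retryable_failures > MAX_RETRIES_RETRYABLE:
                    raise
            else:
                unclassified_failures += 1
                if unclassified_failures > MAX_RETRIES_UNCLASSIFIED:
                    raise
```

A production version would presumably add a delay or backoff between attempts; that is left out here to keep the budget logic visible.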
Issue Type
kind/enhancement
Warning: Please confirm that this issue does not contain any sensitive information
- I confirm this issue does not contain any sensitive information.