[CELEBORN-2183] Fix client tries to start a connection with excluded workers #3517

r7raul1984 · 2025-10-23T03:03:33Z

What changes were proposed in this pull request?

Do not attempt to fetch from any worker that has already been added to fetchExcludedWorkers. Instead, try fetching from a peer worker within the same PartitionLocation.

Why are the changes needed?

With the configuration celeborn.client.fetch.excludeWorkerOnFailure.enabled=true and celeborn.client.push.replicate.enabled=true, when a fetch operation fails for a specific worker, that worker is added to fetchExcludedWorkers. However, subsequent attempts still try to fetch from this worker, causing repeated errors and slowing down shuffle fetch performance.
The logic should be modified so that any worker already in fetchExcludedWorkers is completely excluded from further fetch operations.
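The intended selection rule can be sketched as follows. This is a minimal illustration, not the code in this PR: `WorkerLocation`, `FetchTargetChooser`, and `choose` are hypothetical names, and the excluded set is modeled as a plain `Set<String>` of host:port strings rather than Celeborn's actual fetchExcludedWorkers structure.

```java
import java.util.Optional;
import java.util.Set;

/** Hypothetical stand-in for a partition location; NOT Celeborn's PartitionLocation. */
final class WorkerLocation {
  final String hostAndPort;
  final WorkerLocation peer; // replica on another worker, or null when replication is off

  WorkerLocation(String hostAndPort, WorkerLocation peer) {
    this.hostAndPort = hostAndPort;
    this.peer = peer;
  }
}

/** Hypothetical helper: picks a fetch target while skipping excluded workers. */
final class FetchTargetChooser {
  private FetchTargetChooser() {}

  /**
   * Returns the location itself if its worker is not excluded, otherwise its peer
   * if one exists and is not excluded, otherwise empty (no usable worker).
   */
  static Optional<WorkerLocation> choose(WorkerLocation location, Set<String> excludedWorkers) {
    if (!excludedWorkers.contains(location.hostAndPort)) {
      return Optional.of(location);
    }
    WorkerLocation peer = location.peer;
    if (peer != null && !excludedWorkers.contains(peer.hostAndPort)) {
      return Optional.of(peer);
    }
    return Optional.empty();
  }
}
```

Under these assumptions, a caller would consult `choose` before opening a connection and treat an empty result as "no reachable replica", so no connection is ever attempted against a worker that is already in the excluded set.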

Does this PR introduce any user-facing change?

No.

How was this patch tested?

Manually tested by debugging.

SteNicholas

@r7raul1984, thanks for the contribution. Please run dev/reformat to format.

r7raul1984 · 2025-10-23T03:55:02Z

> @r7raul1984, thanks for the contribution. Please run dev/reformat to format.

ok

RexXiong · 2025-11-04T07:44:39Z

client/src/main/java/org/apache/celeborn/client/read/CelebornInputStream.java

          fetchChunkRetryCnt++;
-          if (location.hasPeer() && !readSkewPartitionWithoutMapRange) {
+            if (location.hasPeer()
+                    && !isExcluded(location.getPeer())


It appears that isExcluded(location.getPeer()) now serves the same purpose as isExcluded(location) at line 455.

My idea is to add this here to prevent throwing a new CelebornIOException at line 456.
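The retry branch under discussion can be sketched like this. It is only an illustration of the condition shown in the diff above: `RetryDecision`, `Action`, and `onFetchFailure` are hypothetical names, the boolean parameters stand in for `location.hasPeer()`, `isExcluded(location.getPeer())`, and `readSkewPartitionWithoutMapRange`, and it assumes the truncated part of the diff keeps the original `!readSkewPartitionWithoutMapRange` check.

```java
/** Hypothetical model of the fetch-retry branch; not the actual CelebornInputStream code. */
final class RetryDecision {
  private RetryDecision() {}

  enum Action {
    SWITCH_TO_PEER,      // recreate the reader against the peer location
    RETRY_SAME_LOCATION  // keep the current location and retry
  }

  /**
   * Switch to the peer only when one exists, it is not already excluded, and the
   * read is not a skew-partition read without a map range (assumed to be kept).
   */
  static Action onFetchFailure(
      boolean hasPeer, boolean peerExcluded, boolean readSkewPartitionWithoutMapRange) {
    if (hasPeer && !peerExcluded && !readSkewPartitionWithoutMapRange) {
      return Action.SWITCH_TO_PEER;
    }
    return Action.RETRY_SAME_LOCATION;
  }
}
```

Checking the peer here, before switching, avoids creating a reader for a worker that a later excluded-worker check would reject with an exception.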

SteNicholas · 2025-11-07T02:03:50Z

@r7raul1984, any update?

r7raul1984 · 2025-11-07T02:50:52Z

> @r7raul1984, any update?

No further updates.

SteNicholas · 2025-11-17T12:28:39Z

@r7raul1984, the CI run failed. PTAL.

r7raul1984 · 2025-11-17T12:32:59Z

> PTAL

ok

r7raul1984 · 2025-11-18T02:51:57Z

> @r7raul1984, the CI run failed. PTAL.

How do I rerun these checks?

SteNicholas · 2025-11-18T14:18:40Z

@r7raul1984, you'd better close this pull request and reopen it.

r7raul1984 · 2025-11-19T06:55:16Z

> @r7raul1984, you'd better close this pull request and reopen it.

It's very strange: three checks keep failing at random.

github-actions · 2025-12-09T08:40:07Z

This PR is stale because it has been open 20 days with no activity. Remove stale label or comment or this will be closed in 10 days.

[CELEBORN-2183]Client always tries to start a connection with excluded workers

a11ceda

github-actions bot added the module:client label Oct 23, 2025

SteNicholas changed the title ~~[CELEBORN-2183]Client always tries to start a connection with exclude…~~ [CELEBORN-2183] Client always tries to start a connection with exclude wrokers Oct 23, 2025

SteNicholas requested changes Oct 23, 2025

[CELEBORN-2183]fix format problem

db03e20

r7raul1984 requested a review from SteNicholas October 23, 2025 04:42

SteNicholas changed the title ~~[CELEBORN-2183] Client always tries to start a connection with exclude wrokers~~ [CELEBORN-2183] Client always tries to start a connection with exclude workers Oct 23, 2025

SteNicholas changed the title ~~[CELEBORN-2183] Client always tries to start a connection with exclude workers~~ [CELEBORN-2183] Client always tries to start a connection with excluded workers Oct 23, 2025

SteNicholas changed the title ~~[CELEBORN-2183] Client always tries to start a connection with excluded workers~~ [CELEBORN-2183] Fix client tries to start a connection with excluded workers Oct 23, 2025

RexXiong reviewed Oct 23, 2025

client/src/main/java/org/apache/celeborn/client/read/CelebornInputStream.java (outdated, resolved)

raul.tang.contractor added 2 commits November 3, 2025 12:38

[CELEBORN-2183]Should send bufferStreamEnd first before change location to peer

e2d92be

[CELEBORN-2183]add isexcluded check before change to peer

009eb6a

RexXiong reviewed Nov 4, 2025

SteNicholas requested a review from RexXiong November 17, 2025 06:02

r7raul1984 closed this Nov 19, 2025

r7raul1984 reopened this Nov 19, 2025

[CELEBORN-2183]using mvn spotless apply to clean the code

2b0e5ba

r7raul1984 closed this Nov 19, 2025

r7raul1984 reopened this Nov 19, 2025

github-actions bot added stale and removed stale labels Dec 9, 2025
