Skip to content

Conversation

guonaihong
Copy link
Contributor

@guonaihong guonaihong commented Jun 17, 2025

  1. Problem
    There is an issue in production where when go-zero uses etcd, under certain conditions, the registered key disappears, causing frontend services to be unable to find backend services.

  2. Reproduction conditions
    Then I simulated it locally by deleting the registered key in etcd. It won't regenerate, even if the service process still exists. So I used a simple strategy for re-registration: if getting this key doesn't exist, then put it. After using the new strategy, deleting this key will regenerate it.

(Since the time spent handling this was relatively short, only 30 minutes, the PR may not be very complete. Welcome to discuss with the author)

Copy link

codecov bot commented Jun 17, 2025

Codecov Report

❌ Patch coverage is 62.50000% with 6 lines in your changes missing coverage. Please review.

Files with missing lines Patch % Lines
core/discov/publisher.go 62.50% 5 Missing and 1 partial ⚠️

📢 Thoughts on this report? Let us know!

@zcong1993
Copy link
Member

这种极端情况确实存在,需要重新注册服务

Copy link
Member

@zcong1993 zcong1993 left a comment

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

希望pr能够补充相应单测

@@ -125,6 +125,8 @@ func (p *Publisher) keepAliveAsync(cli internal.EtcdClient) error {
}

threading.GoSafe(func() {
ticker := time.NewTicker(5 * time.Second)
Copy link
Member

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

这里可以调整到30s,兜底check不需要这么频繁

@@ -135,6 +137,16 @@ func (p *Publisher) keepAliveAsync(cli internal.EtcdClient) error {
}
return
}
case <-ticker.C:
resp, err := cli.Get(cli.Ctx(), p.fullKey)
Copy link
Member

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

这部分可以复用 doKeepAlive 的逻辑,感觉可以写成:

				resp, err := cli.Get(cli.Ctx(), p.fullKey)
				if err != nil {
					logc.Errorf(cli.Ctx(), "etcd publisher check register key error: %s", err.Error())
					continue
				}
				if len(resp.Kvs) == 0 {
					logc.Errorf(cli.Ctx(), "etcd publisher key %s not found, will re-register", p.fullKey)
					if err := p.doKeepAlive(); err != nil {
						logc.Errorf(cli.Ctx(), "etcd publisher KeepAlive: %s", err.Error())
					}
					return
				}

@guonaihong
Copy link
Contributor Author

这两天太忙了,周六或者周日我再完善下。

@kevwan kevwan changed the title fix: etcd的key消失,不能自动重新注册的问题 fix: Issue with etcd key disappearing and unable to auto-re-register Jun 19, 2025
@guonaihong
Copy link
Contributor Author

优化了下,从定时器检查改为watch这个key,如果收到delete事件就会重新注册。

@kevwan kevwan force-pushed the fix-etcd-register branch from f979851 to 04fe8dd Compare July 12, 2025 16:06
@kevwan kevwan self-assigned this Jul 12, 2025
@kevwan
Copy link
Contributor

kevwan commented Jul 12, 2025

Thanks for your contribution!

Will be merged in v1.9.0, which will be released in August.

@kevwan kevwan added this to the v1.9.0 milestone Jul 13, 2025
@kevwan kevwan force-pushed the fix-etcd-register branch 2 times, most recently from 1411476 to 7dc1613 Compare July 18, 2025 15:12
@kevwan kevwan force-pushed the fix-etcd-register branch from 7dc1613 to e833031 Compare August 1, 2025 13:30
@kevwan kevwan force-pushed the fix-etcd-register branch from e833031 to 585d307 Compare August 1, 2025 14:48
Sign up for free to join this conversation on GitHub. Already have an account? Sign in to comment
Labels
None yet
Projects
None yet
Development

Successfully merging this pull request may close these issues.

3 participants