gmlwns2000 (Collaborator)

Watchdog (Single Node) Command

BSA_K=32 \
BSA_EXACT_K=32 \
BSA_BLOCK_K=64 \
HIP_DEBUG_DELTA_QSA=1 \
HIP_DEBUG_RECOMPUTE_SPLIT=0 \
TRITON_PRINT_AUTOTUNING=1 \
SRT_WARMUP_ALL_SEQ_LENS=0 \
HIP_DEBUG_FA3_MIXING_LEN=0 \
PASSKEY_DECODE_LEN=128 \
PASSKEY_LEN=150 \
SA_BLOCK_SIZE=128 \
SA_DECODE_BLOCK_SIZE=128 \
HIP_DISABLE_AUTOTUNE=0 \
HIP_DEBUG=0 \
HIP_DEBUG_BENCH=0 \
HIP_DEBUG_CAPTURE_DECORATOR=1 \
CUDA_LAUNCH_BLOCKING=0 \
CUDA_VISIBLE_DEVICES=0,1,2,3,4,5,6,7 \
$(which python) -m hip_attn.utils.sglang_watchdog \
-- \
--host 0.0.0.0 \
--port 8000 \
--model-path Qwen/Qwen3-235B-A22B-Instruct-2507-FP8 \
--kv-cache-dtype auto \
--ep-size 8 \
--tp-size 8 \
--chunked-prefill-size 65536 \
--max-prefill-tokens 65536 \
--cuda-graph-bs 1 2 4 8 16 24 32 48 64 96 128 160 192 256 \
--context-length 256000 \
--max-total-tokens 256000 \
--attention-backend hip_attention \
--hip-attention-config ./configs/mixed_landmark_0814_no_extend_qsa.json \
--hip-attention-config-override-json '{"__seq_thresh_fa3": 65536}' \
--json-model-override-args '{"rope_scaling":{"rope_type":"yarn","factor":1.0,"original_max_position_embeddings":262144}, "max_position_embeddings": 262144}' \
--max-running-requests 64 \
--trust-remote-code \
--tool-call-parser qwen25
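
Once the worker is up, a quick smoke test can be run against it directly. This is a minimal sketch, assuming the watchdog fronts a standard SGLang server, which exposes a /health endpoint and the OpenAI-compatible /v1/chat/completions endpoint on the serving port by default:

# Liveness check against the worker
curl -s http://localhost:8000/health

# Minimal chat completion through the OpenAI-compatible API
curl -s http://localhost:8000/v1/chat/completions \
    -H 'Content-Type: application/json' \
    -d '{"model": "Qwen/Qwen3-235B-A22B-Instruct-2507-FP8",
         "messages": [{"role": "user", "content": "Say hello."}],
         "max_tokens": 16}'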

Router Command

./target/release/sglang-router \
    --host 0.0.0.0 --port 30000 \
    --worker-urls \
        http://h100-80-1:8000 \
        http://h100-80-2:8000 \
    --prometheus-host 0.0.0.0 \
    --prometheus-port 9000
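
To verify routing, the same API can be exercised through the router port instead of a worker port; the router forwards requests across the two workers listed above. The paths below are assumptions based on standard conventions (the router proxies the workers' API, and Prometheus metrics are served at /metrics on the configured port):

# Routed request: identical payload, but sent to the router on port 30000
curl -s http://localhost:30000/v1/chat/completions \
    -H 'Content-Type: application/json' \
    -d '{"model": "Qwen/Qwen3-235B-A22B-Instruct-2507-FP8",
         "messages": [{"role": "user", "content": "ping"}],
         "max_tokens": 8}'

# Scrape router metrics (standard Prometheus /metrics path)
curl -s http://localhost:9000/metrics | head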
