Swarm balancing logic issues #389

@fadenb

Hey 👋,

I am opening this issue to discuss the current swarm balancing approach.

Recently I have seen that the public swarm hosting enoch/llama-65b-hf is unbalanced.
This by itself is neither a surprise nor a problem. The imbalance is then remediated by the server loading other blocks. All good so far.

Today I noticed that my server reloaded the same blocks it was already serving. Because loading is quite slow (often around 10 minutes), this removes the server's compute capacity from the swarm for that time without providing any benefit.

A log excerpt might explain the situation better:
Notice that blocks 60 through 79 are loaded initially, and then exactly the same blocks 60 through 79 are loaded again when the server rebalances.

Jul 20 12:08:47.880 [INFO] Make sure you follow the LLaMA's terms of use: https://bit.ly/llama2-license for LLaMA 2, https://bit.ly/llama-license for LLaMA 1
Jul 20 12:08:47.880 [INFO] Using DHT prefix: llama-65b-hf
Jul 20 12:08:57.909 [INFO] This server is accessible directly
Jul 20 12:09:02.623 [INFO] Connecting to the public swarm
Jul 20 12:09:02.624 [INFO] Running a server on ['/ip4/172.17.0.2/tcp/31330/p2p/12D3KooWFS61Xw7XJksfwDg6tYdBAXYuChTCkZTwqxqqWQdHFAQf', '/ip4/127.0.0.1/tcp/31330/p2p/12D3KooWFS61Xw7XJksfwDg6tYdBAXYuChTCkZTwqxqqWQdHFAQf', '/ip4/147.189.193.61/tcp/31330/p2p/12D3KooWFS61Xw7XJksfwDg6tYdBAXYuChTCkZTwqxqqWQdHFAQf']
Jul 20 12:09:02.646 [INFO] Model weights are loaded in float16, quantized to nf4 format
Jul 20 12:09:02.647 [INFO] Attention cache for all blocks will consume up to 1.25 GiB
Jul 20 12:09:02.648 [INFO] Loading throughput info
Jul 20 12:09:02.684 [INFO] Reporting throughput: 2203.3 RPS for 20 blocks
Jul 20 12:09:04.430 [INFO] Reachability service started
Jul 20 12:09:08.345 [INFO] Announced that blocks [60, 61, 62, 63, 64, 65, 66, 67, 68, 69, 70, 71, 72, 73, 74, 75, 76, 77, 78, 79] are joining
Jul 20 12:09:15.051 [INFO] Loaded enoch/llama-65b-hf block 60, <All keys matched successfully>
Downloading (…)/adapter_config.json: 100%|██████████| 425/425 [00:00<00:00, 2.09MB/s]
Downloading (…)er_model.safetensors: 100%|██████████| 3.20G/3.20G [00:54<00:00, 58.3MB/s]
Jul 20 12:10:36.878 [INFO] Adapter timdettmers/guanaco-65b has dropout enabled, this server will disable dropout
Jul 20 12:10:37.081 [INFO] Loaded adapter timdettmers/guanaco-65b for block 60
Jul 20 12:10:44.745 [INFO] Loaded enoch/llama-65b-hf block 61, <All keys matched successfully>
Jul 20 12:11:08.242 [INFO] Adapter timdettmers/guanaco-65b has dropout enabled, this server will disable dropout
Jul 20 12:11:08.441 [INFO] Loaded adapter timdettmers/guanaco-65b for block 61
Jul 20 12:11:16.205 [INFO] Loaded enoch/llama-65b-hf block 62, <All keys matched successfully>
Jul 20 12:11:38.475 [INFO] Adapter timdettmers/guanaco-65b has dropout enabled, this server will disable dropout
Jul 20 12:11:38.669 [INFO] Loaded adapter timdettmers/guanaco-65b for block 62
Jul 20 12:11:45.308 [INFO] Loaded enoch/llama-65b-hf block 63, <All keys matched successfully>
Jul 20 12:12:08.372 [INFO] Adapter timdettmers/guanaco-65b has dropout enabled, this server will disable dropout
Jul 20 12:12:08.595 [INFO] Loaded adapter timdettmers/guanaco-65b for block 63
Jul 20 12:12:17.520 [INFO] Loaded enoch/llama-65b-hf block 64, <All keys matched successfully>
Jul 20 12:12:40.703 [INFO] Adapter timdettmers/guanaco-65b has dropout enabled, this server will disable dropout
Jul 20 12:12:41.066 [INFO] Loaded adapter timdettmers/guanaco-65b for block 64
Jul 20 12:12:48.411 [INFO] Loaded enoch/llama-65b-hf block 65, <All keys matched successfully>
Jul 20 12:12:59.529 [INFO] reachability.rpc_check(remote_peer=...ZFKwzs, check_peer=...ZFKwzs) -> False
Jul 20 12:13:11.434 [INFO] Adapter timdettmers/guanaco-65b has dropout enabled, this server will disable dropout
Jul 20 12:13:11.927 [INFO] Loaded adapter timdettmers/guanaco-65b for block 65
Jul 20 12:13:19.812 [INFO] Loaded enoch/llama-65b-hf block 66, <All keys matched successfully>
Jul 20 12:13:43.257 [INFO] Adapter timdettmers/guanaco-65b has dropout enabled, this server will disable dropout
Jul 20 12:13:43.823 [INFO] Loaded adapter timdettmers/guanaco-65b for block 66
Jul 20 12:13:51.392 [INFO] Loaded enoch/llama-65b-hf block 67, <All keys matched successfully>
Jul 20 12:14:16.225 [INFO] Adapter timdettmers/guanaco-65b has dropout enabled, this server will disable dropout
Jul 20 12:14:16.776 [INFO] Loaded adapter timdettmers/guanaco-65b for block 67
Jul 20 12:14:25.466 [INFO] Loaded enoch/llama-65b-hf block 68, <All keys matched successfully>
Jul 20 12:14:49.068 [INFO] Adapter timdettmers/guanaco-65b has dropout enabled, this server will disable dropout
Jul 20 12:14:49.586 [INFO] Loaded adapter timdettmers/guanaco-65b for block 68
Jul 20 12:14:57.751 [INFO] Loaded enoch/llama-65b-hf block 69, <All keys matched successfully>
Jul 20 12:15:20.843 [INFO] Adapter timdettmers/guanaco-65b has dropout enabled, this server will disable dropout
Jul 20 12:15:21.370 [INFO] Loaded adapter timdettmers/guanaco-65b for block 69
Jul 20 12:15:34.991 [INFO] Loaded enoch/llama-65b-hf block 70, <All keys matched successfully>
Jul 20 12:15:57.221 [INFO] Adapter timdettmers/guanaco-65b has dropout enabled, this server will disable dropout
Jul 20 12:15:57.713 [INFO] Loaded adapter timdettmers/guanaco-65b for block 70
Jul 20 12:16:08.368 [INFO] Loaded enoch/llama-65b-hf block 71, <All keys matched successfully>
Jul 20 12:16:29.393 [INFO] Adapter timdettmers/guanaco-65b has dropout enabled, this server will disable dropout
Jul 20 12:16:29.884 [INFO] Loaded adapter timdettmers/guanaco-65b for block 71
Jul 20 12:16:36.503 [INFO] Loaded enoch/llama-65b-hf block 72, <All keys matched successfully>
Jul 20 12:16:57.748 [INFO] Adapter timdettmers/guanaco-65b has dropout enabled, this server will disable dropout
Jul 20 12:16:58.263 [INFO] Loaded adapter timdettmers/guanaco-65b for block 72
Jul 20 12:17:05.251 [INFO] Loaded enoch/llama-65b-hf block 73, <All keys matched successfully>
Jul 20 12:17:26.114 [INFO] Adapter timdettmers/guanaco-65b has dropout enabled, this server will disable dropout
Jul 20 12:17:26.605 [INFO] Loaded adapter timdettmers/guanaco-65b for block 73
Jul 20 12:17:33.660 [INFO] Loaded enoch/llama-65b-hf block 74, <All keys matched successfully>
Jul 20 12:17:54.764 [INFO] Adapter timdettmers/guanaco-65b has dropout enabled, this server will disable dropout
Jul 20 12:17:55.280 [INFO] Loaded adapter timdettmers/guanaco-65b for block 74
Jul 20 12:18:02.302 [INFO] Loaded enoch/llama-65b-hf block 75, <All keys matched successfully>
Jul 20 12:18:23.076 [INFO] Adapter timdettmers/guanaco-65b has dropout enabled, this server will disable dropout
Jul 20 12:18:23.551 [INFO] Loaded adapter timdettmers/guanaco-65b for block 75
Jul 20 12:18:30.137 [INFO] Loaded enoch/llama-65b-hf block 76, <All keys matched successfully>
Jul 20 12:18:50.908 [INFO] Adapter timdettmers/guanaco-65b has dropout enabled, this server will disable dropout
Jul 20 12:18:51.420 [INFO] Loaded adapter timdettmers/guanaco-65b for block 76
Jul 20 12:18:57.203 [INFO] Loaded enoch/llama-65b-hf block 77, <All keys matched successfully>
Jul 20 12:19:17.972 [INFO] Adapter timdettmers/guanaco-65b has dropout enabled, this server will disable dropout
Jul 20 12:19:18.472 [INFO] Loaded adapter timdettmers/guanaco-65b for block 77
Jul 20 12:19:23.977 [INFO] Loaded enoch/llama-65b-hf block 78, <All keys matched successfully>
Jul 20 12:19:44.690 [INFO] Adapter timdettmers/guanaco-65b has dropout enabled, this server will disable dropout
Jul 20 12:19:45.199 [INFO] Loaded adapter timdettmers/guanaco-65b for block 78
Jul 20 12:19:50.305 [INFO] Loaded enoch/llama-65b-hf block 79, <All keys matched successfully>
Jul 20 12:20:11.381 [INFO] Adapter timdettmers/guanaco-65b has dropout enabled, this server will disable dropout
Jul 20 12:20:11.894 [INFO] Loaded adapter timdettmers/guanaco-65b for block 79
Jul 20 12:20:11.962 [WARN] [petals.server.reachability.validate_reachability:40] Skipping reachability check because health.petals.ml is down: ConnectionError(MaxRetryError("HTTPConnectionPool(host='health.petals.ml', port=80): Max retries exceeded with url: /api/v1/is_reachable/12D3KooWFS61Xw7XJksfwDg6tYdBAXYuChTCkZTwqxqqWQdHFAQf (Caused by NewConnectionError('<urllib3.connection.HTTPConnection object at 0x7fbf09f084f0>: Failed to establish a new connection: [Errno -2] Name or service not known'))"))
Jul 20 12:20:14.168 [INFO] Started
Jul 20 12:26:02.132 [INFO] Swarm balance quality: 65.3%
Jul 20 12:26:02.133 [INFO] Swarm is imbalanced, server will load other blocks
Jul 20 12:26:03.947 [INFO] Announced that blocks ['llama-65b-hf.60', 'llama-65b-hf.61', 'llama-65b-hf.62', 'llama-65b-hf.63', 'llama-65b-hf.64', 'llama-65b-hf.65', 'llama-65b-hf.66', 'llama-65b-hf.67', 'llama-65b-hf.68', 'llama-65b-hf.69', 'llama-65b-hf.70', 'llama-65b-hf.71', 'llama-65b-hf.72', 'llama-65b-hf.73', 'llama-65b-hf.74', 'llama-65b-hf.75', 'llama-65b-hf.76', 'llama-65b-hf.77', 'llama-65b-hf.78', 'llama-65b-hf.79'] are offline
Jul 20 12:26:06.251 [INFO] Shutting down
Jul 20 12:26:06.266 [INFO] Module container shut down successfully
Jul 20 12:26:06.492 [INFO] Cleaning up, left 0.3 GiB allocated memory, 6.3 GiB reserved memory
Jul 20 12:26:12.177 [INFO] Announced that blocks [60, 61, 62, 63, 64, 65, 66, 67, 68, 69, 70, 71, 72, 73, 74, 75, 76, 77, 78, 79] are joining
Jul 20 12:26:19.559 [INFO] Loaded enoch/llama-65b-hf block 60, <All keys matched successfully>
Jul 20 12:26:41.387 [INFO] Adapter timdettmers/guanaco-65b has dropout enabled, this server will disable dropout
Jul 20 12:26:41.927 [INFO] Loaded adapter timdettmers/guanaco-65b for block 60
Jul 20 12:26:49.273 [INFO] Loaded enoch/llama-65b-hf block 61, <All keys matched successfully>
Jul 20 12:27:13.392 [INFO] Adapter timdettmers/guanaco-65b has dropout enabled, this server will disable dropout
Jul 20 12:27:13.971 [INFO] Loaded adapter timdettmers/guanaco-65b for block 61
Jul 20 12:27:21.899 [INFO] Loaded enoch/llama-65b-hf block 62, <All keys matched successfully>
Jul 20 12:27:43.149 [INFO] Adapter timdettmers/guanaco-65b has dropout enabled, this server will disable dropout
Jul 20 12:27:43.671 [INFO] Loaded adapter timdettmers/guanaco-65b for block 62
Jul 20 12:27:50.241 [INFO] Loaded enoch/llama-65b-hf block 63, <All keys matched successfully>
Jul 20 12:28:11.106 [INFO] Adapter timdettmers/guanaco-65b has dropout enabled, this server will disable dropout
Jul 20 12:28:11.609 [INFO] Loaded adapter timdettmers/guanaco-65b for block 63
Jul 20 12:28:18.728 [INFO] Loaded enoch/llama-65b-hf block 64, <All keys matched successfully>
Jul 20 12:28:40.008 [INFO] Adapter timdettmers/guanaco-65b has dropout enabled, this server will disable dropout
Jul 20 12:28:40.396 [INFO] Loaded adapter timdettmers/guanaco-65b for block 64
Jul 20 12:28:48.484 [INFO] Loaded enoch/llama-65b-hf block 65, <All keys matched successfully>
Jul 20 12:29:09.470 [INFO] Adapter timdettmers/guanaco-65b has dropout enabled, this server will disable dropout

While this is an extreme example of the problem, I have seen (more often) that parts of the block lists overlap. In such cases, the overlapping blocks are still loaded from scratch instead of being reused.
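
To make the overlap case concrete, here is a minimal sketch of how the old and new block lists could be compared so that only the difference is loaded. `plan_block_changes` is a hypothetical helper written for this discussion; it is not part of Petals and says nothing about how the server currently manages its loaded modules.

```python
from typing import Iterable, List, Tuple


def plan_block_changes(
    current: Iterable[int], target: Iterable[int]
) -> Tuple[List[int], List[int], List[int]]:
    """Split a rebalancing step into blocks to keep, load, and unload.

    Hypothetical helper: block indices are plain integers here.
    """
    current_set, target_set = set(current), set(target)
    keep = sorted(current_set & target_set)    # already in memory, reusable
    load = sorted(target_set - current_set)    # genuinely new blocks
    unload = sorted(current_set - target_set)  # no longer needed
    return keep, load, unload


# Partial overlap: the server holds 60..79 and is asked to serve 50..69.
keep, load, unload = plan_block_changes(range(60, 80), range(50, 70))
print(f"keep {len(keep)}, load {len(load)}, unload {len(unload)}")
```

In the extreme case from the log above, where both lists are 60 through 79, `load` and `unload` come back empty, so there would be nothing to do at all.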

Are there any obvious fixes for this behavior besides adjusting the --balance_quality setting or pinning blocks?
Should we reorder the actions so that the new blocks will be selected before the decision is made to unload the blocks?
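
A rough sketch of the reordering I have in mind, assuming the server can compute its candidate blocks without first shutting anything down. All names here (`maybe_rebalance`, `pick_best_blocks`, `reload_blocks`) are hypothetical placeholders for illustration, not the actual Petals functions.

```python
from typing import Callable, List, Sequence


def maybe_rebalance(
    current_blocks: Sequence[int],
    pick_best_blocks: Callable[[], Sequence[int]],
    reload_blocks: Callable[[List[int]], None],
) -> bool:
    """Select the new blocks first; only unload and reload if they differ.

    Returns True if a reload was triggered, False if it was skipped.
    """
    candidate = sorted(pick_best_blocks())
    if candidate == sorted(current_blocks):
        # Same blocks as before: going offline would gain nothing.
        return False
    reload_blocks(candidate)
    return True


# The situation from the log: the selection returns the blocks already served.
reloads: List[List[int]] = []
changed = maybe_rebalance(
    list(range(60, 80)), lambda: list(range(60, 80)), reloads.append
)
print(changed, reloads)  # False []
```

With this order, the server in the log would have stayed online instead of spending another 10 minutes reloading blocks 60 through 79.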
