Run Petals server on Windows

Alexander Borzunov edited this page Jul 22, 2023 · 27 revisions

Petals doesn't support Windows natively at the moment, so you have to use WSL and/or Docker. In this guide, we'll show how to set up Petals on WSL.

Tutorial

  1. In a Windows console run as administrator, install WSL:

    wsl --install
  2. Open WSL and check that your GPUs are visible:

    nvidia-smi
  3. In WSL, install the basic Python tooling:

    sudo apt update
    sudo apt install python3-pip python-is-python3
  4. Then, install Petals:

    python -m pip install git+https://github.com/bigscience-workshop/petals
  5. Run the Petals server:

    python -m petals.cli.run_server enoch/llama-65b-hf --adapters timdettmers/guanaco-65b

    This will host a part of LLaMA-65B with optional Guanaco adapters on your machine. You can also host meta-llama/Llama-2-70b-hf, meta-llama/Llama-2-70b-chat-hf, bigscience/bloom, bigscience/bloomz, and other compatible models from 🤗 Model Hub, or add support for new model architectures.

    If you want to share multiple GPUs, run a separate Petals server for each one. Open a separate WSL console per GPU, then run this in the first console:

    CUDA_VISIBLE_DEVICES=0 python -m petals.cli.run_server enoch/llama-65b-hf --adapters timdettmers/guanaco-65b

    Do the same for each console, replacing CUDA_VISIBLE_DEVICES=0 with CUDA_VISIBLE_DEVICES=1, CUDA_VISIBLE_DEVICES=2, etc.

  6. Once all blocks are loaded, check that your server is listed as available at https://health.petals.dev/
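
The per-GPU launch commands from step 5 can be sketched as a small loop. This is a dry run that only prints one command line per GPU (the model and adapter names are the ones used in this tutorial); to actually start the servers, run each printed line in its own WSL console, or drop the `echo` and append `&`:

```shell
# Print one Petals server launch command per GPU (dry run).
# Each server gets its own GPU via CUDA_VISIBLE_DEVICES.
launch_commands() {
  local n_gpus=$1
  for i in $(seq 0 $((n_gpus - 1))); do
    echo "CUDA_VISIBLE_DEVICES=$i python -m petals.cli.run_server enoch/llama-65b-hf --adapters timdettmers/guanaco-65b"
  done
}

launch_commands 2
```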

Making a server directly available

Petals will use NAT traversal via relays by default, but you can make it available directly if your computer has a public IP address.

  1. In WSL, find out the IP address of your WSL container (it usually looks like 172.X.X.X):

    sudo apt install net-tools
    ifconfig
  2. In a Windows console run as administrator, allow traffic to be routed into the WSL container (replace 172.X.X.X with the IP address you found in the previous step):

    netsh interface portproxy add v4tov4 listenport=31330 listenaddress=0.0.0.0 connectport=31330 connectaddress=172.X.X.X
  3. Set up your firewall (e.g., Windows Defender) to allow inbound traffic from the outside world on port 31330/tcp.

  4. If you have a router, set it up to forward incoming connections from the outside world (port 31330/tcp) to your computer (port 31330/tcp).

  5. Run the Petals server with the --port 31330 argument:

    python -m petals.cli.run_server enoch/llama-65b-hf --adapters timdettmers/guanaco-65b --port 31330
  6. Ensure that the server prints This server is available directly (not via relays) after startup.
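
As an alternative to installing net-tools, steps 1-2 can be scripted from WSL. This is a minimal sketch, assuming `hostname -I` is available (it is on Ubuntu-based WSL): it discovers the WSL IP and prints the `netsh` command for you to paste into an administrator Windows console:

```shell
# Build the Windows-side port-proxy command from the WSL IP address.
# PORT must match the --port value passed to the Petals server.
PORT=31330
WSL_IP=$(hostname -I | awk '{print $1}')
PROXY_CMD="netsh interface portproxy add v4tov4 listenport=$PORT listenaddress=0.0.0.0 connectport=$PORT connectaddress=$WSL_IP"
echo "$PROXY_CMD"
```

Note that the WSL IP address can change between reboots, so you may need to re-run the `netsh` command after restarting.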

Troubleshooting

  1. I get this error: hivemind.dht.protocol.ValidationError: local time must be within 3 seconds of others on WSL. What should I do?

    Petals needs the clocks on all nodes to be synchronized. Set the date from an NTP server: sudo ntpdate pool.ntp.org (if the command is missing, install it first with sudo apt install ntpdate)

  2. I get this error: torch.cuda.OutOfMemoryError: CUDA out of memory. What should I do?

    If you use an Anaconda env, run this before starting the server:

    export PYTORCH_CUDA_ALLOC_CONF=max_split_size_mb:128

    If you use Docker, add this argument after --rm in the Docker command:

    -e "PYTORCH_CUDA_ALLOC_CONF=max_split_size_mb:128"
