29 changes: 29 additions & 0 deletions docs/source/en/guides/cli.md
@@ -916,3 +916,32 @@ Manage scheduled jobs using
# Delete a scheduled job
>>> hf jobs scheduled delete <scheduled_job_id>
```

## hf inference-endpoints

Use `hf inference-endpoints` to list, deploy, describe, and manage Inference Endpoints directly from the terminal.

```bash
# Lists endpoints in your namespace
>>> hf inference-endpoints list

# Deploy an endpoint from the Model Catalog
>>> hf inference-endpoints deploy catalog --repo openai/gpt-oss-120b --name my-endpoint

# Deploy an endpoint from the Hugging Face Hub
>>> hf inference-endpoints deploy hub my-endpoint --repo gpt2 --framework pytorch --accelerator cpu --instance-size x2 --instance-type intel-icl

# Show status and metadata
>>> hf inference-endpoints describe my-endpoint

# Pause the endpoint
>>> hf inference-endpoints pause my-endpoint

# Delete without confirmation prompt
>>> hf inference-endpoints delete my-endpoint --yes
```

> [!TIP]
> Add `--namespace` to target an organization and `--token` to override authentication.
43 changes: 43 additions & 0 deletions docs/source/en/guides/inference_endpoints.md
@@ -33,6 +33,13 @@ The first step is to create an Inference Endpoint using [`create_inference_endpo
... )
```

Or via CLI:

```bash
hf inference-endpoints deploy hub my-endpoint-name --repo gpt2 --framework pytorch --accelerator cpu --vendor aws --region us-east-1 --instance-size x2 --instance-type intel-icl --task text-generation
```

In this example, we created a `protected` Inference Endpoint named `"my-endpoint-name"` to serve [gpt2](https://huggingface.co/gpt2) for `text-generation`. A `protected` Inference Endpoint means your token is required to access the API. We also need to provide additional information to configure the hardware requirements, such as vendor, region, accelerator, instance type, and size. You can check out the list of available resources [here](https://api.endpoints.huggingface.cloud/#/v2%3A%3Aprovider/list_vendors). Alternatively, you can create an Inference Endpoint manually using the [Web interface](https://ui.endpoints.huggingface.co/new) for convenience. Refer to this [guide](https://huggingface.co/docs/inference-endpoints/guides/advanced) for details on advanced settings and their usage.

The value returned by [`create_inference_endpoint`] is an [`InferenceEndpoint`] object:
@@ -42,6 +49,12 @@ The value returned by [`create_inference_endpoint`] is an [`InferenceEndpoint`]
InferenceEndpoint(name='my-endpoint-name', namespace='Wauplin', repository='gpt2', status='pending', url=None)
```

Or via CLI:

```bash
hf inference-endpoints describe my-endpoint-name
```

It's a dataclass that holds information about the endpoint. You can access important attributes such as `name`, `repository`, `status`, `task`, `created_at`, `updated_at`, etc. If you need it, you can also access the raw response from the server with `endpoint.raw`.
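
As a quick, minimal sketch (assuming an endpoint fetched with [`get_inference_endpoint`]; the printed values are only illustrative):

```py
>>> from huggingface_hub import get_inference_endpoint

>>> endpoint = get_inference_endpoint("my-endpoint-name")
>>> endpoint.name
'my-endpoint-name'
>>> endpoint.repository
'gpt2'
>>> endpoint.task
'text-generation'
>>> endpoint.raw  # raw response from the server
{...}
```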

Once your Inference Endpoint is created, you can find it on your [personal dashboard](https://ui.endpoints.huggingface.co/).
@@ -101,6 +114,14 @@ InferenceEndpoint(name='my-endpoint-name', namespace='Wauplin', repository='gpt2
[InferenceEndpoint(name='aws-starchat-beta', namespace='huggingface', repository='HuggingFaceH4/starchat-beta', status='paused', url=None), ...]
```

Or via CLI:

```bash
hf inference-endpoints describe my-endpoint-name
hf inference-endpoints list --namespace huggingface
hf inference-endpoints list --namespace '*'
```

## Check deployment status

In the rest of this guide, we will assume that we have a [`InferenceEndpoint`] object called `endpoint`. You might have noticed that the endpoint has a `status` attribute of type [`InferenceEndpointStatus`]. When the Inference Endpoint is deployed and accessible, the status should be `"running"` and the `url` attribute is set:
@@ -117,6 +138,12 @@ Before reaching a `"running"` state, the Inference Endpoint typically goes throu
InferenceEndpoint(name='my-endpoint-name', namespace='Wauplin', repository='gpt2', status='pending', url=None)
```

Or via CLI:

```bash
hf inference-endpoints describe my-endpoint-name
```

Instead of fetching the Inference Endpoint status while waiting for it to run, you can directly call [`~InferenceEndpoint.wait`]. This helper takes a `timeout` and a `fetch_every` parameter (in seconds) as input and blocks the thread until the Inference Endpoint is deployed. The default values are `None` (no timeout) and `5` seconds, respectively.
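
As a minimal sketch (the timeout and polling values below are only examples), such a call could look like this:

```py
>>> from huggingface_hub import get_inference_endpoint

>>> endpoint = get_inference_endpoint("my-endpoint-name")
>>> endpoint.wait(timeout=300, fetch_every=10)  # poll every 10 seconds, raise an error if not running after 5 minutes
InferenceEndpoint(name='my-endpoint-name', namespace='Wauplin', repository='gpt2', status='running', url='https://...')
```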

```py
@@ -189,6 +216,14 @@ InferenceEndpoint(name='my-endpoint-name', namespace='Wauplin', repository='gpt2'
# Endpoint is not 'running' but still has a URL and will restart on first call.
```

Or via CLI:

```bash
hf inference-endpoints pause my-endpoint-name
hf inference-endpoints resume my-endpoint-name
hf inference-endpoints scale-to-zero my-endpoint-name
```

### Update model or hardware requirements

In some cases, you might also want to update your Inference Endpoint without creating a new one. You can either update the hosted model or the hardware requirements to run the model. You can do this using [`~InferenceEndpoint.update`]:
@@ -207,6 +242,14 @@ InferenceEndpoint(name='my-endpoint-name', namespace='Wauplin', repository='gpt2
InferenceEndpoint(name='my-endpoint-name', namespace='Wauplin', repository='gpt2-large', status='pending', url=None)
```

Or via CLI:

```bash
hf inference-endpoints update my-endpoint-name --repo gpt2-large
hf inference-endpoints update my-endpoint-name --min-replica 2 --max-replica 6
hf inference-endpoints update my-endpoint-name --accelerator cpu --instance-size x4 --instance-type intel-icl
```

### Delete the endpoint

Finally, if you won't use the Inference Endpoint anymore, you can simply call [`~InferenceEndpoint.delete()`].
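
A minimal sketch, assuming `endpoint` is the [`InferenceEndpoint`] object from the previous steps:

```py
# Permanently delete the endpoint
>>> endpoint.delete()
```
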
3 changes: 0 additions & 3 deletions src/huggingface_hub/cli/download.py
@@ -49,9 +49,6 @@
from ._cli_utils import RepoIdArg, RepoTypeOpt, RevisionOpt, TokenOpt


logger = logging.get_logger(__name__)


def download(
repo_id: RepoIdArg,
filenames: Annotated[
2 changes: 2 additions & 0 deletions src/huggingface_hub/cli/hf.py
@@ -17,6 +17,7 @@
from huggingface_hub.cli.auth import auth_cli
from huggingface_hub.cli.cache import cache_cli
from huggingface_hub.cli.download import download
from huggingface_hub.cli.inference_endpoints import app as inference_endpoints_cli
from huggingface_hub.cli.jobs import jobs_cli
from huggingface_hub.cli.lfs import lfs_enable_largefiles, lfs_multipart_upload
from huggingface_hub.cli.repo import repo_cli
@@ -48,6 +49,7 @@
app.add_typer(repo_cli, name="repo")
app.add_typer(repo_files_cli, name="repo-files")
app.add_typer(jobs_cli, name="jobs")
app.add_typer(inference_endpoints_cli, name="inference-endpoints")
Contributor

Suggested change
app.add_typer(inference_endpoints_cli, name="inference-endpoints")
app.add_typer(inference_endpoints_cli, name="endpoints")

What do you think of renaming everything to `hf endpoints`? It is slightly less explicit but much simpler to type and copy-paste IMO



def main():