
Commit a9e166b

deliahu authored and gitbook-bot committed
GitBook: [0.22] 60 pages modified
1 parent 60a504f commit a9e166b


60 files changed (+1984 / -508 lines)

README.md

Lines changed: 7 additions & 14 deletions
@@ -2,8 +2,6 @@

Cortex is an open source platform for deploying, managing, and scaling machine learning in production.

-<br>
-
## Model serving infrastructure

* Supports deploying TensorFlow, PyTorch, sklearn and other models as realtime or batch APIs
@@ -12,7 +10,7 @@ Cortex is an open source platform for deploying, managing, and scaling machine l
* Runs inference on spot instances with on-demand backups
* Manages traffic splitting for A/B testing

-#### Configure your cluster:
+### Configure your cluster:

```yaml
# cluster.yaml
@@ -26,7 +24,7 @@ max_instances: 100
spot: true
```

-#### Spin up your cluster on your AWS account:
+### Spin up your cluster on your AWS account:

```text
$ cortex cluster up --config cluster.yaml
@@ -39,16 +37,14 @@ $ cortex cluster up --config cluster.yaml
cortex is ready!
```

-<br>
-
## Reproducible model deployments

* Implement request handling in Python
* Customize compute, autoscaling, and networking for each API
* Package dependencies, code, and configuration for reproducible deployments
* Test locally before deploying to your cluster

-#### Implement a predictor:
+### Implement a predictor:

```python
# predictor.py
@@ -63,7 +59,7 @@ class PythonPredictor:
        return self.model(payload["text"])[0]
```

-#### Configure an API:
+### Configure an API:

```yaml
# cortex.yaml
@@ -82,7 +78,7 @@ networking:
  api_gateway: public
```

-#### Deploy to production:
+### Deploy to production:

```text
$ cortex deploy cortex.yaml
@@ -96,16 +92,14 @@ $ curl https://example.com/text-generator \
  "deploy machine learning models to production"
```

-<br>
-
## API management

* Monitor API performance
* Aggregate and stream logs
* Customize prediction tracking
* Update APIs without downtime

-#### Manage your APIs:
+### Manage your APIs:

```text
$ cortex get
@@ -121,12 +115,11 @@ batch api running jobs last update
image-classifier     5            10h
```

-<br>
-
## Get started

```text
$ pip install cortex
```

See the [installation guide](https://docs.cortex.dev/install) for next steps.

docs/advanced/compute.md

Lines changed: 35 additions & 0 deletions
@@ -0,0 +1,35 @@
# Compute

Compute resource requests in Cortex follow the syntax and meaning of [compute resources in Kubernetes](https://kubernetes.io/docs/concepts/configuration/manage-compute-resources-container).

For example:

```yaml
- name: my-api
  ...
  compute:
    cpu: 1
    gpu: 1
    mem: 1G
```

CPU, GPU, Inf, and memory requests in Cortex correspond to compute resource requests in Kubernetes. In the example above, the API will only be scheduled once 1 CPU, 1 GPU, and 1G of memory are available on any instance, and it will be guaranteed to have access to those resources throughout its execution. In some cases, resource requests can be (or may default to) `Null`.

## CPU

One unit of CPU corresponds to one virtual CPU on AWS. Fractional requests are allowed, and can be specified as a floating point number or via the "m" suffix (`0.2` and `200m` are equivalent).

## GPU

One unit of GPU corresponds to one virtual GPU. Fractional requests are not allowed.

See the [GPU documentation](gpus.md) for more information.

## Memory

One unit of memory is one byte. Memory can be expressed as an integer or by using one of these suffixes: `K`, `M`, `G`, `T` (or their power-of-two counterparts: `Ki`, `Mi`, `Gi`, `Ti`). For example, the following values represent roughly the same memory: `128974848`, `129e6`, `129M`, `123Mi`.

## Inf

One unit of Inf corresponds to one Inferentia ASIC with 4 NeuronCores _(not the same thing as `cpu`)_ and 8GB of cache memory _(not the same thing as `mem`)_. Fractional requests are not allowed.
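
Putting the suffixes above together, a compute block that mixes a fractional CPU request, a binary memory suffix, and an Inferentia chip might look like the following sketch (the API name and values are illustrative, not a recommendation):

```yaml
# cortex.yaml (illustrative)

- name: my-api        # hypothetical API name
  ...
  compute:
    cpu: 200m         # equivalent to 0.2 virtual CPUs
    mem: 512Mi        # 512 * 1024^2 bytes; 512M (512 * 1000^2 bytes) would also be accepted
    inf: 1            # one Inferentia ASIC (4 NeuronCores); only relevant on inf1 instances
```
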

docs/advanced/gpus.md

Lines changed: 34 additions & 0 deletions
@@ -0,0 +1,34 @@
# Using GPUs

To use GPUs:

1. Make sure your AWS account is subscribed to the [EKS-optimized AMI with GPU Support](https://aws.amazon.com/marketplace/pp/B07GRHFXGM).
2. You may need to [file an AWS support ticket](https://console.aws.amazon.com/support/cases#/create?issueType=service-limit-increase&limitType=ec2-instances) to increase the limit for your desired instance type.
3. Set the instance type to an AWS GPU instance (e.g. `g4dn.xlarge`) when installing Cortex.
4. Set the `gpu` field in the `compute` configuration for your API (see the sketch below). One unit of GPU corresponds to one virtual GPU. Fractional requests are not allowed.
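
An API that requests a single GPU might declare a compute block like this (an illustrative sketch; the API name and the CPU/memory values are placeholders):

```yaml
# cortex.yaml (illustrative)

- name: my-api      # hypothetical API name
  ...
  compute:
    gpu: 1          # one virtual GPU; fractional values are not allowed
    cpu: 1
    mem: 4G
```
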
## Tips

### If using `processes_per_replica` > 1, TensorFlow-based models, and the Python Predictor

When using `processes_per_replica` > 1 with TensorFlow-based models (including Keras) in the Python Predictor, loading the model in separate processes at the same time will throw a `CUDA_ERROR_OUT_OF_MEMORY: out of memory` error. This is because the first process that loads the model will allocate all of the GPU's memory and leave none for the other processes. To prevent this from happening, the per-process GPU memory usage can be limited. There are two methods:

1\) Configure the model to allocate only as much memory as it requires, via [tf.config.experimental.set\_memory\_growth\(\)](https://www.tensorflow.org/api_docs/python/tf/config/experimental/set_memory_growth):

```python
import tensorflow as tf

for gpu in tf.config.list_physical_devices("GPU"):
    tf.config.experimental.set_memory_growth(gpu, True)
```

2\) Impose a hard limit on how much memory the model can use, via [tf.config.set\_logical\_device\_configuration\(\)](https://www.tensorflow.org/api_docs/python/tf/config/set_logical_device_configuration):

```python
import tensorflow as tf

mem_limit_mb = 1024
for gpu in tf.config.list_physical_devices("GPU"):
    tf.config.set_logical_device_configuration(
        gpu, [tf.config.LogicalDeviceConfiguration(memory_limit=mem_limit_mb)]
    )
```

See the [TensorFlow GPU guide](https://www.tensorflow.org/guide/gpu) and this [blog post](https://medium.com/@starriet87/tensorflow-2-0-wanna-limit-gpu-memory-10ad474e2528) for additional information.

docs/advanced/inferentia.md

Lines changed: 79 additions & 0 deletions
@@ -0,0 +1,79 @@
# Using Inferentia

To use [Inferentia ASICs](https://aws.amazon.com/machine-learning/inferentia/):

1. You may need to [file an AWS support ticket](https://console.aws.amazon.com/support/cases#/create?issueType=service-limit-increase&limitType=ec2-instances) to increase the limit for your desired instance type.
2. Set the instance type to an AWS Inferentia instance (e.g. `inf1.xlarge`) when creating your Cortex cluster.
3. Set the `inf` field in the `compute` configuration for your API. One unit of `inf` corresponds to one Inferentia ASIC with 4 NeuronCores _(not the same thing as `cpu`)_ and 8GB of cache memory _(not the same thing as `mem`)_. Fractional requests are not allowed.

## Neuron

Inferentia ASICs come in different sizes depending on the instance type:

* `inf1.xlarge`/`inf1.2xlarge` - each has 1 Inferentia ASIC
* `inf1.6xlarge` - has 4 Inferentia ASICs
* `inf1.24xlarge` - has 16 Inferentia ASICs

Each Inferentia ASIC comes with 4 NeuronCores and 8GB of cache memory. To better understand how Inferentia ASICs work, read these [technical notes](https://github.com/aws/aws-neuron-sdk/blob/master/docs/technotes/README.md) and this [FAQ](https://github.com/aws/aws-neuron-sdk/blob/master/FAQ.md).

### NeuronCore Groups

A [NeuronCore Group](https://github.com/aws/aws-neuron-sdk/blob/master/docs/tensorflow-neuron/tutorial-NeuronCore-Group.md) (NCG) is a set of NeuronCores that is used to load and run a compiled model. NCGs exist to aggregate NeuronCores to improve hardware performance. Models can be shared within an NCG, but this would require the device driver to dynamically context switch between the models, which degrades performance. Therefore we've decided to only allow one model per NCG (unless you are using a [multi-model endpoint](../guides/multi-model.md), in which case there will be multiple models on a single NCG, and there will be context switching).

Each Cortex API process will have its own copy of the model and will run on its own NCG (for Realtime APIs, the number of API processes is configured by the [`processes_per_replica`](../deployments/realtime-api/autoscaling.md#replica-parallelism) field in the API configuration). Each NCG will have an equal share of NeuronCores. Therefore, the size of each NCG will be `4 * inf / processes_per_replica` (`inf` refers to your API's `compute` request, and it's multiplied by 4 because there are 4 NeuronCores per Inferentia chip).

For example, if your API requests 2 `inf` chips, there will be 8 NeuronCores available. If you set `processes_per_replica` to 1, there will be one copy of your model running on a single NCG of size 8 NeuronCores. If `processes_per_replica` is 2, there will be two copies of your model, each running on a separate NCG of size 4 NeuronCores. If `processes_per_replica` is 4, there will be 4 NCGs of size 2 NeuronCores, and if `processes_per_replica` is 8, there will be 8 NCGs of size 1 NeuronCore. In this scenario, these are the only valid values for `processes_per_replica`. In other words, the total number of requested NeuronCores (which equals 4 * the number of requested Inferentia chips) must be divisible by `processes_per_replica`.
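
As a configuration sketch of the `processes_per_replica: 4` case above (the API name is hypothetical, and placing `processes_per_replica` under `predictor` is an assumption of this sketch; see the Realtime API configuration docs for its exact location):

```yaml
# cortex.yaml (illustrative)

- name: my-api
  predictor:
    ...
    processes_per_replica: 4   # 4 * inf / processes_per_replica = 2 NeuronCores per NCG
  compute:
    inf: 2                     # 2 Inferentia ASICs = 8 NeuronCores
    cpu: 1
    mem: 4G
```
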
The 8GB cache memory is shared between all 4 NeuronCores of an Inferentia chip. Therefore an NCG with 8 NeuronCores (i.e. 2 Inf chips) will have access to 16GB of cache memory. An NCG with 2 NeuronCores will have access to 8GB of cache memory, which will be shared with the other NCG of size 2 running on the same Inferentia chip.

### Compiling models

Before a model can be deployed on Inferentia chips, it must be compiled for Inferentia. The Neuron compiler can be used to convert a regular TensorFlow SavedModel or PyTorch model into the hardware-specific instruction set for Inferentia. Inferentia currently supports compiled models from TensorFlow and PyTorch.

By default, the Neuron compiler will compile a model to use 1 NeuronCore, but it can be manually set to a different size (1, 2, 4, etc.).

For optimal performance, your model should be compiled to run on the number of NeuronCores available to it. The number of NeuronCores will be `4 * inf / processes_per_replica` (`inf` refers to your API's `compute` request, and it's multiplied by 4 because there are 4 NeuronCores per Inferentia chip). See [NeuronCore Groups](inferentia.md#neuron-core-groups) above for an example, and see [Improving performance](inferentia.md#improving-performance) below for a discussion of choosing the appropriate number of NeuronCores.

Here is an example of compiling a TensorFlow SavedModel for Inferentia:

```python
import tensorflow.neuron as tfn

tfn.saved_model.compile(
    model_dir,
    compiled_model_dir,
    batch_size,
    compiler_args=["--num-neuroncores", "1"],
)
```

Here is an example of compiling a PyTorch model for Inferentia:

```python
import torch_neuron, torch

model.eval()
example_input = torch.zeros([batch_size] + input_shape, dtype=torch.float32)
model_neuron = torch.neuron.trace(
    model,
    example_inputs=[example_input],
    compiler_args=["--num-neuroncores", "1"],
)
model_neuron.save(compiled_model)
```

The versions of `tensorflow-neuron` and `torch-neuron` that are used by Cortex are found in the [Realtime API pre-installed packages list](../deployments/realtime-api/predictors.md#inferentia-equipped-apis) and the [Batch API pre-installed packages list](../deployments/batch-api/predictors.md#inferentia-equipped-apis). When installing these packages with `pip` to compile models of your own, use the extra index URL `--extra-index-url=https://pip.repos.neuron.amazonaws.com`.
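
For instance, installing the TensorFlow-side compilation packages locally might look like the following sketch (`neuron-cc` provides the Neuron compiler itself; pin versions to match the pre-installed package lists linked above):

```text
$ pip install --extra-index-url=https://pip.repos.neuron.amazonaws.com \
    neuron-cc tensorflow-neuron
```
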
A list of model compilation examples for Inferentia can be found in the [`aws/aws-neuron-sdk`](https://github.com/aws/aws-neuron-sdk) repo for [TensorFlow](https://github.com/aws/aws-neuron-sdk/blob/master/docs/tensorflow-neuron/) and for [PyTorch](https://github.com/aws/aws-neuron-sdk/blob/master/docs/pytorch-neuron/README.md). Here are 2 examples implemented with Cortex:

1. [ResNet50 in TensorFlow](https://github.com/cortexlabs/cortex/tree/0.22/examples/tensorflow/image-classifier-resnet50)
2. [ResNet50 in PyTorch](https://github.com/cortexlabs/cortex/tree/0.22/examples/pytorch/image-classifier-resnet50)

### Improving performance

A few things can be done to improve performance using compiled models on Cortex:

1. There's a minimum number of NeuronCores for which a model can be compiled, and that number depends on the model's architecture. Generally, compiling a model for more cores than its required minimum helps to distribute the model's operators across multiple cores, which in turn [can lead to lower latency](https://github.com/aws/aws-neuron-sdk/blob/master/docs/technotes/neuroncore-pipeline.md). However, compiling a model for more NeuronCores means that you'll have to set `processes_per_replica` lower so that the NeuronCore Group has access to the number of NeuronCores for which you compiled your model. This is acceptable if latency is your top priority, but if throughput is more important to you, this tradeoff is usually not worth it. To maximize throughput, compile your model for as few NeuronCores as possible and increase `processes_per_replica` to the maximum possible (see above for a sample calculation).
2. Try to achieve near [100% placement](https://github.com/aws/aws-neuron-sdk/blob/b28262e3072574c514a0d72ad3fe5ca48686d449/src/examples/tensorflow/keras_resnet50/pb2sm_compile.py#L59) of your model's graph onto the NeuronCores. During the compilation phase, any operators that can't execute on NeuronCores will be compiled to execute on the machine's CPU and memory instead. Even if just a few percent of the operations reside on the host's CPU/memory, the maximum throughput of the instance can be significantly limited.
3. Use the [`--static-weights` compiler option](https://github.com/aws/aws-neuron-sdk/blob/master/docs/technotes/performance-tuning.md#compiling-for-pipeline-optimization) when possible. This option tells the compiler to cache the entire model onto the NeuronCores, which avoids a lot of back-and-forth between the machine's CPU/memory and the Inferentia ASICs. See the sketch after this list.
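
For example, the TensorFlow compilation call from above might pass the flag through `compiler_args` (a sketch; this assumes the flag is forwarded to the Neuron compiler in the same way as `--num-neuroncores`, and flag support may vary by Neuron SDK version):

```python
import tensorflow.neuron as tfn

tfn.saved_model.compile(
    model_dir,
    compiled_model_dir,
    batch_size,
    # cache the entire model on the NeuronCores (when the model fits)
    compiler_args=["--num-neuroncores", "1", "--static-weights"],
)
```
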

docs/advanced/networking.md

Lines changed: 119 additions & 0 deletions
@@ -0,0 +1,119 @@
# Networking

![api architecture diagram](https://user-images.githubusercontent.com/808475/84695323-8507dd00-aeff-11ea-8b32-5a55cef76c79.png)

APIs are deployed with a public API Gateway by default (the API Gateway forwards requests to the API load balancer). Each API can be independently configured to not create the API Gateway endpoint by setting `api_gateway: none` in the `networking` field of the [Realtime API configuration](../deployments/realtime-api/api-configuration.md) and [Batch API configuration](../deployments/batch-api/api-configuration.md). If the API Gateway endpoint is not created, your API can still be accessed via the API load balancer; `cortex get API_NAME` will show the load balancer endpoint if API Gateway is disabled. API Gateway is enabled by default, and is generally recommended unless it doesn't support your use case due to limitations such as the 29 second request timeout, or if you are keeping your APIs private to your VPC. See below for common configurations. To disable API Gateway cluster-wide (thereby enforcing that all APIs cannot create API Gateway endpoints), set `api_gateway: none` in your [cluster configuration](../cluster-management/config.md) file (before creating your cluster), as sketched below.
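
Disabling API Gateway cluster-wide is a single field in the cluster configuration (a minimal sketch; all other cluster settings are omitted):

```yaml
# cluster.yaml (illustrative; other fields omitted)

api_gateway: none  # no API in this cluster will create an API Gateway endpoint
```
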
By default, the API load balancer is public. You can configure your API load balancer to be private by setting `api_load_balancer_scheme: internal` in your [cluster configuration](../cluster-management/config.md) file (before creating your cluster). This will force external traffic to go through your API Gateway endpoint, or, if you disabled API Gateway for your API, it will make your API only accessible through VPC Peering. Note that if API Gateway is used, endpoints will be public regardless of `api_load_balancer_scheme`. See below for common configurations.

The API Gateway that Cortex creates in AWS is the "HTTP" type. If you need to use AWS's "REST" API Gateway, see [here](../guides/rest-api-gateway.md).

## Common API networking configurations

### Public https endpoint (with API Gateway)

This is the most common configuration for public APIs. [Custom domains](../guides/custom-domain.md) can be used with this setup, but are not required.

```yaml
# cluster.yaml

api_load_balancer_scheme: internal
```

```yaml
# cortex.yaml

- name: my-api
  ...
  networking:
    api_gateway: public  # this is the default, so can be omitted
```

### Private https endpoint

You can configure your API to be private. If you do this, you must use [VPC Peering](../guides/vpc-peering.md) to connect to your APIs.

The SSL certificate on the API load balancer is autogenerated during installation using `localhost` as the Common Name (CN). Therefore, clients will need to skip certificate verification when making HTTPS requests (e.g. `curl -k`). Alternatively, you can set up a [custom domain](../guides/custom-domain.md), which will use ACM to provision SSL certs for your domain.

```yaml
# cluster.yaml

api_load_balancer_scheme: internal  # this is the default, so can be omitted

# use this to configure a custom domain
# if you don't use a custom domain, clients will need to skip certificate verification when making HTTPS requests (e.g. `curl -k`)
ssl_certificate_arn: arn:aws:acm:us-west-2:***:certificate/***
```

```yaml
# cortex.yaml

- name: my-api
  ...
  networking:
    api_gateway: none
```
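
With this setup, a client inside the peered VPC might call the API via the load balancer endpoint reported by `cortex get`, skipping certificate verification (a sketch; the endpoint, API name, and payload are placeholders):

```text
$ curl -k https://<api-load-balancer-endpoint>/my-api \
    -X POST -H "Content-Type: application/json" \
    -d '{"text": "hello world"}'
```
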
### Private http endpoint

You can configure your API to be private. If you do this, you must use [VPC Peering](../guides/vpc-peering.md) to connect to your APIs.

```yaml
# cluster.yaml

api_load_balancer_scheme: internal  # this is the default, so can be omitted
```

```yaml
# cortex.yaml

- name: my-api
  ...
  networking:
    api_gateway: none
```

### Public https endpoint (without API Gateway)

API Gateway is generally recommended for public https APIs, but there may be situations where you don't wish to use it (e.g. requests take longer than 29 seconds to complete, which is the maximum for API Gateway). In this case, clients can connect directly to the API load balancer.

The SSL certificate on the API load balancer is autogenerated during installation using `localhost` as the Common Name (CN). Therefore, clients will need to skip certificate verification when making HTTPS requests (e.g. `curl -k`). Alternatively, you can set up a [custom domain](../guides/custom-domain.md), which will use ACM to provision SSL certs for your domain.

```yaml
# cluster.yaml

api_load_balancer_scheme: internet-facing  # this is the default, so can be omitted

# use this to configure a custom domain
# if you don't use a custom domain, clients will need to skip certificate verification when making HTTPS requests (e.g. `curl -k`)
ssl_certificate_arn: arn:aws:acm:us-west-2:***:certificate/***
```

```yaml
# cortex.yaml

- name: my-api
  ...
  networking:
    api_gateway: none
```

### Public http endpoint

If you don't wish to use https for your public API, you can simply disable API Gateway (your API will be accessed directly via the API load balancer):

```yaml
# cluster.yaml

api_load_balancer_scheme: internet-facing  # this is the default, so can be omitted
```

```yaml
# cortex.yaml

- name: my-api
  ...
  networking:
    api_gateway: none
```
