
Commit 9b7244f

EthanV431 and stevhliu authored
standardized YOLOS model card according to template in #36979 (#39528)
* standardized YOLOS model card according to template in #36979
* Update docs/source/en/model_doc/yolos.md (×6, review suggestions co-authored by Steven Liu <59462357+stevhliu@users.noreply.github.com>)
* standardized YOLOS model card according to template in #36979
* Update docs/source/en/model_doc/yolos.md (×6, review suggestions co-authored by Steven Liu <59462357+stevhliu@users.noreply.github.com>)
* replaced YOLOS architecture image, deleted quantization and AttentionMaskVisualizer sections
* removed cli section
* Update yolos.md

---------

Co-authored-by: Steven Liu <59462357+stevhliu@users.noreply.github.com>
1 parent ec8a09a commit 9b7244f

File tree

1 file changed

+64 -45 lines changed


docs/source/en/model_doc/yolos.md

Lines changed: 64 additions & 45 deletions
@@ -13,76 +13,95 @@ specific language governing permissions and limitations under the License.
 rendered properly in your Markdown viewer.
 
 -->
-
-# YOLOS
-
-<div class="flex flex-wrap space-x-1">
-<img alt="PyTorch" src="https://img.shields.io/badge/PyTorch-DE3412?style=flat&logo=pytorch&logoColor=white">
-<img alt="FlashAttention" src="https://img.shields.io/badge/%E2%9A%A1%EF%B8%8E%20FlashAttention-eae0c8?style=flat">
-<img alt="SDPA" src="https://img.shields.io/badge/SDPA-DE3412?style=flat&logo=pytorch&logoColor=white">
+<div style="float: right;">
+<div class="flex flex-wrap space-x-1">
+<img alt="PyTorch" src="https://img.shields.io/badge/PyTorch-DE3412?style=flat&logo=pytorch&logoColor=white">
+<img alt="FlashAttention" src="https://img.shields.io/badge/%E2%9A%A1%EF%B8%8E%20FlashAttention-eae0c8?style=flat">
+<img alt="SDPA" src="https://img.shields.io/badge/SDPA-DE3412?style=flat&logo=pytorch&logoColor=white">
+</div>
 </div>
 
-## Overview
+# YOLOS
 
-The YOLOS model was proposed in [You Only Look at One Sequence: Rethinking Transformer in Vision through Object Detection](https://huggingface.co/papers/2106.00666) by Yuxin Fang, Bencheng Liao, Xinggang Wang, Jiemin Fang, Jiyang Qi, Rui Wu, Jianwei Niu, Wenyu Liu.
-YOLOS proposes to just leverage the plain [Vision Transformer (ViT)](vit) for object detection, inspired by DETR. It turns out that a base-sized encoder-only Transformer can also achieve 42 AP on COCO, similar to DETR and much more complex frameworks such as Faster R-CNN.
+[YOLOS](https://huggingface.co/papers/2106.00666) uses a [Vision Transformer (ViT)](./vit) for object detection with minimal modifications and region priors. It can achieve performance comparable to specialized object detection models and frameworks with knowledge about 2D spatial structures.
 
-The abstract from the paper is the following:
 
-*Can Transformer perform 2D object- and region-level recognition from a pure sequence-to-sequence perspective with minimal knowledge about the 2D spatial structure? To answer this question, we present You Only Look at One Sequence (YOLOS), a series of object detection models based on the vanilla Vision Transformer with the fewest possible modifications, region priors, as well as inductive biases of the target task. We find that YOLOS pre-trained on the mid-sized ImageNet-1k dataset only can already achieve quite competitive performance on the challenging COCO object detection benchmark, e.g., YOLOS-Base directly adopted from BERT-Base architecture can obtain 42.0 box AP on COCO val. We also discuss the impacts as well as limitations of current pre-train schemes and model scaling strategies for Transformer in vision through YOLOS.*
+You can find all the original YOLOS checkpoints under the [HUST Vision Lab](https://huggingface.co/hustvl/models?search=yolos) organization.
 
-<img src="https://huggingface.co/datasets/huggingface/documentation-images/resolve/main/yolos_architecture.png"
-alt="drawing" width="600"/>
+<img src="https://huggingface.co/datasets/huggingface/documentation-images/resolve/main/yolos_architecture.png" alt="drawing" width="600"/>
 
 <small> YOLOS architecture. Taken from the <a href="https://huggingface.co/papers/2106.00666">original paper</a>.</small>
 
-This model was contributed by [nielsr](https://huggingface.co/nielsr). The original code can be found [here](https://github.com/hustvl/YOLOS).
 
-## Using Scaled Dot Product Attention (SDPA)
+> [!TIP]
+> This model was contributed by [nielsr](https://huggingface.co/nielsr).
+> Click on the YOLOS models in the right sidebar for more examples of how to apply YOLOS to different object detection tasks.
 
-PyTorch includes a native scaled dot-product attention (SDPA) operator as part of `torch.nn.functional`. This function
-encompasses several implementations that can be applied depending on the inputs and the hardware in use. See the
-[official documentation](https://pytorch.org/docs/stable/generated/torch.nn.functional.scaled_dot_product_attention.html)
-or the [GPU Inference](https://huggingface.co/docs/transformers/main/en/perf_infer_gpu_one#pytorch-scaled-dot-product-attention)
-page for more information.
+The example below demonstrates how to detect objects with [`Pipeline`] or the [`AutoModel`] class.
 
-SDPA is used by default for `torch>=2.1.1` when an implementation is available, but you may also set
-`attn_implementation="sdpa"` in `from_pretrained()` to explicitly request SDPA to be used.
+<hfoptions id="usage">
+<hfoption id="Pipeline">
 
-```
-from transformers import AutoModelForObjectDetection
-model = AutoModelForObjectDetection.from_pretrained("hustvl/yolos-base", attn_implementation="sdpa", torch_dtype=torch.float16)
-...
+```py
+import torch
+from transformers import pipeline
+
+detector = pipeline(
+    task="object-detection",
+    model="hustvl/yolos-base",
+    torch_dtype=torch.float16,
+    device=0
+)
+detector("https://huggingface.co/datasets/Narsil/image_dummy/raw/main/parrots.png")
 ```
 
-For the best speedups, we recommend loading the model in half-precision (e.g. `torch.float16` or `torch.bfloat16`).
+</hfoption>
+<hfoption id="Automodel">
 
-On a local benchmark (A100-40GB, PyTorch 2.3.0, OS Ubuntu 22.04) with `float32` and `hustvl/yolos-base` model, we saw the following speedups during inference.
+```py
+import torch
+from PIL import Image
+import requests
+from transformers import AutoImageProcessor, AutoModelForObjectDetection
 
-| Batch size | Average inference time (ms), eager mode | Average inference time (ms), sdpa model | Speed up, Sdpa / Eager (x) |
-|--------------|-------------------------------------------|-------------------------------------------|------------------------------|
-| 1 | 106 | 76 | 1.39 |
-| 2 | 154 | 90 | 1.71 |
-| 4 | 222 | 116 | 1.91 |
-| 8 | 368 | 168 | 2.19 |
+processor = AutoImageProcessor.from_pretrained("hustvl/yolos-base")
+model = AutoModelForObjectDetection.from_pretrained("hustvl/yolos-base", torch_dtype=torch.float16, attn_implementation="sdpa").to("cuda")
 
-## Resources
+url = "https://huggingface.co/datasets/Narsil/image_dummy/raw/main/parrots.png"
+image = Image.open(requests.get(url, stream=True).raw).convert("RGB")
+inputs = processor(images=image, return_tensors="pt").to("cuda")
+
+with torch.no_grad():
+    outputs = model(**inputs)
+logits = outputs.logits.softmax(-1)
+scores, labels = logits[..., :-1].max(-1)
+boxes = outputs.pred_boxes
+
+threshold = 0.3
+keep = scores[0] > threshold
 
-A list of official Hugging Face and community (indicated by 🌎) resources to help you get started with YOLOS.
+filtered_scores = scores[0][keep]
+filtered_labels = labels[0][keep]
+filtered_boxes = boxes[0][keep]
 
-<PipelineTag pipeline="object-detection"/>
+width, height = image.size
+pixel_boxes = filtered_boxes * torch.tensor([width, height, width, height], device=boxes.device)
 
-- All example notebooks illustrating inference + fine-tuning [`YolosForObjectDetection`] on a custom dataset can be found [here](https://github.com/NielsRogge/Transformers-Tutorials/tree/master/YOLOS).
-- Scripts for finetuning [`YolosForObjectDetection`] with [`Trainer`] or [Accelerate](https://huggingface.co/docs/accelerate/index) can be found [here](https://github.com/huggingface/transformers/tree/main/examples/pytorch/object-detection).
-- See also: [Object detection task guide](../tasks/object_detection)
+for score, label, box in zip(filtered_scores, filtered_labels, pixel_boxes):
+    x0, y0, x1, y1 = box.tolist()
+    print(f"Label {model.config.id2label[label.item()]}: {score:.2f} at [{x0:.0f}, {y0:.0f}, {x1:.0f}, {y1:.0f}]")
+```
+
+</hfoption>
+</hfoptions>
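The added `AutoModel` example post-processes the raw outputs by hand (softmax, thresholding, box rescaling). As an aside that is not part of this commit's diff, the same result can typically be obtained with the image processor's `post_process_object_detection` method; the sketch below reuses `processor`, `model`, `outputs`, and `image` from that snippet.

```py
# Sketch only, not from the diff: let the image processor handle thresholding and
# conversion of normalized (cx, cy, w, h) boxes to absolute (x0, y0, x1, y1) pixels.
results = processor.post_process_object_detection(
    outputs, threshold=0.3, target_sizes=[(image.height, image.width)]
)[0]  # one result dict per image, with "scores", "labels", and "boxes"

for score, label, box in zip(results["scores"], results["labels"], results["boxes"]):
    x0, y0, x1, y1 = box.tolist()
    print(f"Label {model.config.id2label[label.item()]}: {score:.2f} at [{x0:.0f}, {y0:.0f}, {x1:.0f}, {y1:.0f}]")
```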
 
-If you're interested in submitting a resource to be included here, please feel free to open a Pull Request and we'll review it! The resource should ideally demonstrate something new instead of duplicating an existing resource.
 
-<Tip>
+## Notes
+- Use [`YolosImageProcessor`] for preparing images (and optional targets) for the model. Contrary to [DETR](./detr), YOLOS doesn't require a `pixel_mask`.
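The Notes entry just above mentions that [`YolosImageProcessor`] also accepts optional targets. For reference, here is a minimal, hedged sketch (not part of this commit) of how COCO detection-format targets are typically passed when preparing data for fine-tuning; the annotation values are placeholders.

```py
# Sketch only: prepare one image plus a COCO detection-style target for training.
# bbox is [x, y, width, height] in pixels; category_id and area values are made up.
import requests
from PIL import Image
from transformers import AutoImageProcessor

processor = AutoImageProcessor.from_pretrained("hustvl/yolos-base")
url = "https://huggingface.co/datasets/Narsil/image_dummy/raw/main/parrots.png"
image = Image.open(requests.get(url, stream=True).raw).convert("RGB")

annotations = {
    "image_id": 0,
    "annotations": [
        {"bbox": [10.0, 20.0, 200.0, 150.0], "category_id": 16, "area": 30000.0, "iscrowd": 0},
    ],
}
encoding = processor(images=image, annotations=annotations, return_tensors="pt")
print(encoding["pixel_values"].shape)  # model input
print(encoding["labels"][0].keys())    # per-image training target (class labels, normalized boxes, ...)
```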
 
-Use [`YolosImageProcessor`] for preparing images (and optional targets) for the model. Contrary to [DETR](detr), YOLOS doesn't require a `pixel_mask` to be created.
+## Resources
 
-</Tip>
+- Refer to these [notebooks](https://github.com/NielsRogge/Transformers-Tutorials/tree/master/YOLOS) for inference and fine-tuning with [`YolosForObjectDetection`] on a custom dataset.
 
 
 ## YolosConfig
 
0 commit comments
