From 48efd5808043ac5fb699d08afd374e20a180767c Mon Sep 17 00:00:00 2001
From: D15M4S <dismas5184@gmail.com>
Date: Sat, 19 Jul 2025 15:35:39 +0900
Subject: [PATCH 1/8] docs: ko: perf_train_gpu_one.md

---
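Reviewer note: the gradient accumulation example added by this patch is commented "effective batch size of 64" for `per_device_train_batch_size=4` and `gradient_accumulation_steps=16`. A minimal sketch of that arithmetic, assuming a single GPU by default; the helper `effective_batch_size` is hypothetical and is not part of Transformers.

```python
def effective_batch_size(per_device_batch_size: int,
                         gradient_accumulation_steps: int,
                         num_devices: int = 1) -> int:
    """Number of samples that contribute to one optimizer step.

    Hypothetical helper for illustration only; not a Transformers API.
    """
    return per_device_batch_size * gradient_accumulation_steps * num_devices


# The two rows of the comparison table in the gradient accumulation section.
print(effective_batch_size(4, 16))
print(effective_batch_size(1, 64))
```

Both configurations reach the same effective batch size; the doc prefers the larger per-device batch because it makes better use of the GPU.
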
 docs/source/ko/_toctree.yml          |   4 +-
 docs/source/ko/perf_train_gpu_one.md | 296 +++++++++++++++++++++++++++
 2 files changed, 298 insertions(+), 2 deletions(-)
 create mode 100644 docs/source/ko/perf_train_gpu_one.md

diff --git a/docs/source/ko/_toctree.yml b/docs/source/ko/_toctree.yml
index 75632c1f59a5..f1b649e053e0 100644
--- a/docs/source/ko/_toctree.yml
+++ b/docs/source/ko/_toctree.yml
@@ -176,8 +176,8 @@
   - local: llm_optims
     title: LLM inference optimization
   - sections:
-    - local: in_translation
-      title: (in translation) Methods and tools for efficient training on a single GPU
+    - local: perf_train_gpu_one
+      title: GPU
     - local: perf_train_gpu_many
       title: Training on multiple GPUs
     - local: deepspeed
diff --git a/docs/source/ko/perf_train_gpu_one.md b/docs/source/ko/perf_train_gpu_one.md
new file mode 100644
index 000000000000..e8cc6ec7579e
--- /dev/null
+++ b/docs/source/ko/perf_train_gpu_one.md
@@ -0,0 +1,296 @@
+<!--Copyright 2024 The HuggingFace Team. All rights reserved.
+
+Licensed under the Apache License, Version 2.0 (the "License"); you may not use this file except in compliance with
+the License. You may obtain a copy of the License at
+
+http://www.apache.org/licenses/LICENSE-2.0
+
+Unless required by applicable law or agreed to in writing, software distributed under the License is distributed on
+an "AS IS" BASIS, WITHOUT WARRANTIES OR CONDITIONS OF ANY KIND, either express or implied. See the License for the specific language governing permissions and limitations under the License.
+
+⚠️ Note that this file is in Markdown but contains specific syntax for our doc-builder (similar to MDX) that may not be
+rendered properly in your Markdown viewer.
+
+-->
+
+# GPU
+
+GPUs are commonly used to train deep learning models due to their high memory bandwidth and parallel processing capabilities. Depending on your GPU and model size, it is possible to even train models with billions of parameters. The key is to find the right balance between GPU memory utilization (data throughput/training time) and training speed.
+
+This guide will show you the features available in Transformers and PyTorch for efficiently training a model on GPUs. In many cases, you'll want to use a combination of these features to optimize training.
+
+Refer to the table below to quickly help you identify the features relevant to your training scenario.
+
+| Feature | Training speed | Memory usage |
+|---|---|---|
+| batch size | yes | yes |
+| gradient accumulation | no | yes |
+| gradient checkpointing | no | yes |
+| mixed precision | yes | depends |
+| optimizers | yes | yes |
+| data preloading | yes | no |
+| torch_empty_cache_steps | no | yes |
+| torch.compile | yes | no |
+| scaled dot product attention (SDPA) | yes | yes |
+
+## Trainer
+
+[Trainer](./trainer) supports many useful training features that can be configured through [`TrainingArguments`]. This section highlights some of the more important features for optimizing training.
+
+### Batch size
+
+Batch size is one of the most important hyperparameters for efficient GPU training because it affects memory usage and training speed. Larger batch sizes lead to faster training because they take advantage of a GPU's parallel processing power. It is recommended to use batch sizes that are powers of 2, such as 8, 64, 128, 256, 512, etc. The batch size depends on your GPU and the model's data type.
+
+Configure [`~TrainingArguments.per_device_train_batch_size`] in [`TrainingArguments`].
+
+```py
+from transformers import TrainingArguments
+
+args = TrainingArguments(
+    per_device_train_batch_size=256,
+    per_device_eval_batch_size=256,
+)
+```
+
+Refer to the NVIDIA [Performance](https://docs.nvidia.com/deeplearning/performance/dl-performance-fully-connected/index.html#input-features) guide to learn more about how input features and output neuron counts and batch size affect performance. These are involved in the General Matrix Multiplications (GEMMs) performed by the GPU. Larger parameters are better for parallelization and efficiency.
+
+The [Tensor Core Requirements](https://docs.nvidia.com/deeplearning/performance/dl-performance-matrix-multiplication/index.html#requirements-tc) section is also useful for selecting a batch size that maximizes the speed of tensor multiplication based on the data type and GPU. For example, multiples of 8 are recommended for fp16, unless it's an A100 GPU, in which case use multiples of 64.
+
+Finally, consider [Dimension Quantization Effects](https://docs.nvidia.com/deeplearning/performance/dl-performance-matrix-multiplication/index.html#dim-quantization) for smaller parameters. Tile quantization results when matrix dimensions aren't divisible by a GPU's thread block tile size, causing the GPU to underutilize its resources. Selecting the correct batch size multiplier, such that the matrix is divisible by the tile size, can significantly speed up training.
+
+### Gradient accumulation
+
+Gradient accumulation overcomes memory constraints - useful for fitting a very large model that otherwise wouldn't fit on a single GPU - by accumulating gradients over multiple mini-batches before updating the parameters. This reduces memory by storing fewer gradients and enables training with a larger *effective batch size* because usually, the parameters are updated from a single batch of data. Training can slow down though due to the additional forward and backward passes introduced by gradient accumulation.
+
+Configure [`~TrainingArguments.gradient_accumulation_steps`] in [`TrainingArguments`] to enable gradient accumulation.
+
+```py
+from transformers import TrainingArguments
+
+# effective batch size of 64
+args = TrainingArguments(
+    per_device_train_batch_size=4,
+    gradient_accumulation_steps=16,
+)
+```
+
+Try to avoid too many gradient accumulation steps because it can really slow down training. Consider the example below, where the maximum batch size that'll fit on your GPU is 4. You should keep your batch size at 4 to better utilize the GPU.
+
+| batch size | gradient accumulation steps | effective batch size |  |
+|---|---|---|---|
+| 1 | 64 | 64 | 👎 |
+| 4 | 16 | 64 | 👍 |
+
+### Gradient checkpointing
+
+Gradient checkpointing reduces memory usage by storing only some of the intermediate activations from the forward pass and recomputing the remaining activations during the backward pass. This avoids storing *all* of the intermediate activations from the forward pass, which can require a lot of memory overhead. However, it comes at the cost of slower training speed (~20%).
+
+Configure [`~TrainingArguments.gradient_checkpointing`] in [`TrainingArguments`] to enable gradient checkpointing.
+
+```py
+from transformers import TrainingArguments
+
+args = TrainingArguments(
+    per_device_train_batch_size=4,
+    gradient_accumulation_steps=16,
+    gradient_checkpointing=True,
+)
+```
+
+### Mixed precision
+
+Mixed precision accelerates training speed by performing some calculations in half-precision (fp16) and some in full-precision (fp32). The half-precision calculations boost training speed because they're not as computationally expensive as performing the calculations in full-precision. Meanwhile, preserving some of the calculations in full-precision maintains accuracy.
+
+There are several data types available for mixed precision training.
+
+<hfoptions id="mixed-precision">
+<hfoption id="fp16">
+
+The main advantage of mixed precision training is saving the activations in fp16.
+
+Configure [`~TrainingArguments.fp16`] in [`TrainingArguments`] to enable mixed precision training with the fp16 data type.
+
+```py
+from transformers import TrainingArguments
+
+args = TrainingArguments(
+    per_device_train_batch_size=4,
+    gradient_accumulation_steps=16,
+    gradient_checkpointing=True,
+    fp16=True,
+)
+```
+
+fp16 isn't memory-optimized because the gradients that are computed in fp16 are converted back to fp32 during the optimization step. You may end up using more GPU memory, especially for small batch sizes, because there are now two versions (fp16 and fp32) of the model on the GPU.
+
+</hfoption>
+<hfoption id="bf16">
+
+[bf16](https://cloud.google.com/blog/products/ai-machine-learning/bfloat16-the-secret-to-high-performance-on-cloud-tpus) trades off some precision for a much larger dynamic range, which is helpful for avoiding overflow and underflow errors. You can use bf16 without adding any loss scaling methods like you would with fp16. bf16 is supported by NVIDIA's Ampere architecture or newer.
+
+Configure [`~TrainingArguments.bf16`] in [`TrainingArguments`] to enable mixed precision training with the bf16 data type.
+
+```py
+from transformers import TrainingArguments
+
+args = TrainingArguments(
+    per_device_train_batch_size=4,
+    gradient_accumulation_steps=16,
+    gradient_checkpointing=True,
+    bf16=True,
+)
+```
+
+</hfoption>
+<hfoption id="tf32">
+
+[tf32](https://blogs.nvidia.com/blog/tensorfloat-32-precision-format/) is a mode on NVIDIA Ampere GPUs that converts the convolution and matrix multiplication inputs to tf32. All other storage and operations are kept in fp32. This allows tf32 to maintain the same range as fp32, the same precision as fp16 and more precision than bf16. Combining tf32 with fp16 or bf16 mixed precision training can improve throughput by up to 16x.
+
+tf32 is enabled by default on NVIDIA Ampere GPUs, but you can also add the code below to your fp32 training or inference code to explicitly enable it.
+
+```py
+import torch
+torch.backends.cuda.matmul.allow_tf32 = True
+torch.backends.cudnn.allow_tf32 = True
+```
+
+Configure [tf32()](https://huggingface.co/docs/transformers/main_classes/trainer#transformers.TrainingArguments.tf32) in [`TrainingArguments`] to enable mixed precision training with tf32 mode.
+
+```py
+from transformers import TrainingArguments
+
+args = TrainingArguments(
+    per_device_train_batch_size=4,
+    gradient_accumulation_steps=16,
+    gradient_checkpointing=True,
+    bf16=True,
+    tf32=True,
+)
+```
+
+</hfoption>
+</hfoptions>
+
+### Optimizers
+
+Transformers implements the [AdamW (adamw_torch)](https://pytorch.org/docs/stable/generated/torch.optim.AdamW.html) optimizer from PyTorch by default. But because it stores a weighted average of past gradients, it requires additional memory proportional to the number of model parameters to store the past gradients. This can be an issue when training very large models, and in such cases, you should consider choosing a different optimizer. For example, if you have [Apex](https://nvidia.github.io/apex/index.html) installed on either [NVIDIA](https://github.com/NVIDIA/apex) or [AMD](https://github.com/ROCm/apex), then using the `adamw_apex_fused` optimizer provides the fastest training for all AdamW optimizers.
+
+Configure [`~TrainingArguments.optim`] in [`TrainingArguments`] to choose an optimizer.
+
+```py
+from transformers import TrainingArguments
+
+args = TrainingArguments(
+    per_device_train_batch_size=4,
+    gradient_accumulation_steps=16,
+    gradient_checkpointing=True,
+    bf16=True,
+    optim="adamw_bnb_8bit"
+)
+```
+
+There are many optimizers to choose from (refer to [OptimizerNames](https://github.com/huggingface/transformers/blob/34f4080ff59b1668d919a1ba9f8bc4a3a2a3f478/src/transformers/training_args.py#L145) for a full supported list) depending on your training scenario. For example, Adafactor can significantly reduce memory requirements by storing a weighted average of a row or column instead of each element in the matrix at the cost of slower convergence. Another example is using an [8-bit AdamW optimizer](https://huggingface.co/docs/bitsandbytes) from bitsandbytes to quantize optimizer states. The optimizer state is stored in a lower precision and dequantized before being used in the optimizer step.
+
+Refer to the [optimizer](./optimizers) guide to learn about more specialized optimizers.
+
+### Data preloading
+
+Data preloading loads and prepares batches of data in advance on the CPU to ensure the GPU is continuously working, reducing GPU idling and increasing utilization. There are two ways to preload data to ensure the GPU is always working.
+
+1. Allocate pinned memory on the CPU to store the data and transfer it directly to the GPU.
+2. Increase the number of CPU threads or workers to preload the data faster.
+
+Configure [`~TrainingArguments.dataloader_pin_memory`] and [`~TrainingArguments.dataloader_num_workers`] in [`TrainingArguments`] to allocate pinned memory and increase the number of workers.
+
+```py
+from transformers import TrainingArguments
+
+args = TrainingArguments(
+    per_device_train_batch_size=4,
+    gradient_accumulation_steps=16,
+    gradient_checkpointing=True,
+    bf16=True,
+    optim="adamw_bnb_8bit",
+    dataloader_pin_memory=True,
+    dataloader_num_workers=4,
+)
+```
+
+## PyTorch
+
+PyTorch provides several features for reducing memory requirements and increasing training speed. These features can often be enabled in Transformers by only adding a few lines of code.
+
+### torch.empty_cache_steps
+
+The [torch.cuda.empty_cache](https://pytorch.org/docs/stable/generated/torch.cuda.empty_cache.html#torch.cuda.empty_cache) function releases unused cached memory, which can help avoid out-of-memory (OOM) errors at the cost of ~10% slower training.
+
+Use [torch_empty_cache_steps()](https://huggingface.co/docs/transformers/main_classes/trainer#transformers.TrainingArguments.torch_empty_cache_steps) in [`TrainingArguments`] to enable it after a certain number of training steps.
+
+```py
+from transformers import TrainingArguments
+
+args = TrainingArguments(
+    per_device_train_batch_size=4,
+    gradient_accumulation_steps=16,
+    gradient_checkpointing=True,
+    bf16=True,
+    optim="adamw_bnb_8bit",
+    dataloader_pin_memory=True,
+    dataloader_num_workers=4,
+    torch_empty_cache_steps=4,
+)
+```
+
+### torch.compile
+
+[torch.compile](https://pytorch.org/tutorials/intermediate/torch_compile_tutorial.html) compiles PyTorch code into optimized kernels that significantly speed up training. This feature relies on TorchDynamo to capture PyTorch graphs with the Frame Evaluation API. The graph can be further compiled into optimized kernels for different backends.
+
+Configure [`~TrainingArguments.torch_compile`] in [`TrainingArguments`] to enable it, and configure [torch_compile_backend()](https://huggingface.co/docs/transformers/main_classes/trainer#transformers.TrainingArguments.torch_compile_backend) to select a backend to use.
+
+```py
+from transformers import TrainingArguments
+
+args = TrainingArguments(
+    per_device_train_batch_size=4,
+    gradient_accumulation_steps=16,
+    gradient_checkpointing=True,
+    bf16=True,
+    optim="adamw_bnb_8bit",
+    dataloader_pin_memory=True,
+    dataloader_num_workers=4,
+    torch_empty_cache_steps=4,
+    torch_compile=True,
+    torch_compile_backend="inductor"
+)
+```
+
+Refer to the table below to help you choose the right backend for your training scenario.
+
+| backend | description | goal |
+|---|---|---|
+| eager | uses PyTorch to run extracted GraphModule | debugging |
+| aot_eager | uses PyTorch eager mode for AOTAutograd's extracted forward and backward graphs | debugging |
+| inductor | uses TorchInductor with AOTAutograd and CUDA Graphs by leveraging Triton kernels | training and inference |
+| nvfuser | uses nvFuser with TorchScript | training and inference |
+| aot_nvfuser | uses nvFuser with AOTAutograd | training and inference |
+| aot_cudagraphs | uses CUDA Graphs with AOTAutograd | training and inference |
+| ofi | uses TorchScript's [optimize_for_inference](https://pytorch.org/docs/stable/generated/torch.jit.optimize_for_inference.html#torch-jit-optimize-for-inference) | inference |
+| fx2trt | uses [Torch-TensorRT](https://pytorch.org/TensorRT/tutorials/getting_started_with_fx_path.html) | inference |
+| onnxrt | uses [ONNX-RT](https://onnxruntime.ai/) for CPU and GPU inference | inference |
+| ipex | uses [IPEX](https://github.com/intel/intel-extension-for-pytorch) for CPU inference | inference |
+
+### Scaled dot product attention
+
+[torch.nn.functional.scaled_dot_product_attention](https://pytorch.org/docs/stable/generated/torch.nn.functional.scaled_dot_product_attention.html) (SDPA) is a native PyTorch implementation of the scaled dot product attention mechanism. SDPA is more efficient and optimized than the original attention mechanism in transformer models. It supports three types of scaled dot product attention.
+
+- [FlashAttention2](https://github.com/Dao-AILab/flash-attention) is automatically enabled for models with the fp16 or bf16 torch type. Make sure to cast your model to the appropriate type first.
+- [xFormers](https://github.com/facebookresearch/xformers) or Memory-Efficient Attention supports models with the fp32 torch type.
+- C++ implementation of scaled dot product attention.
+
+SDPA is enabled by default for PyTorch 2.1.1+, but it can be explicitly enabled by setting `attn_implementation="sdpa"` in [`~PreTrainedModel.from_pretrained`].
+
+```py
+from transformers import AutoModelForCausalLM
+
+model = AutoModelForCausalLM.from_pretrained("meta-llama/Llama-3.1-8B", device_map="auto", attn_implementation="sdpa")
+```

From 0fa1ff53177dbae5cf1ef7ed8aff7493cee78cb6 Mon Sep 17 00:00:00 2001
From: D15M4S <dismas5184@gmail.com>
Date: Mon, 21 Jul 2025 20:01:00 +0900
Subject: [PATCH 2/8] feat: nmt draft

---
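Reviewer note: the batch size section translated in this patch recommends multiples of 8 for fp16 and multiples of 64 on an A100. A rough sketch of rounding a candidate batch size up to such a multiple; `round_up_to_multiple` is a hypothetical helper for illustration, not a Transformers or NVIDIA API, and the multiples are the ones quoted in the doc.

```python
def round_up_to_multiple(batch_size: int, multiple: int) -> int:
    """Round batch_size up to the nearest positive multiple of `multiple`.

    Hypothetical helper for illustration only.
    """
    if batch_size <= 0 or multiple <= 0:
        raise ValueError("batch_size and multiple must be positive")
    # Integer ceiling division, then scale back up.
    return ((batch_size + multiple - 1) // multiple) * multiple


# Multiples quoted in the doc: 8 for fp16, 64 on an A100.
print(round_up_to_multiple(50, 8))
print(round_up_to_multiple(50, 64))
```

Whether the rounded size still fits in memory has to be checked separately on the target GPU.
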
 docs/source/ko/perf_train_gpu_one.md | 204 +++++++++++++--------------
 1 file changed, 96 insertions(+), 108 deletions(-)

diff --git a/docs/source/ko/perf_train_gpu_one.md b/docs/source/ko/perf_train_gpu_one.md
index e8cc6ec7579e..a484aaa1b991 100644
--- a/docs/source/ko/perf_train_gpu_one.md
+++ b/docs/source/ko/perf_train_gpu_one.md
@@ -13,37 +13,37 @@ rendered properly in your Markdown viewer.
 
 -->
 
-# GPU
+# GPU[[gpu]]
 
-GPUs are commonly used to train deep learning models due to their high memory bandwidth and parallel processing capabilities. Depending on your GPU and model size, it is possible to even train models with billions of parameters. The key is to find the right balance between GPU memory utilization (data throughput/training time) and training speed.
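+GPUs are widely used to train deep learning models thanks to their high memory bandwidth and parallel processing capabilities. Depending on your GPU and model size, you can even train models with billions of parameters. The key is to strike the right balance between GPU memory utilization (data throughput/training time) and training speed.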
 
-This guide will show you the features available in Transformers and PyTorch for efficiently training a model on GPUs. In many cases, you'll want to use a combination of these features to optimize training.
+This guide describes the features that Transformers and PyTorch provide for efficiently training a model on GPUs. In most cases, it's best to combine several of these features to optimize training.
 
-Refer to the table below to quickly help you identify the features relevant to your training scenario.
+Refer to the table below to quickly identify the features that fit your training scenario.
 
-| Feature | Training speed | Memory usage |
-|---|---|---|
-| batch size | yes | yes |
-| gradient accumulation | no | yes |
-| gradient checkpointing | no | yes |
-| mixed precision | yes | depends |
-| optimizers | yes | yes |
-| data preloading | yes | no |
-| torch_empty_cache_steps | no | yes |
-| torch.compile | yes | no |
-| scaled dot product attention (SDPA) | yes | yes |
+| Feature                             | Training speed | Memory usage |
+| ----------------------------------- | -------------- | ------------ |
+| batch size                          | yes            | yes          |
+| gradient accumulation               | no             | yes          |
+| gradient checkpointing              | no             | yes          |
+| mixed precision                     | yes            | depends      |
+| optimizers                          | yes            | yes          |
+| data preloading                     | yes            | no           |
+| torch_empty_cache_steps             | no             | yes          |
+| torch.compile                       | yes            | no           |
+| scaled dot product attention (SDPA) | yes            | yes          |
 
-## Trainer
+## Trainer[[trainer]]
 
-[Trainer](./trainer) supports many useful training features that can be configured through [`TrainingArguments`]. This section highlights some of the more important features for optimizing training.
+[Trainer](./trainer) supports a variety of useful training features that can be configured through [`TrainingArguments`]. This section introduces the features that matter most for optimizing training.
 
-### Batch size
+### Batch size[[batch-size]]
 
-Batch size is one of the most important hyperparameters for efficient GPU training because it affects memory usage and training speed. Larger batch sizes lead to faster training because they take advantage of a GPU's parallel processing power. It is recommended to use batch sizes that are powers of 2, such as 8, 64, 128, 256, 512, etc. The batch size depends on your GPU and the model's data type.
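+Batch size is one of the most important hyperparameters for efficient GPU training because it affects both memory usage and training speed. Larger batch sizes train faster because they take advantage of the GPU's parallel processing power. Batch sizes that are powers of 2, such as 8, 64, 128, 256, and 512, are recommended. The right batch size depends on your GPU and the model's data type.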
 
-Configure [`~TrainingArguments.per_device_train_batch_size`] in [`TrainingArguments`].
+Configure [`~TrainingArguments.per_device_train_batch_size`] in [`TrainingArguments`].
 
-```py
+```python
 from transformers import TrainingArguments
 
 args = TrainingArguments(
@@ -52,42 +52,35 @@ args = TrainingArguments(
 )
 ```
 
-Refer to the NVIDIA [Performance](https://docs.nvidia.com/deeplearning/performance/dl-performance-fully-connected/index.html#input-features) guide to learn more about how input features and output neuron counts and batch size affect performance. These are involved in the General Matrix Multiplications (GEMMs) performed by the GPU. Larger parameters are better for parallelization and efficiency.
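+Refer to the NVIDIA [Performance](https://docs.nvidia.com/deeplearning/performance/dl-performance-fully-connected/index.html#input-features) guide to learn how input features, output neuron counts, and batch size affect performance. These are involved in the General Matrix Multiplications (GEMMs) performed by the GPU, and larger parameters improve parallelization and efficiency.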
 
-The [Tensor Core Requirements](https://docs.nvidia.com/deeplearning/performance/dl-performance-matrix-multiplication/index.html#requirements-tc) section is also useful for selecting a batch size that maximizes the speed of tensor multiplication based on the data type and GPU. For example, multiples of 8 are recommended for fp16, unless it's an A100 GPU, in which case use multiples of 64.
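+The [Tensor Core Requirements](https://docs.nvidia.com/deeplearning/performance/dl-performance-matrix-multiplication/index.html#requirements-tc) section is also useful for selecting a batch size that maximizes tensor multiplication speed for your data type and GPU. For example, multiples of 8 are recommended for fp16, and multiples of 64 on an A100 GPU.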
 
-Finally, consider [Dimension Quantization Effects](https://docs.nvidia.com/deeplearning/performance/dl-performance-matrix-multiplication/index.html#dim-quantization) for smaller parameters. Tile quantization results when matrix dimensions aren't divisible by a GPU's thread block tile size, causing the GPU to underutilize its resources. Selecting the correct batch size multiplier, such that the matrix is divisible by the tile size, can significantly speed up training.
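+Finally, consider [Dimension Quantization Effects](https://docs.nvidia.com/deeplearning/performance/dl-performance-matrix-multiplication/index.html#dim-quantization) for smaller parameters. When matrix dimensions aren't divisible by the GPU's thread block tile size, the GPU's resources may be underutilized. Choosing a batch size that makes the matrix divisible by the tile size can significantly speed up training.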
 
-### Gradient accumulation
+### Gradient accumulation[[gradient-accumulation]]
 
-Gradient accumulation overcomes memory constraints - useful for fitting a very large model that otherwise wouldn't fit on a single GPU - by accumulating gradients over multiple mini-batches before updating the parameters. This reduces memory by storing fewer gradients and enables training with a larger *effective batch size* because usually, the parameters are updated from a single batch of data. Training can slow down though due to the additional forward and backward passes introduced by gradient accumulation.
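+Gradient accumulation overcomes memory constraints, making it possible to train a large model that otherwise wouldn't fit on a single GPU. Because gradients are accumulated over multiple mini-batches before the parameters are updated, it saves memory and enables training with a larger effective batch size. However, training can slow down because of the additional forward and backward passes.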
 
-Configure [`~TrainingArguments.gradient_accumulation_steps`] in [`TrainingArguments`] to enable gradient accumulation.
+Configure [`~TrainingArguments.gradient_accumulation_steps`] in [`TrainingArguments`] to enable it.
 
-```py
+```python
 from transformers import TrainingArguments
 
-# effective batch size of 64
+# effective batch size of 64
 args = TrainingArguments(
     per_device_train_batch_size=4,
     gradient_accumulation_steps=16,
 )
 ```
 
-Try to avoid too many gradient accumulation steps because it can really slow down training. Consider the example below, where the maximum batch size that'll fit on your GPU is 4. You should keep your batch size at 4 to better utilize the GPU.
+### Gradient checkpointing[[gradient-checkpointing]]
 
-| batch size | gradient accumulation steps | effective batch size |  |
-|---|---|---|---|
-| 1 | 64 | 64 | 👎 |
-| 4 | 16 | 64 | 👍 |
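+Gradient checkpointing reduces memory usage by storing only some of the intermediate activations from the forward pass and recomputing the rest during the backward pass. This avoids storing all of the forward pass's intermediate activations, which reduces memory overhead. However, training is about 20% slower.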
 
-### Gradient checkpointing
+Configure [`~TrainingArguments.gradient_checkpointing`] in [`TrainingArguments`] to enable it.
 
-Gradient checkpointing reduces memory usage by storing only some of the intermediate activations from the forward pass and recomputing the remaining activations during the backward pass. This avoids storing *all* of the intermediate activations from the forward pass, which can require a lot of memory overhead. However, it comes at the cost of slower training speed (~20%).
-
-Configure [`~TrainingArguments.gradient_checkpointing`] in [`TrainingArguments`] to enable gradient checkpointing.
-
-```py
+```python
 from transformers import TrainingArguments
 
 args = TrainingArguments(
@@ -97,20 +90,18 @@ args = TrainingArguments(
 )
 ```
 
-### Mixed precision
+### Mixed precision[[mixed-precision]]
 
-Mixed precision accelerates training speed by performing some calculations in half-precision (fp16) and some in full-precision (fp32). The half-precision calculations boost training speed because they're not as computationally expensive as performing the calculations in full-precision. Meanwhile, preserving some of the calculations in full-precision maintains accuracy.
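+Mixed precision speeds up training by performing some calculations in half-precision (fp16) and some in full-precision (fp32). Half-precision calculations are faster because they are less computationally expensive than full-precision ones, while keeping some calculations in full-precision preserves accuracy.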
 
-There are several data types available for mixed precision training.
+There are several data types available for mixed precision training.
 
-<hfoptions id="mixed-precision">
-<hfoption id="fp16">
+<hfoptions id="mixed-precision"> <hfoption id="fp16">
+The main advantage of mixed precision training is saving the activations in fp16.
 
-The main advantage of mixed precision training is saving the activations in fp16.
+Configure [`~TrainingArguments.fp16`] in [`TrainingArguments`] to enable mixed precision training with the fp16 data type.
 
-Configure [`~TrainingArguments.fp16`] in [`TrainingArguments`] to enable mixed precision training with the fp16 data type.
-
-```py
+```python
 from transformers import TrainingArguments
 
 args = TrainingArguments(
@@ -121,16 +112,14 @@ args = TrainingArguments(
 )
 ```
 
-fp16 isn't memory-optimized because the gradients that are computed in fp16 are converted back to fp32 during the optimization step. You may end up using more GPU memory, especially for small batch sizes, because there are now two versions (fp16 and fp32) of the model on the GPU.
-
-</hfoption>
-<hfoption id="bf16">
+fp16 may not be memory-optimized. Gradients computed in fp16 are converted back to fp32 during the optimization step, so GPU memory usage can increase, especially for small batch sizes.
 
-[bf16](https://cloud.google.com/blog/products/ai-machine-learning/bfloat16-the-secret-to-high-performance-on-cloud-tpus) trades off some precision for a much larger dynamic range, which is helpful for avoiding overflow and underflow errors. You can use bf16 without adding any loss scaling methods like you would with fp16. bf16 is supported by NVIDIA's Ampere architecture or newer.
+</hfoption> <hfoption id="bf16">
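+[bf16](https://cloud.google.com/blog/products/ai-machine-learning/bfloat16-the-secret-to-high-performance-on-cloud-tpus) trades off some precision for a much wider dynamic range, which helps avoid overflow and underflow errors. Unlike fp16, it can be used without loss scaling. bf16 is supported on NVIDIA's Ampere architecture or newer.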
 
-Configure [`~TrainingArguments.bf16`] in [`TrainingArguments`] to enable mixed precision training with the bf16 data type.
+Configure [`~TrainingArguments.bf16`] in [`TrainingArguments`] to enable mixed precision training with the bf16 data type.
 
-```py
+```python
 from transformers import TrainingArguments
 
 args = TrainingArguments(
@@ -141,22 +130,20 @@ args = TrainingArguments(
 )
 ```
 
-</hfoption>
-<hfoption id="tf32">
+</hfoption> <hfoption id="tf32">
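+[tf32](https://blogs.nvidia.com/blog/tensorfloat-32-precision-format/) is a mode on NVIDIA Ampere GPUs that converts convolution and matrix multiplication inputs to tf32. All other operations and storage are kept in fp32. tf32 keeps the same range as fp32 while offering the precision of fp16, which is more precise than bf16. Combining tf32 with fp16 or bf16 mixed precision training can improve throughput by up to 16x.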
 
-[tf32](https://blogs.nvidia.com/blog/tensorfloat-32-precision-format/) is a mode on NVIDIA Ampere GPUs that converts the convolution and matrix multiplication inputs to tf32. All other storage and operations are kept in fp32. This allows tf32 to maintain the same range as fp32, the same precision as fp16 and more precision than bf16. Combining tf32 with fp16 or bf16 mixed precision training can improve throughput by up to 16x.
+tf32 is enabled by default on NVIDIA Ampere GPUs, but you can also enable it explicitly as shown below.
 
-tf32 is enabled by default on NVIDIA Ampere GPUs, but you can also add the code below to your fp32 training or inference code to explicitly enable it.
-
-```py
+```python
 import torch
 torch.backends.cuda.matmul.allow_tf32 = True
 torch.backends.cudnn.allow_tf32 = True
 ```
 
-Configure [tf32()](https://huggingface.co/docs/transformers/main_classes/trainer#transformers.TrainingArguments.tf32) in [`TrainingArguments`] to enable mixed precision training with tf32 mode.
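+Configure [tf32()](https://huggingface.co/docs/transformers/main_classes/trainer#transformers.TrainingArguments.tf32) in [`TrainingArguments`] to enable mixed precision training with tf32 mode.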
 
-```py
+```python
 from transformers import TrainingArguments
 
 args = TrainingArguments(
@@ -168,16 +155,15 @@ args = TrainingArguments(
 )
 ```
 
-</hfoption>
-</hfoptions>
+</hfoption> </hfoptions>
 
-### Optimizers
+### Optimizers[[optimizers]]
 
-Transformers implements the [AdamW (adamw_torch)](https://pytorch.org/docs/stable/generated/torch.optim.AdamW.html) optimizer from PyTorch by default. But because it stores a weighted average of past gradients, it requires additional memory proportional to the number of model parameters to store the past gradients. This can be an issue when training very large models, and in such cases, you should consider choosing a different optimizer. For example, if you have [Apex](https://nvidia.github.io/apex/index.html) installed on either [NVIDIA](https://github.com/NVIDIA/apex) or [AMD](https://github.com/ROCm/apex), then using the `adamw_apex_fused` optimizer provides the fastest training for all AdamW optimizers.
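+Transformers uses the [AdamW (adamw_torch)](https://pytorch.org/docs/stable/generated/torch.optim.AdamW.html) optimizer from PyTorch by default. Because it stores a weighted average of past gradients, it requires additional memory proportional to the number of model parameters. When training a large model, consider a different optimizer. For example, if you have [Apex](https://nvidia.github.io/apex/index.html) installed, `adamw_apex_fused` provides the fastest training.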
 
-Configure [`~TrainingArguments.optim`] in [`TrainingArguments`] to choose an optimizer.
+Configure [`~TrainingArguments.optim`] in [`TrainingArguments`] to choose an optimizer.
 
-```py
+```python
 from transformers import TrainingArguments
 
 args = TrainingArguments(
@@ -189,20 +175,20 @@ args = TrainingArguments(
 )
 ```
 
-There are many optimizers to choose from (refer to [OptimizerNames](https://github.com/huggingface/transformers/blob/34f4080ff59b1668d919a1ba9f8bc4a3a2a3f478/src/transformers/training_args.py#L145) for a full supported list) depending on your training scenario. For example, Adafactor can significantly reduce memory requirements by storing a weighted average of a row or column instead of each element in the matrix at the cost of slower convergence. Another example is using an [8-bit AdamW optimizer](https://huggingface.co/docs/bitsandbytes) from bitsandbytes to quantize optimizer states. The optimizer state is stored in a lower precision and dequantized before being used in the optimizer step.
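+There are many optimizers to choose from depending on your training scenario (refer to [OptimizerNames](https://github.com/huggingface/transformers/blob/34f4080ff59b1668d919a1ba9f8bc4a3a2a3f478/src/transformers/training_args.py#L145) for the full list). For example, Adafactor saves memory at the cost of slower convergence. Another example is the [8-bit AdamW optimizer](https://huggingface.co/docs/bitsandbytes), which quantizes the optimizer states to reduce memory.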
 
-Refer to the [optimizer](./optimizers) guide to learn about more specialized optimizers.
+Refer to the [optimizer](./optimizers) guide to learn about more specialized optimizers.
 
-### Data preloading
+### Data preloading[[data-preloading]]
 
-Data preloading loads and prepares batches of data in advance on the CPU to ensure the GPU is continuously working, reducing GPU idling and increasing utilization. There are two ways to preload data to ensure the GPU is always working.
+데이터 사전 적재는 CPU에서 데이터를 미리 준비해 GPU가 지속적으로 작업하도록 하여 유휴 시간을 줄이고 활용도를 높입니다. 이를 위해 두 가지 방법이 있습니다.
 
-1. Allocate pinned memory on the CPU to store the data and transfer it directly to the GPU.
-2. Increase the number of CPU threads or workers to preload the data faster.
+1. CPU에 핀 메모리를 할당해 데이터를 직접 GPU로 전송합니다.
+2. CPU 스레드나 작업자(worker) 수를 늘려 데이터를 더 빠르게 적재합니다.
 
-Configure [`~TrainingArguments.dataloader_pin_memory`] and [`~TrainingArguments.dataloader_num_workers`] in [`TrainingArguments`] to allocate pinned memory and increase the number of workers.
+[TrainingArguments]의 [~TrainingArguments.dataloader_pin_memory]와 [~TrainingArguments.dataloader_num_workers]를 설정해 핀 메모리와 작업자 수를 조절하세요.
 
-```py
+```python
 from transformers import TrainingArguments
 
 args = TrainingArguments(
@@ -216,17 +202,17 @@ args = TrainingArguments(
 )
 ```
 
-## PyTorch
+## PyTorch[[pytorch]]
 
-PyTorch provides several features for reducing memory requirements and increasing training speed. These features can often be enabled in Transformers by only adding a few lines of code.
+PyTorch는 메모리 사용량을 줄이고 학습 속도를 높이는 여러 기능을 제공합니다. 이들은 Transformers에서 몇 줄의 코드만 추가해도 사용할 수 있습니다.
 
-### torch.empty_cache_steps
+### torch.empty_cache_steps[[torch-empty-cache-steps]]
 
-The [torch.cuda.empty_cache](https://pytorch.org/docs/stable/generated/torch.cuda.empty_cache.html#torch.cuda.empty_cache) function releases unused cached memory, which can help avoid out-of-memory (OOM) errors at the cost of ~10% slower training.
+torch.cuda.empty_cache 함수는 사용하지 않는 캐시 메모리를 해제해 OOM(메모리 부족) 오류를 방지하지만 학습 속도가 약 10% 느려집니다.
 
-Use [torch_empty_cache_steps()](https://huggingface.co/docs/transformers/main_classes/trainer#transformers.TrainingArguments.torch_empty_cache_steps) in [`TrainingArguments`] to enable it after a certain number of training steps.
+[TrainingArguments]의 torch_empty_cache_steps()를 설정해 일정 스텝마다 실행하도록 설정하세요.
 
-```py
+```python
 from transformers import TrainingArguments
 
 args = TrainingArguments(
@@ -241,13 +227,13 @@ args = TrainingArguments(
 )
 ```
 
-### torch.compile
+### torch.compile[[torch-compile]]
 
-[torch.compile](https://pytorch.org/tutorials/intermediate/torch_compile_tutorial.html) compiles PyTorch code into optimized kernels that significantly speed up training. This feature relies on TorchDynamo to capture PyTorch graphs with the Frame Evaluation API. The graph can be further compiled into optimized kernels for different backends.
+torch.compile은 PyTorch 코드를 최적화된 커널로 컴파일해 학습 속도를 높입니다. TorchDynamo가 Frame Evaluation API를 이용해 그래프를 캡처한 뒤, 이를 백엔드별 최적화된 커널로 컴파일합니다.
 
-Configure [`~TrainingArguments.torch_compile`] in [`TrainingArguments`] to enable it, and configure [torch_compile_backend()](https://huggingface.co/docs/transformers/main_classes/trainer#transformers.TrainingArguments.torch_compile_backend) to select a backend to use.
+[TrainingArguments]의 [~TrainingArguments.torch_compile]를 설정해 활성화하고, torch_compile_backend()로 백엔드를 선택하세요.
 
-```py
+```python
 from transformers import TrainingArguments
 
 args = TrainingArguments(
@@ -264,32 +250,34 @@ args = TrainingArguments(
 )
 ```
 
-Refer to the table below to help you choose the right backend for your training scenario.
+아래 표를 참고해 학습 상황에 맞는 백엔드를 선택하세요.
+
+| 백엔드         | 설명                                                                                                                         | 목적         |
+| -------------- | ---------------------------------------------------------------------------------------------------------------------------- | ------------ |
+| eager          | PyTorch로 GraphModule 실행                                                                                                   | 디버깅       |
+| aot_eager      | AOTAutograd 그래프를 PyTorch eager 모드로 실행                                                                               | 디버깅       |
+| inductor       | TorchInductor와 CUDA Graphs 사용                                                                                             | 학습 및 추론 |
+| nvfuser        | nvFuser와 TorchScript 사용                                                                                                   | 학습 및 추론 |
+| aot_nvfuser    | AOTAutograd와 nvFuser 사용                                                                                                   | 학습 및 추론 |
+| aot_cudagraphs | AOTAutograd와 CUDA Graphs 사용                                                                                               | 학습 및 추론 |
+| ofi            | TorchScript의 [optimize_for_inference](https://pytorch.org/docs/stable/generated/torch.jit.optimize_for_inference.html) 사용 | 추론         |
+| fx2trt         | [Torch-TensorRT](https://pytorch.org/TensorRT/tutorials/getting_started_with_fx_path.html) 사용                              | 추론         |
+| onnxrt         | [ONNX-RT](https://onnxruntime.ai/) 사용                                                                                      | 추론         |
+| ipex           | [IPEX](https://github.com/intel/intel-extension-for-pytorch) 사용                                                            | 추론         |
+
+### 스케일드 닷 프로덕션 어텐션[[scaled-dot-production-attention]]
 
-| backend | description | goal |
-|---|---|---|
-| eager | uses PyTorch to run extracted GraphModule | debugging |
-| aot_eager | uses PyTorch eager mode for AOTAutograd's extracted forward and backward graphs | debugging |
-| inductor | uses TorchInductor with AOTAutograd and CUDA Graphs by leveraging Triton kernels | training and inference |
-| nvfuser | uses nvFuser with TorchScript | training and inference |
-| aot_nvfuser | uses nvFuser with AOTAutograd | training and inference |
-| aot_cudagraphs | uses CUDA Graphs with AOTAutograd | training and inference |
-| ofi | uses TorchScripts [optimize_for_inference](https://pytorch.org/docs/stable/generated/torch.jit.optimize_for_inference.html#torch-jit-optimize-for-inference) | inference |
-| fx2trt | uses [Torch-TensorRT](https://pytorch.org/TensorRT/tutorials/getting_started_with_fx_path.html) | inference |
-| onnxrt | uses [ONNX-RT](https://onnxruntime.ai/) for CPU and GPU inference | inference |
-| ipex | uses [IPEX](https://github.com/intel/intel-extension-for-pytorch) for CPU inference | inference |
+torch.nn.functional.scaled_dot_product_attention (SDPA)는 스케일드 닷 프로덕션 어텐션 메커니즘의 PyTorch 네이티브 구현입니다. SDPA는 기존 어텐션 메커니즘보다 효율적이고 최적화되어 있으며, 세 가지 구현을 지원합니다.
 
-### Scaled dot production attention
+FlashAttention2: fp16 또는 bf16으로 모델을 캐스팅하면 자동으로 활성화됩니다.
 
-[torch.nn.functional.scaled_dot_product_attention](https://pytorch.org/docs/stable/generated/torch.nn.functional.scaled_dot_product_attention.html) (SDPA) is a native PyTorch implementation of the scaled dot product attention mechanism. SDPA is more efficient and optimized than the original attention mechanism in transformer models. It supports three types of scaled dot product attention.
+xFormers 또는 메모리 효율적인 어텐션: fp32를 지원합니다.
 
-- [FlashAttention2](https://github.com/Dao-AILab/flash-attention) is automatically enabled for models with the fp16 or bf16 torch type. Make sure to cast your model to the appropriate type first.
-- [xFormers](https://github.com/facebookresearch/xformers) or Memory-Efficient Attention supports models with the fp32 torch type.
-- C++ implementation of scaled dot product attention.
+C++로 구현된 스케일드 닷 프로덕션 어텐션.
 
-SDPA is enabled by default for PyTorch 2.1.1+, but it can be explicitly enabled by setting `attn_implementation="sdpa"` in [`~PreTrainedModel.from_pretrained`].
+PyTorch 2.1.1+에서는 기본으로 활성화되어 있지만, [~PreTrainedModel.from_pretrained]에서 attn_implementation="sdpa"로 명시할 수도 있습니다.
 
-```py
+```python
 from transformers import AutoModelForCausalLM
 
 model = AutoModelForCausalLM.from_pretrained("meta-llama/Llama-3.1-8B", device_map="auto", attn_implementation="sdpa")

From 41834e42c99ede07bc22b3dbda9b5a01d04c9c57 Mon Sep 17 00:00:00 2001
From: D15M4S <dismas5184@gmail.com>
Date: Mon, 21 Jul 2025 20:28:18 +0900
Subject: [PATCH 3/8] fix: manual edits

---
 docs/source/ko/perf_train_gpu_one.md | 192 ++++++++++++++-------------
 1 file changed, 101 insertions(+), 91 deletions(-)

diff --git a/docs/source/ko/perf_train_gpu_one.md b/docs/source/ko/perf_train_gpu_one.md
index a484aaa1b991..9ace85bdc8a3 100644
--- a/docs/source/ko/perf_train_gpu_one.md
+++ b/docs/source/ko/perf_train_gpu_one.md
@@ -13,37 +13,37 @@ rendered properly in your Markdown viewer.
 
 -->
 
-# GPU[[gpu]]
+# GPU [[gpu]]
 
-GPU는 높은 메모리 대역폭과 병렬 처리 능력 덕분에 딥러닝 모델 학습에 널리 사용됩니다. GPU와 모델 크기에 따라 수십억 개의 매개변수를 가진 모델도 학습할 수 있습니다. 핵심은 GPU 메모리 사용량(데이터 처리량/학습 시간)과 학습 속도 간의 균형을 잘 맞추는 것입니다.
+GPU는 높은 메모리 대역폭과 병렬 처리 능력 덕분에 딥러닝 모델 학습에 널리 사용됩니다. GPU 사양과 모델 크기에 따라 수십억 개 매개변수를 가진 모델도 학습할 수 있습니다. 핵심은 GPU 메모리 활용도(데이터 처리량/학습 시간)와 학습 속도 사이에서 최적의 균형을 찾는 것입니다.
 
-이 가이드는 GPU에서 효율적으로 모델을 학습하기 위해 Transformers와 PyTorch에서 제공하는 기능을 설명합니다. 대부분의 경우, 여러 기능을 함께 사용해 학습을 최적화하는 것이 좋습니다.
+이 가이드는 Transformers와 PyTorch에서 GPU에서 모델을 효율적으로 학습하기 위해 제공하는 기능을 소개합니다. 대부분 이러한 기능을 조합하여 학습을 최적화하는 것이 좋습니다.
 
-아래 표를 참고하시면 학습 상황에 적합한 기능을 빠르게 확인하실 수 있습니다.
+아래 표를 참고하면 자신의 학습 시나리오에 적합한 기능을 빠르게 파악할 수 있습니다.
 
-| 기능                              | 학습 속도 | 메모리 사용 |
-| --------------------------------- | --------- | ----------- |
-| 배치 크기                         | 예        | 예          |
-| 그래디언트 누적                   | 아니요    | 예          |
-| 그래디언트 체크포인팅             | 아니요    | 예          |
-| 혼합 정밀도                       | 예        | 경우에 따라 |
-| 옵티마이저                        | 예        | 예          |
-| 데이터 사전 적재                  | 예        | 아니요      |
-| torch_empty_cache_steps           | 아니요    | 예          |
-| torch.compile                     | 예        | 아니요      |
-| 스케일드 닷 프로덕션 어텐션(SDPA) | 예        | 예          |
+| 기능                        | 학습 속도 | 메모리 사용량 |
+| --------------------------- | --------- | ------------- |
+| 배치 크기                   | 예        | 예            |
+| 그레이디언트 누적           | 아니요    | 예            |
+| 그레이디언트 체크포인팅     | 아니요    | 예            |
+| 혼합 정밀도                 | 예        | 조건부        |
+| 옵티마이저                  | 예        | 예            |
+| 데이터 사전 적재            | 예        | 아니요        |
+| torch_empty_cache_steps     | 아니요    | 예            |
+| torch.compile               | 예        | 아니요        |
+| 스케일된 내적 어텐션 (SDPA) | 예        | 예            |
 
 ## Trainer[[trainer]]
 
-[Trainer](./trainer)는 [`TrainingArguments`]를 통해 설정할 수 있는 다양한 유용한 학습 기능을 지원합니다. 이 섹션에서는 학습을 최적화하는 데 특히 중요한 기능들을 소개합니다.
+Trainer는 [TrainingArguments]로 설정할 수 있는 다양한 학습 기능을 제공합니다. 이번 섹션에서는 학습 최적화에 특히 유용한 주요 기능 몇 가지를 살펴봅니다.
 
 ### 배치 크기[[batch-size]]
 
-배치 크기는 GPU 학습 효율을 결정하는 가장 중요한 하이퍼파라미터 중 하나로, 메모리 사용량과 학습 속도에 영향을 줍니다. 더 큰 배치 크기는 GPU의 병렬 처리 능력을 활용하기 때문에 더 빠른 학습을 제공합니다. 8, 64, 128, 256, 512처럼 2의 거듭제곱 크기를 사용하는 것이 권장됩니다. 배치 크기는 GPU와 모델의 데이터 타입에 따라 달라집니다.
+배치 크기는 GPU 학습 효율을 좌우하는 가장 중요한 하이퍼파라미터 중 하나로, 메모리 사용량과 학습 속도에 직접적인 영향을 줍니다. 배치 크기를 크게 하면 GPU의 병렬 처리 능력을 극대화하여 학습 속도를 높일 수 있습니다. 일반적으로 8, 64, 128, 256, 512처럼 2의 거듭제곱 값을 사용하는 것이 좋습니다. 적절한 배치 크기는 GPU 사양과 모델의 데이터 타입에 따라 달라집니다.
 
-[`TrainingArguments`]의 [`~TrainingArguments.per_device_train_batch_size`]를 설정하세요.
+배치 크기는 [TrainingArguments]의 [~TrainingArguments.per_device_train_batch_size] 옵션으로 설정합니다.
 
-```python
+```py
 from transformers import TrainingArguments
 
 args = TrainingArguments(
@@ -52,35 +52,42 @@ args = TrainingArguments(
 )
 ```
 
-입력 피처, 출력 뉴런 수, 배치 크기가 성능에 어떻게 영향을 미치는지에 대해서는 NVIDIA Performance 가이드를 참고하세요. 이는 GPU가 수행하는 일반 행렬 곱(GEMM)과 관련이 있으며, 매개변수가 클수록 병렬화와 효율성이 높아집니다.
+입력 피처 수와 출력 뉴런 수, 배치 크기가 성능에 미치는 영향에 대해서는 NVIDIA [Performance](https://docs.nvidia.com/deeplearning/performance/dl-performance-fully-connected/index.html#input-features) 가이드를 참고하세요. 이 매개변수들은 GPU에서 실행되는 General Matrix Multiplications(GEMMs)에 사용됩니다. 매개변수가 클수록 병렬화와 효율성이 향상됩니다.
 
-Tensore Core Requirements도 데이터 타입과 GPU에 따라 텐서 곱셈 속도를 극대화할 수 있는 배치 크기 선택에 유용합니다. 예를 들어, fp16에서는 8의 배수가, A100 GPU에서는 64의 배수가 권장됩니다.
+데이터 타입과 GPU에 따른 최적의 배치 크기를 선택해 텐서 곱셈 속도를 극대화하려면, [Tensor Core Requirements](https://docs.nvidia.com/deeplearning/performance/dl-performance-matrix-multiplication/index.html#requirements-tc) 섹션을 참고하는 것이 유용합니다. 그 예시로, fp16에서는 8의 배수가 권장되지만, A100 GPU에서는 64의 배수가 더 적합하다는 사실을 확인할 수 있습니다.
 
-마지막으로, 작은 매개변수에서는 Dimension Quantization Effects를 고려하세요. 행렬 차원이 GPU의 스레드 블록 타일 크기로 나누어떨어지지 않으면 GPU 자원이 충분히 활용되지 못할 수 있습니다. 적절한 배치 크기를 선택해 행렬이 타일 크기로 나누어지도록 하면 학습 속도를 크게 높일 수 있습니다.
+마지막으로, 작은 매개변수를 사용할 때는 [Dimension Quantization Effects](https://docs.nvidia.com/deeplearning/performance/dl-performance-matrix-multiplication/index.html#dim-quantization)를 고려하세요. 행렬 차원이 GPU 스레드 블록의 타일 크기로 나누어지지 않으면 타일 양자화가 발생하여 GPU 자원을 충분히 활용하지 못합니다. 행렬이 타일 크기로 정확히 나뉘도록 올바른 배치 크기 배수를 선택하면 학습 속도가 크게 향상됩니다.
 
-### 그래디언트 누적[[gradient-accumulation]]
+### 그레이디언트 누적[[gradient-accumulation]]
 
-그래디언트 누적은 메모리 제약을 극복해 단일 GPU에 맞지 않는 큰 모델을 학습할 수 있게 합니다. 여러 미니 배치에서 그래디언트를 누적한 뒤 파라미터를 업데이트하기 때문에 메모리를 절약하고 더 큰 유효 배치 크기로 학습할 수 있습니다. 다만, 추가적인 순전파 및 역전파가 필요하므로 학습 속도는 느려질 수 있습니다.
+그레이디언트 누적은 메모리 제약을 극복하는 방법으로, 단일 GPU에 맞지 않는 매우 큰 모델을 학습할 때 유용합니다. 이는 매개변수를 업데이트하기 전에 여러 미니 배치에 걸쳐 그레이디언트를 누적하는 방식입니다. 그 결과, 저장해야 하는 그레이디언트 수가 줄어 메모리 사용량이 줄어들고, 일반적으로 하나의 배치에서만 매개변수를 갱신하는 방식보다 더 큰 유효 배치 크기로 학습할 수 있습니다. 다만, 추가적인 순전파와 역전파가 필요하기 때문에 학습 속도가 느려질 수 있습니다.
 
-[TrainingArguments]의 [~TrainingArguments.per_device_train_batch_size]를 설정해 활성화하세요.
+그레이디언트 누적을 활성화하려면 [`TrainingArguments`]에서 [`~TrainingArguments.gradient_accumulation_steps`] 옵션을 설정하세요. 유효 배치 크기는 [`~TrainingArguments.per_device_train_batch_size`]에 이 값을 곱한 값입니다.
 
-```python
+```py
 from transformers import TrainingArguments
 
-# 유효 배치 크기 64
+# 유효 배치 크기: 4 * 16 = 64
 args = TrainingArguments(
     per_device_train_batch_size=4,
     gradient_accumulation_steps=16,
 )
 ```
 
-### 그래디언트 체크포인팅[[gradient-checkpointing]]
+학습 속도가 느려질 수 있기 때문에 그레이디언트 누적 단계를 너무 크게 설정하지 않는 것이 좋습니다. 아래 예시처럼 GPU에 담을 수 있는 최대 배치 크기가 4라면, GPU를 효율적으로 사용하기 위해 배치 크기를 4로 유지하는 것이 좋습니다.
 
-그래디언트 체크포인팅은 역전파 시 일부 중간 활성화만 저장하고 나머지는 다시 계산해 메모리 사용량을 줄입니다. 이렇게 하면 순전파의 모든 중간 활성화를 저장하지 않아도 되어 메모리 오버헤드를 줄일 수 있습니다. 단, 약 20% 정도 학습 속도가 느려집니다.
+| 배치 크기 | 그레이디언트 누적 단계 | 유효 배치 크기 |     |
+| --------- | ---------------------- | -------------- | --- |
+| 1         | 64                     | 64             | 👎  |
+| 4         | 16                     | 64             | 👍  |
 
-[TrainingArguments]의 [~TrainingArguments.gradient_checkpointing]를 설정해 활성화하세요.
+### 그레이디언트 체크포인팅[[gradient-checkpointing]]
 
-```python
+그레이디언트 체크포인팅은 역전파 과정에서 일부 중간 활성화 값만 저장하고 나머지는 다시 계산해 메모리 사용량을 줄입니다. 이를 통해 순전파 과정에서 모든 중간 활성화 값을 저장하지 않아도 되어 메모리 오버헤드를 크게 줄일 수 있습니다. 다만, 학습 속도가 약 20% 느려지는 한계가 있습니다.
+
+그레이디언트 체크포인팅을 활성화하려면 [`TrainingArguments`]에서 [`~TrainingArguments.gradient_checkpointing`] 옵션을 설정하세요.
+
+```py
 from transformers import TrainingArguments
 
 args = TrainingArguments(
@@ -92,16 +99,18 @@ args = TrainingArguments(
 
 ### 혼합 정밀도[[mixed-precision]]
 
-혼합 정밀도는 일부 연산을 반정밀도(fp16)로, 일부를 전정밀도(fp32)로 처리해 학습 속도를 높입니다. 반정밀도 연산은 전정밀도에 비해 연산 비용이 적어 빠르고, 일부를 전정밀도로 유지해 정확도를 보장합니다.
+혼합 정밀도는 일부 계산을 반정밀도(fp16)로, 나머지를 전정밀도(fp32)로 수행해 학습 속도를 높이는 기법입니다. 반정밀도 계산은 전정밀도보다 계산량이 적어 더 빠르게 수행됩니다. 한편, 전정밀도로 일부 계산을 수행하면 정확도를 유지할 수 있습니다.
 
-혼합 정밀도 학습에는 여러 데이터 타입이 있습니다.
+혼합 정밀도 학습을 위해 여러 자료형을 사용할 수 있습니다.
 
-<hfoptions id="mixed-precision"> <hfoption id="fp16">
-혼합 정밀도 학습의 주요 이점은 활성화를 fp16으로 저장하는 것입니다.
+<hfoptions id="mixed-precision">
+<hfoption id="fp16">
 
-[TrainingArguments]의 [~TrainingArguments.fp16]를 설정해 fp16 데이터 타입으로 혼합 정밀도 학습을 활성화하세요.
+혼합 정밀도 학습의 주요 장점은 활성화 값을 fp16으로 저장할 수 있다는 것입니다.
 
-```python
+fp16 자료형으로 혼합 정밀도 학습을 활성화하려면 [`TrainingArguments`]에서 [`~TrainingArguments.fp16`] 옵션을 설정하세요.
+
+```py
 from transformers import TrainingArguments
 
 args = TrainingArguments(
@@ -112,14 +121,15 @@ args = TrainingArguments(
 )
 ```
 
-fp16은 메모리 최적화가 되지 않을 수 있습니다. 최적화 단계에서 fp16으로 계산된 그래디언트가 fp32로 변환되므로 특히 작은 배치 크기에서는 GPU 메모리가 더 많이 사용될 수 있습니다.
+fp16은 메모리 사용에 최적화된 방식이 아닙니다. 이는 fp16으로 계산된 그레이디언트가 최적화 단계에서 fp32로 다시 변환되기 때문입니다. 특히 배치 크기가 작을 때는, GPU에 두 가지 자료형(fp16, fp32)이 적재되어 있기 때문에 더 많은 GPU 메모리를 사용하게 됩니다.
+</hfoption>
+<hfoption id="bf16">
 
-</hfoption> <hfoption id="bf16">
-bf16은 일부 정밀도를 희생해 더 넓은 동적 범위를 제공하여 오버플로/언더플로 오류를 방지합니다. fp16과 달리 손실 스케일링 없이 사용할 수 있습니다. bf16은 NVIDIA Ampere 이상의 아키텍처에서 지원됩니다.
+[bf16](https://cloud.google.com/blog/products/ai-machine-learning/bfloat16-the-secret-to-high-performance-on-cloud-tpus)은 일부 정밀도를 포기하는 대신, 훨씬 더 넓은 동적 범위를 제공하여 오버플로와 언더플로 오류를 방지하는 데 도움이 됩니다. bf16은 fp16과 달리 손실 스케일링 기법을 추가하지 않고도 사용할 수 있습니다. bf16은 NVIDIA의 Ampere 아키텍처 이상에서 지원됩니다.
 
-[TrainingArguments]의 [~TrainingArguments.bf16]를 설정해 bf16 데이터 타입으로 혼합 정밀도 학습을 활성화하세요.
+bf16 자료형으로 혼합 정밀도 학습을 활성화하려면 [`TrainingArguments`]에서 [`~TrainingArguments.bf16`] 옵션을 설정하세요.
 
-```python
+```py
 from transformers import TrainingArguments
 
 args = TrainingArguments(
@@ -130,20 +140,22 @@ args = TrainingArguments(
 )
 ```
 
-</hfoption> <hfoption id="tf32">
-tf32는 NVIDIA Ampere GPU에서 합성곱과 행렬 곱 입력을 tf32로 변환합니다. 나머지 연산과 저장은 fp32로 유지됩니다. tf32는 fp32와 동일한 범위를 유지하면서 fp16의 정밀도를 제공하고 bf16보다 정밀합니다. tf32를 fp16이나 bf16과 혼합하면 처리량을 최대 16배까지 높일 수 있습니다.
+</hfoption>
+<hfoption id="tf32">
+
+[tf32](https://blogs.nvidia.com/blog/tensorfloat-32-precision-format/)는 NVIDIA Ampere GPU에서 합성곱과 행렬곱 입력을 tf32로 변환하는 모드입니다. 다른 모든 저장과 연산은 fp32로 유지됩니다. 이를 통해 tf32는 fp32와 동일한 범위, fp16과 동일한 정밀도, 그리고 bf16보다 더 높은 정밀도를 유지할 수 있습니다. tf32를 fp16 또는 bf16 혼합 정밀도 학습과 결합하면 처리량을 16배 향상할 수 있습니다.
 
-NVIDIA Ampere GPU에서는 기본적으로 활성화되어 있지만 아래와 같이 명시적으로 설정할 수도 있습니다.
+tf32는 NVIDIA Ampere GPU에서 기본적으로 활성화되어 있지만, fp32 학습 또는 추론 코드에 아래 코드를 추가하여 명시적으로 활성화할 수도 있습니다.
 
-```python
+```py
 import torch
 torch.backends.cuda.matmul.allow_tf32 = True
 torch.backends.cudnn.allow_tf32 = True
 ```
 
-[TrainingArguments]의 tf32()를 설정해 tf32 모드로 혼합 정밀도 학습을 활성화하세요.
+tf32 모드에서 혼합 정밀도 학습을 활성화하려면 [`TrainingArguments`]에서 [tf32()](https://huggingface.co/docs/transformers/main_classes/trainer#transformers.TrainingArguments.tf32) 옵션을 설정하세요.
 
-```python
+```py
 from transformers import TrainingArguments
 
 args = TrainingArguments(
@@ -155,15 +167,16 @@ args = TrainingArguments(
 )
 ```
 
-</hfoption> </hfoptions>
+</hfoption>
+</hfoptions>
 
 ### 옵티마이저[[optimizers]]
 
-Transformers는 기본적으로 PyTorch의 AdamW (adamw_torch) 옵티마이저를 사용합니다. 과거 그래디언트의 가중 평균을 저장하기 때문에 모델 매개변수 수에 비례해 메모리가 추가로 필요합니다. 큰 모델을 학습할 때는 다른 옵티마이저를 고려하세요. 예를 들어, Apex를 설치했다면 adamw_apex_fused를 사용해 가장 빠른 학습 속도를 얻을 수 있습니다.
+Transformers는 기본적으로 PyTorch의 [AdamW (adamw_torch)](https://pytorch.org/docs/stable/generated/torch.optim.AdamW.html) 옵티마이저를 사용합니다. 하지만 이 옵티마이저는 과거 그레이디언트의 가중 평균을 저장하기 때문에, 모델 매개변수 수에 비례하는 추가 메모리가 필요합니다. 이는 매우 큰 모델을 학습할 때 문제가 될 수 있으며, 이 경우 다른 옵티마이저를 고려해야 합니다. 예를 들어, [NVIDIA](https://github.com/NVIDIA/apex) 또는 [AMD](https://github.com/ROCm/apex) 환경에 [Apex](https://nvidia.github.io/apex/index.html)가 설치되어 있다면, `adamw_apex_fused` 옵티마이저를 사용할 때 모든 AdamW 옵티마이저 중 가장 빠른 학습 속도를 얻을 수 있습니다.
 
-[TrainingArguments]의 [~TrainingArguments.optim]을 설정해 옵티마이저를 선택하세요.
+옵티마이저를 선택하기 위해서는 [`TrainingArguments`]에서 [`~TrainingArguments.optim`] 옵션을 설정하세요.
 
-```python
+```py
 from transformers import TrainingArguments
 
 args = TrainingArguments(
@@ -174,21 +187,20 @@ args = TrainingArguments(
     optim="adamw_bnb_8bit"
 )
 ```
+학습 시나리오에 맞게 선택할 수 있는 다양한 옵티마이저가 있습니다(전체 지원 목록은 [OptimizerNames](https://github.com/huggingface/transformers/blob/34f4080ff59b1668d919a1ba9f8bc4a3a2a3f478/src/transformers/training_args.py#L145)를 참고하세요). 예를 들어 Adafactor는 행렬의 각 요소 대신 행 또는 열 단위의 가중 평균만 저장해 메모리 요구량을 크게 줄일 수 있지만, 수렴 속도는 느려질 수 있습니다. 또 다른 예로, bitsandbytes의 [8-bit AdamW optimizer](https://huggingface.co/docs/bitsandbytes)를 사용하면 옵티마이저의 상태를 8비트로 양자화할 수 있습니다. 옵티마이저 상태는 낮은 정밀도로 저장되었다가 옵티마이저 단계에서 사용되기 전에 역양자화됩니다.
 
-학습 시나리오에 따라 다양한 옵티마이저가 있습니다. (전체 목록은 OptimizerNames를 참고하세요.) 예를 들어, Adafactor는 메모리를 절약하지만 수렴 속도가 느립니다. 또 다른 예로 8-bit AdamW는 옵티마이저 상태를 양자화해 메모리를 줄입니다.
-
-더 많은 특화된 옵티마이저에 대해서는 optimizer 가이드를 참고하세요.
+특화된 옵티마이저에 대해 더 알고 싶다면 [optimizer](./optimizers) 가이드를 참고하세요.
 
 ### 데이터 사전 적재[[data-preloading]]
 
-데이터 사전 적재는 CPU에서 데이터를 미리 준비해 GPU가 지속적으로 작업하도록 하여 유휴 시간을 줄이고 활용도를 높입니다. 이를 위해 두 가지 방법이 있습니다.
+데이터 사전 적재는 GPU가 지속적으로 작업할 수 있도록 CPU에서 미리 배치 단위의 데이터를 적재하고 준비하는 기능입니다. 이를 통해 GPU 유휴 시간을 줄이고 활용도를 높일 수 있습니다. GPU가 항상 작업을 계속하도록 하려면 다음 데이터 사전 적재를 위한 두 가지 방법을 사용할 수 있습니다.
 
-1. CPU에 핀 메모리를 할당해 데이터를 직접 GPU로 전송합니다.
-2. CPU 스레드나 작업자(worker) 수를 늘려 데이터를 더 빠르게 적재합니다.
+1. 데이터를 저장할 고정 메모리를 CPU에 할당한 뒤, 이를 GPU로 직접 전송합니다.
+2. CPU 스레드 또는 워커 수를 늘려 데이터를 더 빠르게 사전 적재합니다.
 
-[TrainingArguments]의 [~TrainingArguments.dataloader_pin_memory]와 [~TrainingArguments.dataloader_num_workers]를 설정해 핀 메모리와 작업자 수를 조절하세요.
+고정 메모리를 할당하고 워커 수를 늘리기 위해서는 [`TrainingArguments`]에서 [`~TrainingArguments.dataloader_pin_memory`]와 [`~TrainingArguments.dataloader_num_workers`] 옵션을 설정하세요.
 
-```python
+```py
 from transformers import TrainingArguments
 
 args = TrainingArguments(
@@ -204,15 +216,15 @@ args = TrainingArguments(
 
 ## PyTorch[[pytorch]]
 
-PyTorch는 메모리 사용량을 줄이고 학습 속도를 높이는 여러 기능을 제공합니다. 이들은 Transformers에서 몇 줄의 코드만 추가해도 사용할 수 있습니다.
+PyTorch는 메모리 요구사항을 줄이고 학습 속도를 높이기 위한 여러 기능을 제공합니다. 이러한 기능들은 Transformers에서 몇 줄의 코드만 추가하여 활성화할 수 있습니다.
 
-### torch.empty_cache_steps[[torch-empty-cache-steps]]
+### torch.empty_cache_steps[[torchemptycachesteps]]
 
-torch.cuda.empty_cache 함수는 사용하지 않는 캐시 메모리를 해제해 OOM(메모리 부족) 오류를 방지하지만 학습 속도가 약 10% 느려집니다.
+[torch.cuda.empty_cache](https://pytorch.org/docs/stable/generated/torch.cuda.empty_cache.html#torch.cuda.empty_cache) 함수는 사용하지 않는 캐시 메모리를 해제하여 메모리 부족(OOM) 오류를 방지할 수 있지만, 학습 속도가 약 10% 느려질 수 있습니다.
 
-[TrainingArguments]의 torch_empty_cache_steps()를 설정해 일정 스텝마다 실행하도록 설정하세요.
+일정한 학습 스텝마다 이 함수를 호출하려면, [`TrainingArguments`]에서 [torch_empty_cache_steps()](https://huggingface.co/docs/transformers/main_classes/trainer#transformers.TrainingArguments.torch_empty_cache_steps)를 설정하세요.
 
-```python
+```py
 from transformers import TrainingArguments
 
 args = TrainingArguments(
@@ -227,13 +239,13 @@ args = TrainingArguments(
 )
 ```
 
-### torch.compile[[torch-compile]]
+### torch.compile[[torchcompile]]
 
-torch.compile은 PyTorch 코드를 최적화된 커널로 컴파일해 학습 속도를 높입니다. TorchDynamo가 Frame Evaluation API를 이용해 그래프를 캡처한 뒤, 이를 백엔드별 최적화된 커널로 컴파일합니다.
+[torch.compile](https://pytorch.org/tutorials/intermediate/torch_compile_tutorial.html)은 PyTorch 코드를 최적화된 커널로 컴파일해 학습 속도를 크게 높여줍니다. 이 기능은 TorchDynamo가 프레임 평가 API(Frame Evaluation API)를 이용해 PyTorch 그래프를 캡처하는 방식으로 동작하며, 캡처한 그래프는 백엔드에 따라 최적화된 커널로 추가 컴파일할 수 있습니다.
 
-[TrainingArguments]의 [~TrainingArguments.torch_compile]를 설정해 활성화하고, torch_compile_backend()로 백엔드를 선택하세요.
+이를 활성화하려면 [`TrainingArguments`]에서 [`~TrainingArguments.torch_compile`]를 설정하세요. 백엔드는 [torch_compile_backend()](https://huggingface.co/docs/transformers/main_classes/trainer#transformers.TrainingArguments.torch_compile_backend)를 통해 선택할 수 있습니다.
 
-```python
+```py
 from transformers import TrainingArguments
 
 args = TrainingArguments(
@@ -250,34 +262,32 @@ args = TrainingArguments(
 )
 ```
 
-아래 표를 참고해 학습 상황에 맞는 백엔드를 선택하세요.
-
-| 백엔드         | 설명                                                                                                                         | 목적         |
-| -------------- | ---------------------------------------------------------------------------------------------------------------------------- | ------------ |
-| eager          | PyTorch로 GraphModule 실행                                                                                                   | 디버깅       |
-| aot_eager      | AOTAutograd 그래프를 PyTorch eager 모드로 실행                                                                               | 디버깅       |
-| inductor       | TorchInductor와 CUDA Graphs 사용                                                                                             | 학습 및 추론 |
-| nvfuser        | nvFuser와 TorchScript 사용                                                                                                   | 학습 및 추론 |
-| aot_nvfuser    | AOTAutograd와 nvFuser 사용                                                                                                   | 학습 및 추론 |
-| aot_cudagraphs | AOTAutograd와 CUDA Graphs 사용                                                                                               | 학습 및 추론 |
-| ofi            | TorchScript의 [optimize_for_inference](https://pytorch.org/docs/stable/generated/torch.jit.optimize_for_inference.html) 사용 | 추론         |
-| fx2trt         | [Torch-TensorRT](https://pytorch.org/TensorRT/tutorials/getting_started_with_fx_path.html) 사용                              | 추론         |
-| onnxrt         | [ONNX-RT](https://onnxruntime.ai/) 사용                                                                                      | 추론         |
-| ipex           | [IPEX](https://github.com/intel/intel-extension-for-pytorch) 사용                                                            | 추론         |
-
-### 스케일드 닷 프로덕션 어텐션[[scaled-dot-production-attention]]
+아래 표를 참고하여 학습 시나리오에 적합한 백엔드를 선택하세요.
 
-torch.nn.functional.scaled_dot_product_attention (SDPA)는 스케일드 닷 프로덕션 어텐션 메커니즘의 PyTorch 네이티브 구현입니다. SDPA는 기존 어텐션 메커니즘보다 효율적이고 최적화되어 있으며, 세 가지 구현을 지원합니다.
+| 백엔드         | 설명                                                                                                                                                                   | 목표         |
+| -------------- | ---------------------------------------------------------------------------------------------------------------------------------------------------------------------- | ------------ |
+| eager          | PyTorch를 사용해 추출된 GraphModule을 실행합니다                                                                                                                       | 디버깅       |
+| aot_eager      | AOTAutograd로 추출된 순전파 및 역전파 그래프를 PyTorch eager 모드로 실행합니다                                                                                         | 디버깅       |
+| inductor       | Triton 커널을 활용하는 TorchInductor와 AOTAutograd, CUDA Graphs를 사용합니다                                                                                           | 학습 및 추론 |
+| nvfuser        | TorchScript와 함께 nvFuser를 사용합니다                                                                                                                                | 학습 및 추론 |
+| aot_nvfuser    | AOTAutograd와 함께 nvFuser를 사용합니다                                                                                                                                | 학습 및 추론 |
+| aot_cudagraphs | AOTAutograd와 함께 CUDA Graphs를 사용합니다                                                                                                                            | 학습 및 추론 |
+| ofi            | TorchScript의 [optimize_for_inference](https://pytorch.org/docs/stable/generated/torch.jit.optimize_for_inference.html#torch-jit-optimize-for-inference)를 사용합니다  | 추론         |
+| fx2trt         | [Torch-TensorRT](https://pytorch.org/TensorRT/tutorials/getting_started_with_fx_path.html)를 사용합니다                                                                | 추론         |
+| onnxrt         | CPU 및 GPU 추론을 위해 [ONNX-RT](https://onnxruntime.ai/)를 사용합니다                                                                                                 | 추론         |
+| ipex           | CPU 추론을 위해 [IPEX](https://github.com/intel/intel-extension-for-pytorch)를 사용합니다                                                                              | 추론         |
 
-FlashAttention2: fp16 또는 bf16으로 모델을 캐스팅하면 자동으로 활성화됩니다.
+### 스케일된 내적 어텐션[[scaled-dot-production-attention]]
 
-xFormers 또는 메모리 효율적인 어텐션: fp32를 지원합니다.
+[torch.nn.functional.scaled_dot_product_attention](https://pytorch.org/docs/stable/generated/torch.nn.functional.scaled_dot_product_attention.html) (SDPA)는 스케일된 내적 어텐션 메커니즘을 PyTorch에 내장해 구현한 함수입니다. SDPA는 트랜스포머 모델의 기존 어텐션 메커니즘보다 더 효율적이고 최적화되어 있습니다. 세 가지 유형의 스케일된 내적 어텐션을 지원합니다.
 
-C++로 구현된 스케일드 닷 프로덕션 어텐션.
+- [FlashAttention2](https://github.com/Dao-AILab/flash-attention)는 fp16 또는 bf16 torch 타입 모델에서 자동으로 활성화됩니다. 먼저 모델을 적절한 타입으로 캐스팅했는지 확인하세요.
+- [xFormers](https://github.com/facebookresearch/xformers) 또는 Memory-Efficient Attention은 fp32 torch 타입 모델을 지원합니다.
+- C++로 구현된 스케일된 내적 어텐션입니다.
 
-PyTorch 2.1.1+에서는 기본으로 활성화되어 있지만, [~PreTrainedModel.from_pretrained]에서 attn_implementation="sdpa"로 명시할 수도 있습니다.
+SDPA는 PyTorch 2.1.1 버전 이상에서 기본적으로 활성화되어 있지만, [`~PreTrainedModel.from_pretrained`]에서 `attn_implementation="sdpa"`를 설정해 명시적으로 활성화할 수 있습니다.
 
-```python
+```py
 from transformers import AutoModelForCausalLM
 
 model = AutoModelForCausalLM.from_pretrained("meta-llama/Llama-3.1-8B", device_map="auto", attn_implementation="sdpa")

From 7dbe21fb4c98390158c7fa9e2678a6383078e1bf Mon Sep 17 00:00:00 2001
From: D15M4S <dismas5184@gmail.com>
Date: Mon, 21 Jul 2025 20:35:00 +0900
Subject: [PATCH 4/8] fix: Manually added missing backticks

---
 docs/source/ko/perf_train_gpu_one.md | 4 ++--
 1 file changed, 2 insertions(+), 2 deletions(-)

diff --git a/docs/source/ko/perf_train_gpu_one.md b/docs/source/ko/perf_train_gpu_one.md
index 9ace85bdc8a3..f063b0382853 100644
--- a/docs/source/ko/perf_train_gpu_one.md
+++ b/docs/source/ko/perf_train_gpu_one.md
@@ -35,13 +35,13 @@ GPU는 높은 메모리 대역폭과 병렬 처리 능력 덕분에 딥러닝 
 
 ## Trainer[[trainer]]
 
-Trainer는 [TrainingArguments]로 설정할 수 있는 다양한 학습 기능을 제공합니다. 이번 섹션에서는 학습 최적화에 특히 유용한 주요 기능 몇 가지를 살펴봅니다.
+Trainer는 [`TrainingArguments`]로 설정할 수 있는 다양한 학습 기능을 제공합니다. 이번 섹션에서는 학습 최적화에 특히 유용한 주요 기능 몇 가지를 살펴봅니다.
 
 ### 배치 크기[[batch-size]]
 
 배치 크기는 GPU 학습 효율을 좌우하는 가장 중요한 하이퍼파라미터 중 하나로, 메모리 사용량과 학습 속도에 직접적인 영향을 줍니다. 배치 크기를 크게 하면 GPU의 병렬 처리 능력을 극대화하여 학습 속도를 높일 수 있습니다. 일반적으로 8, 64, 128, 256, 512처럼 2의 거듭제곱 값을 사용하는 것이 좋습니다. 적절한 배치 크기는 GPU 사양과 모델의 데이터 타입에 따라 달라집니다.
 
-배치 크기는 [TrainingArguments]의 [~TrainingArguments.per_device_train_batch_size] 옵션으로 설정합니다.
+배치 크기는 [`TrainingArguments`]의 [`~TrainingArguments.per_device_train_batch_size`] 옵션으로 설정합니다.
 
 ```py
 from transformers import TrainingArguments

From 925de8a0e75f7efa5ec3714221586be53c269cc0 Mon Sep 17 00:00:00 2001
From: =?UTF-8?q?=EB=B0=95=EC=A2=85=EB=B2=94?=
 <122260287+D15M4S@users.noreply.github.com>
Date: Thu, 24 Jul 2025 10:04:19 +0900
Subject: [PATCH 5/8] Update docs/source/ko/perf_train_gpu_one.md

fix: remove space between heading and GPU anchor

Co-authored-by: YONGSANG <71686691+4N3MONE@users.noreply.github.com>
---
 docs/source/ko/perf_train_gpu_one.md | 2 +-
 1 file changed, 1 insertion(+), 1 deletion(-)

diff --git a/docs/source/ko/perf_train_gpu_one.md b/docs/source/ko/perf_train_gpu_one.md
index f063b0382853..6f6af65e6ce9 100644
--- a/docs/source/ko/perf_train_gpu_one.md
+++ b/docs/source/ko/perf_train_gpu_one.md
@@ -13,7 +13,7 @@ rendered properly in your Markdown viewer.
 
 -->
 
-# GPU [[gpu]]
+# GPU[[gpu]]
 
 GPU는 높은 메모리 대역폭과 병렬 처리 능력 덕분에 딥러닝 모델 학습에 널리 사용됩니다. GPU 사양과 모델 크기에 따라 수십억 개 매개변수를 가진 모델도 학습할 수 있습니다. 핵심은 GPU 메모리 활용도(데이터 처리량/학습 시간)와 학습 속도 사이에서 최적의 균형을 찾는 것입니다.
 

From 6c97485cc133a0e547f2ff1d626ea29fc2eee93f Mon Sep 17 00:00:00 2001
From: =?UTF-8?q?=EB=B0=95=EC=A2=85=EB=B2=94?=
 <122260287+D15M4S@users.noreply.github.com>
Date: Thu, 24 Jul 2025 10:06:48 +0900
Subject: [PATCH 6/8] Update docs/source/ko/perf_train_gpu_one.md

fix: clarify table headers to indicate training speed boost and memory savings

Co-authored-by: YONGSANG <71686691+4N3MONE@users.noreply.github.com>
---
 docs/source/ko/perf_train_gpu_one.md | 2 +-
 1 file changed, 1 insertion(+), 1 deletion(-)

diff --git a/docs/source/ko/perf_train_gpu_one.md b/docs/source/ko/perf_train_gpu_one.md
index 6f6af65e6ce9..52eadd9165c2 100644
--- a/docs/source/ko/perf_train_gpu_one.md
+++ b/docs/source/ko/perf_train_gpu_one.md
@@ -21,7 +21,7 @@ GPU는 높은 메모리 대역폭과 병렬 처리 능력 덕분에 딥러닝 
 
 아래 표를 참고하면 자신의 학습 시나리오에 적합한 기능을 빠르게 파악할 수 있습니다.
 
-| 기능                        | 학습 속도 | 메모리 사용량 |
+| 기능                        | 학습 속도 가속 | 메모리 사용량 절약 |
 | --------------------------- | --------- | ------------- |
 | 배치 크기                   | 예        | 예            |
 | 그레이디언트 누적           | 아니요    | 예            |

From 205b05ce6933461e93b7a7f8b8db266de1558282 Mon Sep 17 00:00:00 2001
From: =?UTF-8?q?=EB=B0=95=EC=A2=85=EB=B2=94?=
 <122260287+D15M4S@users.noreply.github.com>
Date: Fri, 25 Jul 2025 19:53:33 +0900
Subject: [PATCH 7/8] Update docs/source/ko/perf_train_gpu_one.md

fix: improve readability

Co-authored-by: Woojun Jung <46880056+jungnerd@users.noreply.github.com>
---
 docs/source/ko/perf_train_gpu_one.md | 2 +-
 1 file changed, 1 insertion(+), 1 deletion(-)

diff --git a/docs/source/ko/perf_train_gpu_one.md b/docs/source/ko/perf_train_gpu_one.md
index 52eadd9165c2..1f755d2544e7 100644
--- a/docs/source/ko/perf_train_gpu_one.md
+++ b/docs/source/ko/perf_train_gpu_one.md
@@ -17,7 +17,7 @@ rendered properly in your Markdown viewer.
 
 GPU는 높은 메모리 대역폭과 병렬 처리 능력 덕분에 딥러닝 모델 학습에 널리 사용됩니다. GPU 사양과 모델 크기에 따라 수십억 개 매개변수를 가진 모델도 학습할 수 있습니다. 핵심은 GPU 메모리 활용도(데이터 처리량/학습 시간)와 학습 속도 사이에서 최적의 균형을 찾는 것입니다.
 
-이 가이드는 Transformers와 PyTorch에서 GPU에서 모델을 효율적으로 학습하기 위해 제공하는 기능을 소개합니다. 대부분 이러한 기능을 조합하여 학습을 최적화하는 것이 좋습니다.
+이 가이드는 Transformers와 PyTorch에서 GPU를 활용해 모델을 효율적으로 학습하기 위해 제공하는 기능을 소개합니다. 대부분의 경우, 이 기능들을 조합해서 학습을 최적화하는 것이 좋습니다.
 
 아래 표를 참고하면 자신의 학습 시나리오에 적합한 기능을 빠르게 파악할 수 있습니다.
 

From b23ce763467445071a53db7617129175fbb57c44 Mon Sep 17 00:00:00 2001
From: =?UTF-8?q?=EB=B0=95=EC=A2=85=EB=B2=94?=
 <122260287+D15M4S@users.noreply.github.com>
Date: Fri, 25 Jul 2025 19:55:28 +0900
Subject: [PATCH 8/8] Update docs/source/ko/perf_train_gpu_one.md

fix: rephrase explanation of data preloading to improve readability

Co-authored-by: Woojun Jung <46880056+jungnerd@users.noreply.github.com>
---
 docs/source/ko/perf_train_gpu_one.md | 2 +-
 1 file changed, 1 insertion(+), 1 deletion(-)

diff --git a/docs/source/ko/perf_train_gpu_one.md b/docs/source/ko/perf_train_gpu_one.md
index 1f755d2544e7..656da59adbcd 100644
--- a/docs/source/ko/perf_train_gpu_one.md
+++ b/docs/source/ko/perf_train_gpu_one.md
@@ -193,7 +193,7 @@ args = TrainingArguments(
 
 ### 데이터 사전 적재[[data-preloading]]
 
-데이터 사전 적재는 GPU가 지속적으로 작업할 수 있도록 CPU에서 미리 배치 단위의 데이터를 적재하고 준비하는 기능입니다. 이를 통해 GPU 유휴 시간을 줄이고 활용도를 높일 수 있습니다. GPU가 항상 작업을 계속하도록 하려면 다음 데이터 사전 적재를 위한 두 가지 방법을 사용할 수 있습니다.
+데이터 사전 적재(Data preloading)는 GPU가 지속적으로 작업할 수 있도록 CPU에서 미리 배치 단위의 데이터를 적재하고 준비하는 기능입니다. 이를 통해 GPU 유휴 시간을 줄이고 활용도를 높일 수 있습니다. GPU가 항상 작업을 계속하도록 하려면 다음 데이터 사전 적재를 위한 두 가지 방법을 사용할 수 있습니다.
 
 1. 데이터를 저장할 고정 메모리를 CPU에 할당한 뒤, 이를 GPU로 직접 전송합니다.
 2. CPU 스레드 및 워커 수를 늘려 데이터를 더 빠르게 사전 적재합니다.