Add MFU logging support #6434
Conversation
Summary of Changes

Hello @y2logic, I'm Gemini Code Assist! I'm currently reviewing this pull request and will post my feedback shortly. In the meantime, here's a summary to help you and other reviewers quickly get up to speed.

This pull request introduces a significant enhancement to the training pipeline by integrating Model FLOPs Utilization (MFU) logging. This feature aims to provide developers and researchers with critical performance metrics, enabling them to better understand and optimize the efficiency of their hardware during model training. The implementation offers a flexible approach to determining theoretical device TFLOPS, either through direct user input via an environment variable or through an automatic estimation process, ensuring robust MFU reporting across different setups.

Highlights
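For context, MFU is conventionally defined as achieved FLOPS divided by the theoretical peak of the hardware. A minimal sketch of the computation (illustrative names, not the PR's exact API):

```python
def model_flops_utilization(total_flos, elapsed_s, peak_tflops_per_device, num_devices):
    # Achieved FLOPS over the measured window
    achieved = total_flos / elapsed_s
    # Theoretical peak across all devices (TFLOPS -> FLOPS)
    peak = peak_tflops_per_device * 1e12 * num_devices
    return achieved / peak
```

For example, a job that performed 1e12 FLOPs in one second on a single device with a 2 TFLOPS peak would report an MFU of 0.5.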
Code Review
This pull request introduces a PerfMetricsLogCallback to compute and log Model FLOPs Utilization (MFU), which is a valuable addition for performance monitoring. The implementation is well-structured, offering device TFLOPS estimation and configuration via environment variables. My review focuses on improving robustness, maintainability, and consistency. I've identified a bug in a helper function, suggested refactoring to reduce code duplication, recommended leveraging existing utilities, and pointed out some dead code. Addressing these points will enhance the quality and reliability of the new callback.
Thank you for the PR. Please change the Chinese comments to English comments.
swift/plugin/callback.py (outdated)

    'RTX 3070 Ti': 21.75e12
}
...
extra_callbacks = [PerfMetricsLogCallback()]
Please do not enable by default.
Thanks for your comment. Is a switch argument required?
All comments are now in English, and I have left the extra_callbacks untouched.
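For reference, one common way to keep such a callback opt-in is to gate its registration behind an environment flag. A sketch with hypothetical names (`ENABLE_PERF_METRICS` is not an actual ms-swift variable, and the callback class is stubbed here):

```python
import os

class PerfMetricsLogCallback:  # stand-in for the PR's callback class
    pass

extra_callbacks = []
# Register the MFU callback only when explicitly requested
if os.environ.get('ENABLE_PERF_METRICS', '0') == '1':
    extra_callbacks.append(PerfMetricsLogCallback())
```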
Use Python < 3.12 and run the commands below:

pip install pre-commit
pre-commit run --all-files
It was an IDE formatting problem. It has been corrected.
/gemini review
Code Review
This pull request introduces a PerfMetricsLogCallback to compute and log Model FLOPs Utilization (MFU), which is a valuable metric for performance monitoring. The overall approach is sound, but there are several areas for improvement regarding correctness, robustness, and efficiency. My review highlights a critical bug in the TFLOPS estimation for CPU environments, points out that the efficient lookup-table-based TFLOPS retrieval is implemented but unused and buggy, and suggests other improvements like handling potential division-by-zero errors and using more appropriate data types. Addressing these points will make the new callback more robust and reliable.
elif backend == 'cpu':
    torch.cpu.synchronize(sync_device)
The helper function device_synchronize has a bug when the backend is 'cpu'. torch.cpu.synchronize does not exist in PyTorch, and calling it will cause an AttributeError. CPU operations are typically synchronous, so an explicit synchronization is not needed.
Suggested change:

-elif backend == 'cpu':
-    torch.cpu.synchronize(sync_device)
+elif backend == 'cpu':
+    pass  # CPU operations are synchronous
The function torch.cpu.synchronize exists and has been tested.
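Both positions can be accommodated with a defensive lookup: `torch.cpu.synchronize` is present in recent PyTorch releases (where it is effectively a no-op, since CPU execution is synchronous), but probing for it avoids an AttributeError on older versions. A sketch of what such a helper could look like (the PR's actual `device_synchronize` may differ):

```python
import importlib

def device_synchronize(backend, sync_device=None):
    """Backend-agnostic synchronize sketch; assumed helper shape, not the PR's code."""
    try:
        torch = importlib.import_module('torch')
    except ImportError:
        return  # no torch available; nothing to synchronize
    if backend == 'cuda' and torch.cuda.is_available():
        torch.cuda.synchronize(sync_device)
    elif backend == 'cpu':
        # Present in recent PyTorch; a harmless no-op for synchronous CPU ops
        sync = getattr(getattr(torch, 'cpu', None), 'synchronize', None)
        if sync is not None:
            sync(sync_device)
```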
def _retrieve_flops_from_map(device):
    """Retrieve theoretical FLOPS from Map."""
    device_name = device.get_device_name()
    flops = None
    for name, value in device_flops_map.items():
        if name in device_name:
            flops = value
            break
    return flops
This method has a few issues and is currently not used:
1. device.get_device_name() will raise an AttributeError, as torch.device objects do not have this method. You should use torch.cuda.get_device_name(device) for CUDA devices and handle other device types gracefully.
2. The values in device_flops_map are in FLOPS, but this function should return TFLOPS to be consistent with other parts of the callback. The result should be divided by 1e12.
3. This function is currently unused. It should be called in on_init_end to provide an efficient way to determine TFLOPS.
def _retrieve_flops_from_map(device):
    """Retrieve theoretical FLOPS from Map and convert to TFLOPS."""
    if device.type != 'cuda':
        # Add other supported device types like 'npu' if needed
        return None
    device_name = torch.cuda.get_device_name(device)
    flops = None
    for name, value in device_flops_map.items():
        if name in device_name:
            flops = value
            break
    if flops is not None:
        return flops / 1e12
    return None

from swift.utils import get_current_device, get_device_count, get_env_args

# Top priority. Specify by ENV
tflops = get_env_args('DEVICE_TFLOPS', int, None)
DEVICE_TFLOPS is parsed as an integer, which might be too restrictive as TFLOPS values are often floating-point numbers (e.g., from the estimation or lookup table). Using float would be more appropriate and consistent.
Suggested change:

-tflops = get_env_args('DEVICE_TFLOPS', int, None)
+tflops = get_env_args('DEVICE_TFLOPS', float, None)
It seems float is correct.
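The difference is easy to see with a stand-in for the helper (the real swift.utils.get_env_args signature may differ; this is a minimal sketch, and the value below is just a sample):

```python
import os

def get_env_args(name, dtype, default):
    # Hypothetical stand-in mirroring swift.utils.get_env_args
    raw = os.environ.get(name)
    return dtype(raw) if raw is not None else default

# Parsing as float accepts '312' and '148.5' alike; int('148.5') would raise
os.environ['DEVICE_TFLOPS'] = '148.5'
tflops = get_env_args('DEVICE_TFLOPS', float, None)
```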
else:
    # Run a estimating test.
    dtype = kwargs.get('model').dtype
    device = torch.device(get_current_device())
    logger.info(f'Estimating device TFLOPS baseline. Device: [{device}] dtype: [{dtype}]')
    tflops = self._estimate_device_tflops_by_dtype(device, dtype)
    logger.info(f'Estimate test finished. [{tflops} TFLOPS] Device count: [{device_count}]')
The _retrieve_flops_from_map function provides an efficient way to get device TFLOPS from a lookup table. It should be called as a fallback before running the performance estimation test. This avoids running a potentially time-consuming benchmark if the device's performance is already known.
Suggested change:

-else:
-    # Run a estimating test.
-    dtype = kwargs.get('model').dtype
-    device = torch.device(get_current_device())
-    logger.info(f'Estimating device TFLOPS baseline. Device: [{device}] dtype: [{dtype}]')
-    tflops = self._estimate_device_tflops_by_dtype(device, dtype)
-    logger.info(f'Estimate test finished. [{tflops} TFLOPS] Device count: [{device_count}]')
+else:
+    device = torch.device(get_current_device())
+    tflops = self._retrieve_flops_from_map(device)
+    if tflops is not None:
+        device_name = torch.cuda.get_device_name(device) if device.type == 'cuda' else str(device)
+        logger.info(f'Retrieved TFLOPS from lookup table for {device_name}: {tflops} TFLOPS')
+    else:
+        # Run an estimating test.
+        dtype = kwargs.get('model').dtype
+        logger.info(f'Estimating device TFLOPS baseline. Device: [{device}] dtype: [{dtype}]')
+        tflops = self._estimate_device_tflops_by_dtype(device, dtype)
+        logger.info(f'Estimate test finished. [{tflops} TFLOPS] Device count: [{device_count}]')
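The estimation approach itself typically times a large matrix multiplication and divides the known operation count by the elapsed time. A rough, CPU-only NumPy sketch of the idea (the PR's _estimate_device_tflops_by_dtype runs on the actual training device and dtype, so this is only illustrative):

```python
import time
import numpy as np

def estimate_device_tflops(size=1024, iters=5):
    """Estimate sustained matmul throughput in TFLOPS (illustrative sketch)."""
    # A square (size x size) @ (size x size) matmul costs 2 * size**3 FLOPs
    a = np.random.rand(size, size).astype(np.float32)
    b = np.random.rand(size, size).astype(np.float32)
    a @ b  # warm-up, excluded from timing
    start = time.perf_counter()
    for _ in range(iters):
        a @ b
    elapsed = time.perf_counter() - start
    return (2 * size**3 * iters) / elapsed / 1e12
```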
def on_log(self, args: TrainingArguments, state: TrainerState, control: TrainerControl, logs=None, **kwargs):
    total_flos = getattr(state, 'total_flos', 0)
    actual_flops = total_flos / self.elapsed
There is a potential ZeroDivisionError here if self.elapsed is 0. This could happen if on_log is called before on_step_end has completed. It's safer to add a guard for this case.
Suggested change:

-actual_flops = total_flos / self.elapsed
+actual_flops = total_flos / self.elapsed if self.elapsed > 0 else 0
    return tflops

@staticmethod
def _retrieve_flops_from_map(device):
I noticed this function is not being used. Just curious, what's the reason?
More data should be collected, and hopefully the function _retrieve_flops_from_map can become a more efficient and accurate way to obtain device TFLOPS. This avoids running a potentially time-consuming benchmark if the device's performance is already known.
def _retrieve_flops_from_map(device):
    """Retrieve theoretical FLOPS from Map."""
    device_name = device.get_device_name()
What is the get_device_name function?
Hi, have all the changes been committed? I'm ready to merge.
Yes. All changes committed.
PR type
PR information
Implement a callback plugin to add MFU metrics to the log.
Related to issue Add MFU (Model FLOPs Utilization) logging support #5791
DEVICE_TFLOPS. (Higher priority)
Experiment results