Conversation

Contributor

@huydhn huydhn commented Oct 7, 2025

The first iteration of the script is at vllm-eval-harness/run_vllm_eval_harness.py. The rest are lm-eval configurations adapted from the format used by vLLM CI at https://github.com/vllm-project/vllm/tree/main/.buildkite/lm-eval-harness/configs. There are a couple of tweaks to the format:

  • Specify the device type, e.g. B200, so that we can run the same task on multiple device types on CI if needed
  • Specify --tensor-parallel-size to dictate how many devices are needed to evaluate the model
  • The config directory is structured in the ORG/MODEL format to make it easier to find the right config (see the sketch below)
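
As a rough illustration of that layout, config discovery could look like the following hypothetical sketch. Key names other than tensor_parallel_size, trust_remote_code, and max_model_len (which appear in the script or the description above) are assumptions, not the exact schema used by run_vllm_eval_harness.py:

# Hypothetical sketch of config discovery under the ORG/MODEL layout.
from pathlib import Path
from typing import Any, Dict, Iterator, Tuple

import yaml


def discover_configs(configs_dir: str) -> Iterator[Tuple[str, Dict[str, Any]]]:
    root = Path(configs_dir)
    for path in sorted(root.rglob("*.yaml")):
        # e.g. configs/openai/gpt-oss-20b/gsm8k.yaml -> "openai/gpt-oss-20b"
        model_name = "/".join(path.relative_to(root).parts[:2])
        yield model_name, yaml.safe_load(path.read_text())


for model_name, config in discover_configs("configs"):
    device = config.get("device", "B200")            # device type the config targets (assumed key)
    tp_size = config.get("tensor_parallel_size", 1)  # number of GPUs needed for the model
    print(model_name, device, tp_size)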

Testing

This script can be run locally on B200 with

# Installing vLLM and PyTorch beforehand goes without saying
pip install lm-eval==0.4.9.1

# Run everything that can be run under the configs directory
python3 run_vllm_eval_harness.py --configs-dir configs
# Or run just a selected list of models
python3 run_vllm_eval_harness.py --configs-dir configs --models "openai/gpt-oss-20b,openai/gpt-oss-120b"
# There is only one lm-eval task at the moment, gsm8k; more can be added later
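
Under the hood the script is essentially a thin wrapper around lm_eval.simple_evaluate with the vllm backend. A minimal sketch of the core call is below; the argument values are illustrative, not the exact ones the script builds from a config:

import lm_eval

# Illustrative model args; the real values come from the per-model YAML config.
model_args = ",".join(
    [
        "pretrained=openai/gpt-oss-20b",
        "tensor_parallel_size=2",       # from the config's tensor-parallel-size field
        "max_model_len=8192",
        "trust_remote_code=False",
    ]
)

results = lm_eval.simple_evaluate(
    model="vllm",
    model_args=model_args,
    tasks=["gsm8k"],        # the only task wired up so far
    batch_size="auto",
)
print(results["results"]["gsm8k"])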

Next steps

  1. Run lm-eval for these models on the 2.9 RC vs. the 2.8 baseline
  2. A CI workflow will be added later in a separate PR
  3. I don't have all the evaluation baselines at the moment, so I will leave them empty for now and populate them using the CI results from step (1) once they are available

cc @zhewenl @yeqcharlotte

huydhn added 4 commits October 6, 2025 21:57
@meta-cla meta-cla bot added the cla signed label Oct 7, 2025
@huydhn huydhn requested a review from yangw-dev October 7, 2025 07:06
@huydhn huydhn changed the title from "Add a script to run lm-eval" to "Add a script to run lm-eval on OSS" Oct 7, 2025
@huydhn huydhn requested review from atalman and seemethere October 7, 2025 22:41
Member

seemethere commented Oct 7, 2025

Any reason to store the configs in yaml at all?

Would it be better to just define these in python?

I see that we're just using what upstream vllm is using.

Contributor Author

huydhn commented Oct 7, 2025

I see that we're just using what upstream vllm is using.

Yeah, this is just to follow the same format that vLLM is using, with more models added.
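
For comparison, defining the same information in Python instead of YAML might look something like the following hypothetical sketch (not what this PR does; the field values are made up):

from dataclasses import dataclass, field
from typing import List


@dataclass
class EvalConfig:
    model_name: str                          # ORG/MODEL, e.g. "openai/gpt-oss-20b"
    tasks: List[str] = field(default_factory=lambda: ["gsm8k"])
    device: str = "B200"                     # device type the config targets
    tensor_parallel_size: int = 1            # number of GPUs needed
    max_model_len: int = 8192
    trust_remote_code: bool = False


# Illustrative entries only; the tensor_parallel_size values are assumptions.
CONFIGS = [
    EvalConfig(model_name="openai/gpt-oss-20b", tensor_parallel_size=2),
    EvalConfig(model_name="openai/gpt-oss-120b", tensor_parallel_size=8),
]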

In vllm-eval-harness/run_vllm_eval_harness.py:

    model_name: str, tasks: List[str], tp_size: int, config: Dict[str, Any]
) -> Dict[str, Any]:
    trust_remote_code = config.get("trust_remote_code", False)
    max_model_len = config.get("max_model_len", 8192)


This will likely impact the result; ideally it's set to auto.

Contributor Author

It looks like lm-eval's vLLM backend doesn't like that value and ends up with this error:

/usr/local/lib/python3.12/dist-packages/torch/cuda/__init__.py:63: FutureWarning: The pynvml package is deprecated. Please install nvidia-ml-py instead. If you did not install pynvml directly, please report this to the maintainers of the package that installed pynvml for you.
  import pynvml  # type: ignore[import]
INFO 10-10 01:22:45 [__init__.py:215] Automatically detected platform cuda.
Traceback (most recent call last):
  File "/vllm-workspace/pytorch-integration-testing/vllm-eval-harness/run_vllm_eval_harness.py", line 186, in <module>
    main()
  File "/vllm-workspace/pytorch-integration-testing/vllm-eval-harness/run_vllm_eval_harness.py", line 182, in main
    run_lm_eval(args.configs_dir, models, tasks)
  File "/vllm-workspace/pytorch-integration-testing/vllm-eval-harness/run_vllm_eval_harness.py", line 164, in run_lm_eval
    results = run(model_name, selected_tasks, tp_size, config)
              ^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^
  File "/vllm-workspace/pytorch-integration-testing/vllm-eval-harness/run_vllm_eval_harness.py", line 105, in run
    return lm_eval.simple_evaluate(
           ^^^^^^^^^^^^^^^^^^^^^^^^
  File "/usr/local/lib/python3.12/dist-packages/lm_eval/utils.py", line 456, in _wrapper
    return fn(*args, **kwargs)
           ^^^^^^^^^^^^^^^^^^^
  File "/usr/local/lib/python3.12/dist-packages/lm_eval/evaluator.py", line 245, in simple_evaluate
    lm = lm_eval.api.registry.get_model(model).create_from_arg_string(
         ^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^
  File "/usr/local/lib/python3.12/dist-packages/lm_eval/api/model.py", line 155, in create_from_arg_string
    return cls(**args, **args2)
           ^^^^^^^^^^^^^^^^^^^^
  File "/usr/local/lib/python3.12/dist-packages/lm_eval/models/vllm_causallms.py", line 170, in __init__
    "max_model_len": int(self._max_length) if self._max_length else None,
                     ^^^^^^^^^^^^^^^^^^^^^
ValueError: invalid literal for int() with base 10: 'auto'

Let me take a closer look.
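
One possible workaround (just a sketch, not what the script currently does): treat "auto" as "let vLLM infer the context length" and only forward max_model_len to lm-eval when it is an integer. The helper below is hypothetical:

def build_model_args(model_name: str, tp_size: int, config: dict) -> str:
    # Hypothetical helper: skip max_model_len when the config says "auto" (or omits it),
    # so lm-eval never calls int("auto") and vLLM derives the length from the model config.
    args = [
        f"pretrained={model_name}",
        f"tensor_parallel_size={tp_size}",
        f"trust_remote_code={config.get('trust_remote_code', False)}",
    ]
    max_model_len = config.get("max_model_len", "auto")
    if isinstance(max_model_len, int):
        args.append(f"max_model_len={max_model_len}")
    return ",".join(args)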
