Conversation

Contributor

@huydhn huydhn commented Oct 7, 2025

The first iteration of the script is at vllm-eval-harness/run_vllm_eval_harness.py. The rest are lm-eval configurations adapted from the format used by vLLM CI at https://github.com/vllm-project/vllm/tree/main/.buildkite/lm-eval-harness/configs. There are a couple of tweaks to the format:

  • Specify the device type, e.g. B200, so that we can run the same task on multiple device types on CI if needed
  • Specify --tensor-parallel-size to dictate how many devices are needed to evaluate the model
  • The config directory is structured in the ORG/MODEL format to make it easier to find the right config (see the sketch below)
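
As a rough illustration of that layout, config discovery could look like the following hypothetical sketch. Key names other than tensor_parallel_size, trust_remote_code, and max_model_len (which appear in the script or the description above) are assumptions, not the exact schema used by run_vllm_eval_harness.py:

# Hypothetical sketch of config discovery under the ORG/MODEL layout.
from pathlib import Path
from typing import Any, Dict, Iterator, Tuple

import yaml


def discover_configs(configs_dir: str) -> Iterator[Tuple[str, Dict[str, Any]]]:
    root = Path(configs_dir)
    for path in sorted(root.rglob("*.yaml")):
        # e.g. configs/openai/gpt-oss-20b/gsm8k.yaml -> "openai/gpt-oss-20b"
        model_name = "/".join(path.relative_to(root).parts[:2])
        yield model_name, yaml.safe_load(path.read_text())


for model_name, config in discover_configs("configs"):
    device = config.get("device", "B200")            # device type the config targets (assumed key)
    tp_size = config.get("tensor_parallel_size", 1)  # number of GPUs needed for the model
    print(model_name, device, tp_size)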

Testing

This script can be run locally on B200 with

# Installing vLLM and PyTorch beforehand goes without saying
pip install lm-eval==0.4.9.1

# Run everything that can be run under the configs directory
python3 run_vllm_eval_harness.py --configs-dir configs
# Or run just a selected list of models
python3 run_vllm_eval_harness.py --configs-dir configs --models "openai/gpt-oss-20b,openai/gpt-oss-120b"
# There is only one lm-eval task at the moment, gsm8k; more can be added later
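
Under the hood the script is essentially a thin wrapper around lm_eval.simple_evaluate with the vllm backend. A minimal sketch of the core call is below; the argument values are illustrative, not the exact ones the script builds from a config:

import lm_eval

# Illustrative model args; the real values come from the per-model YAML config.
model_args = ",".join(
    [
        "pretrained=openai/gpt-oss-20b",
        "tensor_parallel_size=2",       # from the config's tensor-parallel-size field
        "max_model_len=8192",
        "trust_remote_code=False",
    ]
)

results = lm_eval.simple_evaluate(
    model="vllm",
    model_args=model_args,
    tasks=["gsm8k"],        # the only task wired up so far
    batch_size="auto",
)
print(results["results"]["gsm8k"])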

Next steps

  1. Run lm-eval for these models on the 2.9 RC vs. the 2.8 baseline
  2. A CI workflow will be added later in a separate PR
  3. I don't have all the evaluation baselines at the moment, so I will leave them empty for now and populate them using the CI results from step (1) once they are available

cc @zhewenl @yeqcharlotte

huydhn added 4 commits October 6, 2025 21:57
@meta-cla meta-cla bot added the cla signed label Oct 7, 2025
@huydhn huydhn requested a review from yangw-dev October 7, 2025 07:06
@huydhn huydhn changed the title from "Add a script to run lm-eval" to "Add a script to run lm-eval on OSS" Oct 7, 2025
@huydhn huydhn requested review from atalman and seemethere October 7, 2025 22:41
Member

seemethere commented Oct 7, 2025

Any reason to store the configs in yaml at all?

Would it be better to just define these in python?

I see that we're just using what upstream vllm is using.

Contributor Author

huydhn commented Oct 7, 2025

I see that we're just using what upstream vllm is using.

Yeah, this is just to follow the same format that vLLM is using, with more models added.
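
For comparison, defining the same information in Python instead of YAML might look something like the following hypothetical sketch (not what this PR does; the field values are made up):

from dataclasses import dataclass, field
from typing import List


@dataclass
class EvalConfig:
    model_name: str                          # ORG/MODEL, e.g. "openai/gpt-oss-20b"
    tasks: List[str] = field(default_factory=lambda: ["gsm8k"])
    device: str = "B200"                     # device type the config targets
    tensor_parallel_size: int = 1            # number of GPUs needed
    max_model_len: int = 8192
    trust_remote_code: bool = False


# Illustrative entries only; the tensor_parallel_size values are assumptions.
CONFIGS = [
    EvalConfig(model_name="openai/gpt-oss-20b", tensor_parallel_size=2),
    EvalConfig(model_name="openai/gpt-oss-120b", tensor_parallel_size=8),
]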

In vllm-eval-harness/run_vllm_eval_harness.py:

    model_name: str, tasks: List[str], tp_size: int, config: Dict[str, Any]
) -> Dict[str, Any]:
    trust_remote_code = config.get("trust_remote_code", False)
    max_model_len = config.get("max_model_len", 8192)


This will likely impact the result; ideally it's set to auto.

Contributor Author

It looks like lm-eval's vLLM backend doesn't like that value and ends up with this error:

/usr/local/lib/python3.12/dist-packages/torch/cuda/__init__.py:63: FutureWarning: The pynvml package is deprecated. Please install nvidia-ml-py instead. If you did not install pynvml directly, please report this to the maintainers of the package that installed pynvml for you.
  import pynvml  # type: ignore[import]
INFO 10-10 01:22:45 [__init__.py:215] Automatically detected platform cuda.
Traceback (most recent call last):
  File "/vllm-workspace/pytorch-integration-testing/vllm-eval-harness/run_vllm_eval_harness.py", line 186, in <module>
    main()
  File "/vllm-workspace/pytorch-integration-testing/vllm-eval-harness/run_vllm_eval_harness.py", line 182, in main
    run_lm_eval(args.configs_dir, models, tasks)
  File "/vllm-workspace/pytorch-integration-testing/vllm-eval-harness/run_vllm_eval_harness.py", line 164, in run_lm_eval
    results = run(model_name, selected_tasks, tp_size, config)
              ^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^
  File "/vllm-workspace/pytorch-integration-testing/vllm-eval-harness/run_vllm_eval_harness.py", line 105, in run
    return lm_eval.simple_evaluate(
           ^^^^^^^^^^^^^^^^^^^^^^^^
  File "/usr/local/lib/python3.12/dist-packages/lm_eval/utils.py", line 456, in _wrapper
    return fn(*args, **kwargs)
           ^^^^^^^^^^^^^^^^^^^
  File "/usr/local/lib/python3.12/dist-packages/lm_eval/evaluator.py", line 245, in simple_evaluate
    lm = lm_eval.api.registry.get_model(model).create_from_arg_string(
         ^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^
  File "/usr/local/lib/python3.12/dist-packages/lm_eval/api/model.py", line 155, in create_from_arg_string
    return cls(**args, **args2)
           ^^^^^^^^^^^^^^^^^^^^
  File "/usr/local/lib/python3.12/dist-packages/lm_eval/models/vllm_causallms.py", line 170, in __init__
    "max_model_len": int(self._max_length) if self._max_length else None,
                     ^^^^^^^^^^^^^^^^^^^^^
ValueError: invalid literal for int() with base 10: 'auto'

Let me take a closer look.
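
One possible workaround (just a sketch, not what the script currently does): treat "auto" as "let vLLM infer the context length" and only forward max_model_len to lm-eval when it is an integer. The helper below is hypothetical:

def build_model_args(model_name: str, tp_size: int, config: dict) -> str:
    # Hypothetical helper: skip max_model_len when the config says "auto" (or omits it),
    # so lm-eval never calls int("auto") and vLLM derives the length from the model config.
    args = [
        f"pretrained={model_name}",
        f"tensor_parallel_size={tp_size}",
        f"trust_remote_code={config.get('trust_remote_code', False)}",
    ]
    max_model_len = config.get("max_model_len", "auto")
    if isinstance(max_model_len, int):
        args.append(f"max_model_len={max_model_len}")
    return ",".join(args)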
