Conversation

Liuziyu77

olmOCRBench is a document-understanding and OCR-oriented benchmark for evaluating large vision-language models (LVLMs) on real-world document parsing and text recognition.
This PR integrates olmOCRBench into VLMEvalKit, providing a complete evaluation pipeline for the benchmark's validation split.
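
Evaluations in VLMEvalKit are typically launched through the `run.py` entry point; below is a minimal sketch of what an olmOCRBench run might look like, wrapped in Python for self-containment. The dataset key `olmOCRBench` and model name `InternVL3_5-8B` are assumptions; check the registries touched by this PR for the exact identifiers.

```python
# A minimal sketch of launching the new benchmark via VLMEvalKit's run.py CLI.
# NOTE: the dataset key "olmOCRBench" and the model name "InternVL3_5-8B" are
# assumptions; the exact identifiers are defined in this PR's registries.
import subprocess

subprocess.run(
    ["python", "run.py", "--data", "olmOCRBench", "--model", "InternVL3_5-8B"],
    check=True,  # raise if the evaluation run exits with a non-zero status
)
```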

Evaluated results (InternVL3.5-8B):

| Model | ArXiv | Base | Hdr/Ftr | TinyTxt | MultCol | OldScan | OldMath | Tables | Overall |
|---|---|---|---|---|---|---|---|---|---|
| InternVL3.5-8B | 55.8 | 98.6 | 82.2 | 68.8 | 67.2 | 33.5 | 62.2 | 74.1 | 67.8 |

@Liuziyu77 (Author)

Support Ocean-OCR Bench and MAT-Bench.
Ocean-OCR Bench is a benchmark from Ocean-OCR; it measures edit distance, F1 score, precision, recall, BLEU, and METEOR.

| Model | Edit Distance↓ (en) | Edit Distance↓ (zh) | F1-Score↑ (en) | F1-Score↑ (zh) | Precision↑ (en) | Precision↑ (zh) | Recall↑ (en) | Recall↑ (zh) | BLEU↑ (en) | BLEU↑ (zh) | METEOR↑ (en) | METEOR↑ (zh) |
|---|---|---|---|---|---|---|---|---|---|---|---|---|
| InternVL3.5 (GitHub code) | 0.1071 | 0.2971 | 0.7988 | 0.8078 | 0.8218 | 0.8506 | 0.7799 | 0.7823 | 0.6317 | 0.5198 | 0.7830 | 0.6971 |
| InternVL3.5 (VLMEvalKit) | 0.1387 | 0.3331 | 0.7640 | 0.7888 | 0.7889 | 0.8512 | 0.7462 | 0.7476 | 0.6141 | 0.4840 | 0.7497 | 0.6656 |
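
For context on the string-level metrics above, here is a minimal sketch, assuming a length-normalized Levenshtein distance and SQuAD-style bag-of-tokens overlap for precision/recall/F1. The benchmark's exact normalization and tokenization (especially for zh) may differ, and BLEU/METEOR would typically come from a standard toolkit such as nltk (not shown).

```python
# Minimal metric sketches; whitespace tokenization is an assumption and will
# not match the benchmark's handling of Chinese text exactly.
from collections import Counter


def normalized_edit_distance(pred: str, ref: str) -> float:
    """Levenshtein distance divided by the longer string's length (lower is better)."""
    m, n = len(pred), len(ref)
    dp = list(range(n + 1))  # one-row DP over the reference string
    for i in range(1, m + 1):
        prev, dp[0] = dp[0], i
        for j in range(1, n + 1):
            cur = dp[j]
            dp[j] = min(
                dp[j] + 1,                          # delete from pred
                dp[j - 1] + 1,                      # insert into pred
                prev + (pred[i - 1] != ref[j - 1])  # substitute (free on match)
            )
            prev = cur
    return dp[n] / max(m, n, 1)


def token_prf(pred: str, ref: str) -> tuple[float, float, float]:
    """Token-level precision, recall, and F1 via bag-of-tokens overlap."""
    pred_toks, ref_toks = pred.split(), ref.split()
    common = sum((Counter(pred_toks) & Counter(ref_toks)).values())
    if common == 0:
        return 0.0, 0.0, 0.0
    p, r = common / len(pred_toks), common / len(ref_toks)
    return p, r, 2 * p * r / (p + r)
```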

MAT-Bench (Multi-modal Agentic Tool Bench) is a benchmark with two settings, MAT-Search and MAT-Coding, designed to evaluate LVLMs' agentic search and coding abilities.

| Metric | Qwen2.5-VL-7B (VLMEvalKit) | Qwen2.5-VL-7B (GitHub repo) |
|---|---|---|
| code_f1_avg | 34.41 | 32.12 |
| code_em_avg | 22.00 | 21.50 |
| search_f1_avg | 50.16 | 53.49 |
| search_em_avg | 43.62 | 46.67 |
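
The `*_em_avg` columns are averages of per-sample exact match after text normalization, and `*_f1_avg` uses token-level F1 as sketched earlier. Below is a minimal sketch of EM, assuming SQuAD-style normalization; the PR's evaluator may normalize differently.

```python
# Exact-match sketch; the normalization rules here are an assumption.
import re
import string


def normalize(text: str) -> str:
    """Lowercase, drop punctuation and articles, collapse whitespace (SQuAD-style)."""
    text = text.lower()
    text = "".join(ch for ch in text if ch not in set(string.punctuation))
    text = re.sub(r"\b(a|an|the)\b", " ", text)
    return " ".join(text.split())


def exact_match(pred: str, ref: str) -> float:
    return float(normalize(pred) == normalize(ref))


# Averaging over a split yields a score like search_em_avg (illustrative data):
preds, refs = ["The Eiffel Tower", "42"], ["Eiffel Tower", "43"]
em_avg = 100 * sum(exact_match(p, r) for p, r in zip(preds, refs)) / len(preds)
print(f"em_avg = {em_avg:.2f}")  # 50.00
```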
