Show HN: Benchmarking VLMs vs. Traditional OCR
Vision models have been gaining popularity as a replacement for traditional OCR, especially with Gemini 2.0 becoming cost-competitive with the cloud platforms.
We've been continuously evaluating different models since we released the Zerox package last year (https://github.com/getomni-ai/zerox), and we wanted to put some numbers behind it. So we're open-sourcing our internal OCR benchmark + evaluation datasets.
Full writeup + data explorer here: https://getomni.ai/ocr-benchmark
Github: https://github.com/getomni-ai/benchmark
Huggingface: https://huggingface.co/datasets/getomni-ai/ocr-benchmark
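If you want to poke at the evaluation data directly, it should load with the standard Hugging Face `datasets` API. A minimal sketch (the dataset ID comes from the link above; check the dataset card for the actual splits and columns):

    # pip install datasets
    from datasets import load_dataset

    # Dataset ID taken from the Hugging Face link above.
    ds = load_dataset("getomni-ai/ocr-benchmark")

    print(ds)                     # shows the available splits and columns
    first_split = next(iter(ds))  # don't assume a particular split name
    print(ds[first_split][0])     # peek at one record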
A couple of notes on the methodology:
1. We are using JSON accuracy as our primary metric. The end goal is to evaluate how well each OCR provider can prepare the data for LLM ingestion.
2. This differs from a lot of OCR benchmarks because it doesn't rely on text similarity. We believe text-similarity measurements are heavily biased towards the exact layout of the ground truth text and penalize correct OCR output that has slight layout differences.
3. Every document goes Image => OCR => Predicted JSON, and we compare the predicted JSON against the annotated ground truth JSON. The VLMs are capable of Image => JSON directly, but we are primarily trying to measure OCR accuracy here. We're planning to release a separate report on direct JSON accuracy next week.
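For a sense of what "JSON accuracy" means in practice, here is a simplified sketch of one way to score a predicted JSON against the ground truth: flatten both objects to leaf fields and count exact matches. This is illustrative only, not the exact scoring in the benchmark repo (which may handle arrays, fuzzy matches, etc. differently):

    # Illustrative "JSON accuracy": the fraction of ground-truth leaf fields
    # that the predicted JSON reproduces exactly.
    def flatten(obj, prefix=""):
        """Flatten nested dicts/lists into ("a.b[0]", value) leaf pairs."""
        if isinstance(obj, dict):
            for k, v in obj.items():
                yield from flatten(v, f"{prefix}.{k}" if prefix else k)
        elif isinstance(obj, list):
            for i, v in enumerate(obj):
                yield from flatten(v, f"{prefix}[{i}]")
        else:
            yield prefix, obj

    def json_accuracy(truth: dict, predicted: dict) -> float:
        truth_leaves = dict(flatten(truth))
        pred_leaves = dict(flatten(predicted))
        if not truth_leaves:
            return 1.0
        hits = sum(1 for k, v in truth_leaves.items() if pred_leaves.get(k) == v)
        return hits / len(truth_leaves)

    # Example: 2 of 3 ground-truth fields match (currency case differs).
    truth = {"invoice": {"total": 1200, "currency": "USD"}, "line_items": [{"sku": "A1"}]}
    pred  = {"invoice": {"total": 1200, "currency": "usd"}, "line_items": [{"sku": "A1"}]}
    print(json_accuracy(truth, pred))  # ~0.67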
This is very much a work in progress! There are at least 10 additional providers we plan to add to the list.
The next big roadmap items are:
- Comparing OCR vs. direct extraction. Early results here show a slight accuracy improvement, but it varies a lot with page length.
- A multilingual comparison. Right now the evaluation data is English only.
- A breakdown of the data by type (best model for handwriting, tables, charts, photos, etc.)
OCR seems to be mostly solved for 'normal' text laid out according to Latin-alphabet norms (left to right, normal spacing, etc.), but would love to see more adversarial examples. We've seen lots of regressions around faxed or scanned documents where the text boxes may be slightly rotated (e.g. https://www.cad-notes.com/autocad-tip-rotate-multiple-texts-...), not to mention handwriting and poorly scanned docs.
Then there's contextually dependent information, like X-axis labels that are implicit from a legend somewhere, so it's not clear even with the bounding boxes what the numbers refer to. This is where VLMs really shine: they can extract the text and then use similar examples from the page to map it into the output values when the bounding box doesn't provide this for free.
Thank you for sharing this. Some of the other public models that we can host ourselves may perform better in practice than the models listed - e.g. Qwen 2.5 VL: https://github.com/QwenLM/Qwen2.5-VL?tab=readme-ov-file
What VLMs do you use when you list OmniAI - is it mostly wrapping the model providers, like your zerox repo?