MahenOCR: Commercial-Grade OCR with a 1B Parameter VLM

Artifacts available: code repository and HuggingFace collection.

1. Introduction

Modern Optical Character Recognition (OCR) has evolved far beyond simple text extraction. Today's applications require handling complex document parsing, information extraction (IE), and text-centric visual question answering (VQA).

Traditional OCR systems typically rely on cascaded pipelines that string together separate modules for text detection, recognition, and layout analysis. While modular, these pipelines have two weaknesses: error propagation, where a failure in detection carries through every downstream stage, and high maintenance complexity.

Performance Radar Chart — **Figure 1: Performance Comparison.** MahenOCR (1B) achieves competitive or superior performance across multiple benchmarks compared to traditional pipelines and larger VLMs.

2. Architecture

MahenOCR represents a shift to a pure end-to-end Vision-Language Model (VLM) architecture. By unifying diverse tasks into a single framework, we eliminate the need for intermediate pre-processing modules like layout analysis.

MahenOCR achieves high efficiency with a compact 1B parameter footprint, making it suitable for edge and low-latency applications. The architecture consists of three core components:

Native Resolution Visual Encoder: Built on SigLIP-v2-400M, using an adaptive patching mechanism to preserve aspect ratio.
Adaptive MLP Connector: A learnable bridge compressing visual tokens to reduce redundancy.
Lightweight Language Model: Based on the Metanthropic-0.5B model, utilizing XD-RoPE for robust spatial reasoning.
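To make the first two components concrete, here is a minimal sketch of how a native-resolution encoder might choose a patch grid that preserves aspect ratio, and how a connector might then reduce the visual token count. The patch size of 16, the budget of 4096 patches, and the 2x2 merge factor are illustrative assumptions, not values taken from MahenOCR; the function names are hypothetical.

```python
import math


def patch_grid(width, height, patch=16, max_patches=4096):
    """Return (cols, rows) of a patch grid that keeps the image's aspect ratio.

    `patch` and `max_patches` are assumed values for illustration only.
    """
    if width <= 0 or height <= 0:
        raise ValueError("image dimensions must be positive")
    cols = math.ceil(width / patch)
    rows = math.ceil(height / patch)
    if cols * rows > max_patches:
        # Shrink both sides by the same factor so the aspect ratio is preserved.
        scale = math.sqrt(max_patches / (cols * rows))
        cols = max(1, math.floor(cols * scale))
        rows = max(1, math.floor(rows * scale))
    return cols, rows


def tokens_after_merge(cols, rows, merge=2):
    """Token count after merging each merge x merge block of patches into one token.

    The merge factor is an assumption; the source only says tokens are compressed.
    """
    return math.ceil(cols / merge) * math.ceil(rows / merge)


if __name__ == "__main__":
    cols, rows = patch_grid(1600, 800)
    print(cols, rows, tokens_after_merge(cols, rows))
```

A wide image keeps a wide grid rather than being squashed into a square, which is the property the adaptive patching description points at. The merge step shows why compressing tokens before the language model matters: the sequence length drops by roughly the square of the merge factor.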

MahenOCR Architecture — **Figure 2: Model Architecture.** The end-to-end flow from high-resolution visual encoding to auto-regressive text generation.

3. Methodology & Pipeline

Our approach replaces the traditional multi-stage pipeline with a single differentiable flow, so every step is optimized jointly rather than tuned module by module. This lets the model draw on the reasoning capabilities of Large Language Models (LLMs) for reasoning-heavy tasks such as information extraction and text-centric VQA.

End-to-End Pipeline — **Figure 3: Pipeline Comparison.** Contrasting the MahenOCR end-to-end approach with traditional cascaded OCR systems.

Data Pipeline & RLVR

A key innovation is the application of Reinforcement Learning with Verifiable Rewards (RLVR). We employ Group Relative Policy Optimization (GRPO) to fine-tune the model on objective metrics:

Spotting: Intersection over Union (IoU) rewards.
Parsing: Structural integrity rewards (HTML/LaTeX validity).
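As a rough illustration of the spotting reward and the group-relative step in GRPO, the sketch below scores several sampled box predictions against a reference box with IoU, then standardizes the rewards within the group. The box format `(x1, y1, x2, y2)`, the function names, and the sample values are assumptions; the actual reward design used for MahenOCR is not specified here.

```python
from statistics import mean, pstdev


def iou(box_a, box_b):
    """Intersection over Union of two axis-aligned boxes (x1, y1, x2, y2)."""
    ax1, ay1, ax2, ay2 = box_a
    bx1, by1, bx2, by2 = box_b
    inter_w = max(0.0, min(ax2, bx2) - max(ax1, bx1))
    inter_h = max(0.0, min(ay2, by2) - max(ay1, by1))
    inter = inter_w * inter_h
    union = (ax2 - ax1) * (ay2 - ay1) + (bx2 - bx1) * (by2 - by1) - inter
    return inter / union if union > 0 else 0.0


def group_relative_advantages(rewards, eps=1e-6):
    """Standardize rewards within one group of responses sampled for the same prompt.

    Each advantage is (reward - group mean) / (group std + eps).
    """
    mu = mean(rewards)
    sigma = pstdev(rewards)
    return [(r - mu) / (sigma + eps) for r in rewards]


if __name__ == "__main__":
    reference = (10, 10, 110, 40)          # hypothetical ground-truth text box
    samples = [                            # hypothetical boxes from four sampled responses
        (10, 10, 110, 40),
        (12, 11, 108, 42),
        (50, 10, 150, 40),
        (200, 200, 260, 230),
    ]
    rewards = [iou(reference, box) for box in samples]
    print(group_relative_advantages(rewards))
```

Because the advantages are computed relative to the group, responses are rewarded for beating their siblings on the same input, and no separate value network is needed to provide a baseline.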

Data Processing Pipeline — **Figure 4: Data Construction & Training Pipeline.** From raw document sourcing to RL-based fine-tuning.

4. Unified Capabilities

MahenOCR consolidates a wide range of capabilities into a single model, demonstrating robust performance across diverse document intelligence tasks.

4.1 Text Spotting

Jointly detecting and recognizing text with precise coordinate outputs.

Text Spotting Visualization — Figure 5: Multi-oriented text spotting samples.

4.2 Document Parsing

End-to-end conversion of documents into structured Markdown, with tables in HTML and formulas in LaTeX.
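Since tables are emitted as HTML, a verifiable check on parsing output can be as simple as confirming that tags are balanced. The sketch below uses Python's standard `html.parser` for that; it is one possible structural check under our own assumptions, not the validity reward used in training, and `html_is_balanced` is a hypothetical helper name.

```python
from html.parser import HTMLParser

# Elements that never take a closing tag.
VOID_TAGS = {"area", "base", "br", "col", "embed", "hr", "img",
             "input", "link", "meta", "source", "track", "wbr"}


class _BalanceChecker(HTMLParser):
    """Tracks open tags on a stack and flags any mismatched closing tag."""

    def __init__(self):
        super().__init__()
        self.stack = []
        self.balanced = True

    def handle_starttag(self, tag, attrs):
        if tag not in VOID_TAGS:
            self.stack.append(tag)

    def handle_endtag(self, tag):
        if tag in VOID_TAGS:
            return
        if not self.stack or self.stack.pop() != tag:
            self.balanced = False


def html_is_balanced(fragment):
    """Return True if every non-void opening tag has a matching closing tag."""
    checker = _BalanceChecker()
    checker.feed(fragment)
    checker.close()
    return checker.balanced and not checker.stack


if __name__ == "__main__":
    print(html_is_balanced("<table><tr><td>Total</td><td>42</td></tr></table>"))
    print(html_is_balanced("<table><tr><td>Total</tr></table>"))
```

This check is stricter than the HTML standard, which allows some closing tags such as `</td>` to be omitted. For a reward signal or an output filter that strictness is usually acceptable, because it pushes toward unambiguous markup.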

Document Parsing Example — Figure 6: Complex layout parsing results.

4.3 Translation & Information Extraction

Direct image-to-text translation and structured JSON extraction from receipts and IDs.
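When consuming structured extraction output, it helps to validate the response before passing it downstream. The sketch below assumes the model returns a single JSON object as text; the field names `merchant` and `total` are hypothetical examples for a receipt and are not a documented MahenOCR schema.

```python
import json


def parse_extraction(raw, required_keys):
    """Parse a response expected to be a JSON object.

    Returns the parsed dict, or None if the text is not valid JSON,
    is not an object, or lacks any of the required keys.
    """
    try:
        data = json.loads(raw)
    except json.JSONDecodeError:
        return None
    if not isinstance(data, dict):
        return None
    if any(key not in data for key in required_keys):
        return None
    return data


if __name__ == "__main__":
    response = '{"merchant": "Corner Cafe", "total": "12.50"}'  # hypothetical output
    print(parse_extraction(response, ["merchant", "total"]))
    print(parse_extraction("not json", ["merchant", "total"]))
```

Returning `None` instead of raising keeps the caller's control flow simple: a failed parse can trigger a retry or a fallback without exception handling at every call site.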

Document Translation — Figure 7: End-to-End Translation

5. Conclusion

MahenOCR demonstrates that a well-designed, data-driven 1B parameter model can rival the performance of significantly larger models and commercial APIs. By combining a streamlined end-to-end architecture with targeted reinforcement learning, we provide a robust, open-source foundation for the next generation of industrial OCR applications.

Tags: OCR, Vision-Language Models, Reinforcement Learning, Efficient AI

Author

Metanthropic & Ekjot Singh

Research Lab • Director & Lead Researcher
