Optical Character Recognition (OCR)
313 papers with code • 5 benchmarks • 42 datasets
Optical Character Recognition or Optical Character Reader (OCR) is the electronic or mechanical conversion of images of typed, handwritten, or printed text into machine-encoded text. The source may be a scanned document, a photo of a document, a scene photo (for example, text on signs and billboards in a landscape photo, or license plates on cars), or subtitle text superimposed on an image (for example, from a television broadcast).
Libraries
Use these libraries to find Optical Character Recognition (OCR) models and implementations
Subtasks
Latest papers
Adversarial Training with OCR Modality Perturbation for Scene-Text Visual Question Answering
Scene-Text Visual Question Answering (ST-VQA) aims to understand scene text in images and answer questions related to the text content.
DeepSeek-VL: Towards Real-World Vision-Language Understanding
The DeepSeek-VL family (both 1.3B and 7B models) showcases superior user experiences as a vision-language chatbot in real-world applications, achieving state-of-the-art or competitive performance across a wide range of visual-language benchmarks at the same model size while maintaining robust performance on language-centric benchmarks.
TextMonkey: An OCR-Free Large Multimodal Model for Understanding Document
We present TextMonkey, a large multimodal model (LMM) tailored for text-centric tasks.
Syntactic Language Change in English and German: Metrics, Parsers, and Convergences
Even though we have evidence that recent parsers trained on modern treebanks are not heavily affected by data 'noise' such as spelling changes and OCR errors in our historic data, we find that results on syntactic language change are sensitive to the parsers involved, which cautions against relying on a single parser for evaluating syntactic language change, as done in previous work.
TEXTRON: Weakly Supervised Multilingual Text Detection through Data Programming
In order to solve this problem, we propose TEXTRON, a Data Programming-based approach, where users can plug various text detection methods into a weak supervision-based learning framework.
SPHINX-X: Scaling Data and Parameters for a Family of Multi-modal Large Language Models
We propose SPHINX-X, an extensive Multimodality Large Language Model (MLLM) series developed upon SPHINX.
MouSi: Poly-Visual-Expert Vision-Language Models
This technique introduces a fusion network to unify the processing of outputs from different visual experts, while bridging the gap between image encoders and pre-trained LLMs.
Efficient Multi-domain Text Recognition Deep Neural Network Parameterization with Residual Adapters
Recent advancements in deep neural networks have markedly enhanced the performance of computer vision tasks, yet the specialized nature of these networks often necessitates extensive data and high computational power.
An Empirical Study of Scaling Law for OCR
The laws of model size, data volume, computation and model performance have been extensively studied in the field of Natural Language Processing (NLP).
Vary: Scaling up the Vision Vocabulary for Large Vision-Language Models
Accordingly, we propose Vary, an efficient and effective method to scale up the vision vocabulary of LVLMs.