Transformer-Based OCR Model: How the OCR Decoder Works
As you probably already know, Optical Character Recognition (OCR) is the electronic conversion of images of typed, handwritten, or printed text into machine-encoded text. The source can be a scanned document, a photo of a document, or subtitle text superimposed on an image.
Let's understand how an OCR pipeline works before we dig deeper into transformer-based OCR.
A typical OCR pipeline consists of two modules.
- A Text Detection Module
- A Text Recognition Module
Text Detection Module
The text detection module, as the name suggests, detects where text is present in the source. It aims to localize all the text blocks within the image, at either word level (individual words) or text-line level.
This task is comparable to an object detection problem, except that the objects of interest are text blocks. Popular object detection models include YOLOv4/v5, Detectron, and Mask R-CNN.
Text Recognition Module
The text recognition module aims to understand the content of each detected text block and convert the visual signals into natural language tokens.
A typical text recognition module consists of two sub-modules.
- Image Understanding Module
- Word Piece Generation Module
The workflow of the text recognition module is as follows.
- Each localized text box is resized to a fixed size, let's say 224x224, and passed as input to the image understanding module, which is typically a CNN (for example, a ResNet with self-attention).
- The image features from a particular network depth are extracted and passed as input to the Word Piece Generation Module, which is an RNN-based network. The output of this RNN is the machine-encoded text of the localized text box.
- Using an appropriate loss function, the text recognition module is trained until its performance stops improving.
What makes transformer-based OCR different?
Transformer-based OCR (TrOCR) is an end-to-end transformer model for text recognition. It is one of the first works to jointly leverage pre-trained image and text transformers.
The model has two halves: a Vision Transformer (ViT) encoder, shown on the left-hand side of the architecture diagram, and a RoBERTa (text transformer) decoder, shown on the right-hand side.
Vision Transformer (ViT) or Encoder:
An image is split into fixed-size patches, where each patch is treated like a token in a sentence. The image patches are flattened (2D → 1D), linearly projected, and combined with positional embeddings. The resulting sequence is propagated through the transformer encoder layers.
In the case of OCR, the input is a series of localized text boxes. To keep the text boxes consistent, the image region of each box is resized to HxW. The image is then decomposed into N = HW/(PxP) patches, each of size PxP, where P is the patch size.
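The patch arithmetic can be checked with a short sketch. This is a minimal illustration in plain Python, not the TrOCR implementation: the function names are hypothetical, the image is a single-channel list of rows, and H and W are assumed to be divisible by P.

```python
def num_patches(height, width, patch):
    """Number of non-overlapping P x P patches in an H x W image: N = HW / P^2."""
    if height % patch or width % patch:
        raise ValueError("H and W must be divisible by the patch size P")
    return (height // patch) * (width // patch)


def split_into_patches(image, patch):
    """Split a 2D image (list of rows) into flattened patches in row-major order."""
    height, width = len(image), len(image[0])
    num_patches(height, width, patch)  # raises if the image cannot be tiled exactly
    patches = []
    for top in range(0, height, patch):
        for left in range(0, width, patch):
            # Flatten one P x P block (2D -> 1D), row by row.
            flat = [image[r][c]
                    for r in range(top, top + patch)
                    for c in range(left, left + patch)]
            patches.append(flat)
    return patches


# Toy 4x6 single-channel "image" with P = 2 gives 6 patches of 4 values each.
toy_image = [[r * 6 + c for c in range(6)] for r in range(4)]
toy_patches = split_into_patches(toy_image, 2)
```

With illustrative values H = W = 384 and P = 16, `num_patches(384, 384, 16)` gives 576 patches, so the encoder sees a sequence of a few hundred "tokens" per text box.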
After that, the patches are flattened and linearly projected to D-dimensional vectors, which are the patch embeddings. The patch embeddings and two special tokens are given learnable 1D position embeddings according to their absolute positions. Then, the input sequence is passed through a stack of identical encoder layers.
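A rough sketch of this embedding step is shown below. It assumes toy sizes and random values in place of learned parameters; `embed_patches` and `rand_vec` are hypothetical names chosen for illustration, and a real model would use a tensor library instead of nested lists.

```python
import random


def embed_patches(patches, weight, bias, special_tokens, pos_embed):
    """Project each flattened patch to D dims, prepend special tokens, add positions."""
    # Linear projection: one weight row and one bias value per output dimension.
    projected = [
        [sum(x * w for x, w in zip(patch, row)) + b for row, b in zip(weight, bias)]
        for patch in patches
    ]
    sequence = special_tokens + projected
    if len(sequence) != len(pos_embed):
        raise ValueError("need one position embedding per sequence element")
    # Add the position embedding that matches each element's absolute position.
    return [
        [value + pos for value, pos in zip(token, position)]
        for token, position in zip(sequence, pos_embed)
    ]


rng = random.Random(0)
N, PATCH_DIM, D = 6, 4, 8  # toy sizes: 6 patches of 2x2 pixels, model width 8


def rand_vec(n):
    # Random stand-in for parameters that would be learned during training.
    return [rng.uniform(-0.1, 0.1) for _ in range(n)]


patches = [rand_vec(PATCH_DIM) for _ in range(N)]
weight = [rand_vec(PATCH_DIM) for _ in range(D)]
bias = rand_vec(D)
special_tokens = [rand_vec(D), rand_vec(D)]
pos_embed = [rand_vec(D) for _ in range(N + 2)]

encoder_input = embed_patches(patches, weight, bias, special_tokens, pos_embed)
```

The resulting sequence has N + 2 elements, each of width D, which is what the stack of encoder layers consumes.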
Each transformer layer has a multi-head self-attention module and a fully connected feed-forward network. Both parts are followed by a residual connection and layer normalization.
Note: Residual connections help gradients flow during backpropagation.
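The "sublayer, then residual connection, then layer normalization" pattern can be sketched as follows. This is a simplified illustration: the learnable scale and shift of layer normalization are omitted, and `toy_feed_forward` is a hypothetical stand-in for the real feed-forward network.

```python
import math


def layer_norm(vec, eps=1e-5):
    """Normalize a vector to zero mean and (approximately) unit variance."""
    mean = sum(vec) / len(vec)
    var = sum((v - mean) ** 2 for v in vec) / len(vec)
    return [(v - mean) / math.sqrt(var + eps) for v in vec]


def residual_block(vec, sublayer):
    """Apply a sublayer, add the input back (residual), then layer-normalize."""
    out = sublayer(vec)
    return layer_norm([v + o for v, o in zip(vec, out)])


def toy_feed_forward(vec):
    # Hypothetical stand-in for the feed-forward network: ReLU of a scaled input.
    return [max(0.0, 2.0 * v) for v in vec]


normalized = residual_block([1.0, -2.0, 3.0, 0.5], toy_feed_forward)
```

Because the input is added back before normalization, the sublayer only has to learn a correction to its input, and the identity path gives gradients a direct route through the stack.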
RoBERTa or Decoder:
The output embeddings from a certain depth of the ViTransformers are extracted and passed as input to the decoder module.
The decoder module is also a transformer with a stack of identical layers whose structure is similar to that of the encoder layers, except that the decoder inserts an “encoder-decoder attention” module between the multi-head self-attention and the feed-forward network, so that each decoding step can attend to different parts of the encoder output. In the encoder-decoder attention module, the keys and values come from the encoder output, while the queries come from the decoder input.
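To make the roles of queries, keys, and values concrete, here is a minimal single-head sketch of encoder-decoder attention. It is a simplification made for illustration: the learned query, key, and value projections and the multiple heads are left out, so the encoder outputs serve directly as keys and values and the decoder states as queries.

```python
import math


def softmax(scores):
    """Numerically stable softmax over a list of scores."""
    peak = max(scores)
    exps = [math.exp(s - peak) for s in scores]
    total = sum(exps)
    return [e / total for e in exps]


def cross_attention(queries, keys, values):
    """Scaled dot-product attention: queries from the decoder, keys/values from the encoder."""
    dim = len(keys[0])
    outputs, all_weights = [], []
    for q in queries:
        scores = [sum(qi * ki for qi, ki in zip(q, k)) / math.sqrt(dim) for k in keys]
        weights = softmax(scores)
        # Weighted sum of encoder values for this decoder position.
        mixed = [sum(w * v[j] for w, v in zip(weights, values))
                 for j in range(len(values[0]))]
        outputs.append(mixed)
        all_weights.append(weights)
    return outputs, all_weights


# Toy data: three encoder outputs (image patches), two decoder states (text tokens).
encoder_out = [[1.0, 0.0], [0.0, 1.0], [-1.0, -1.0]]
decoder_states = [[4.0, 0.0], [0.0, 4.0]]
attended, attn_weights = cross_attention(decoder_states, encoder_out, encoder_out)
```

Each row of `attn_weights` sums to 1 and shows how strongly one decoder position attends to each encoder position; in this toy setup the first decoder state attends mostly to the first encoder output and the second to the second.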
The embeddings from the decoder are projected from the model dimension (768) to the vocabulary size V (50265).
The softmax function then calculates the probabilities over the vocabulary, and beam search is used to get the final output.
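The decoding step can be sketched with a small beam search. Everything here is a toy under stated assumptions: `toy_step_logits` is a hypothetical stand-in for "decoder plus linear projection" that returns logits over a 4-token vocabulary instead of 50265, and the search uses plain summed log-probabilities with no length normalization.

```python
import math


def log_softmax(logits):
    """Log-probabilities over the vocabulary, computed stably."""
    peak = max(logits)
    log_total = peak + math.log(sum(math.exp(x - peak) for x in logits))
    return [x - log_total for x in logits]


def beam_search(step_logits, bos, eos, beam_width=2, max_len=10):
    """Keep the beam_width best partial sequences at every step."""
    beams = [([bos], 0.0)]
    finished = []
    for _ in range(max_len):
        candidates = []
        for tokens, score in beams:
            log_probs = log_softmax(step_logits(tokens))
            for tok, lp in enumerate(log_probs):
                candidates.append((tokens + [tok], score + lp))
        candidates.sort(key=lambda c: c[1], reverse=True)
        beams = []
        for tokens, score in candidates[:beam_width]:
            if tokens[-1] == eos:
                finished.append((tokens, score))  # hypothesis is complete
            else:
                beams.append((tokens, score))
        if not beams:
            break
    finished.extend(beams)  # keep unfinished hypotheses if max_len was reached
    return max(finished, key=lambda c: c[1])[0]


def toy_step_logits(prefix):
    # Hypothetical model: vocabulary 0=<s>, 1=</s>, 2="a", 3="b"; prefers "a", "b", </s>.
    target = [2, 3, 1]
    pos = len(prefix) - 1
    want = target[pos] if pos < len(target) else 1
    return [5.0 if tok == want else 0.0 for tok in range(4)]


best = beam_search(toy_step_logits, bos=0, eos=1)
```

Here `best` is the token sequence `[0, 2, 3, 1]`, that is `<s> a b </s>`. Compared with greedy decoding, keeping several hypotheses alive lets the search recover when the single most likely next token leads to a worse overall sequence.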
Advantages:
- TrOCR is an end-to-end transformer-based OCR model for text recognition that jointly leverages pre-trained image (CV) and text (NLP) transformers, which its authors describe as a first for the text recognition task in OCR.
- TrOCR achieves state-of-the-art accuracy with a standard transformer-based encoder-decoder model, which is convolution-free and does not rely on any complex pre- or post-processing steps.
References:
TrOCR: Transformer-based Optical Character Recognition with Pre-trained Models
https://arxiv.org/pdf/2109.10282.pdf
An Image is Worth 16x16 Words: Transformers for Image Recognition at Scale