Introduction
Text detection is a major problem in optical character recognition (OCR), and various solutions have been attempted by different researchers. The Connectionist Text Proposal Network (CTPN) has been among the most successful at scene text detection, that is, detecting text in images whose background is an ordinary street scene or a billboard. It was later realized that the same model can be very useful for text detection in scanned images as well, and that started the journey of implementing CTPN for text detection in OCR.
Why Does the Traditional Approach Need a Revamp?
Current approaches for text detection mostly employ a bottom-up pipeline: they start from low-level character or stroke detection, which is typically followed by a number of subsequent steps: non-text component filtering, text line construction and text line verification. These multi-step bottom-up approaches are generally complicated, less robust and less reliable, and are thus not widely adopted for text detection.
In addition, their performance is heavily dependent on the results of character detection, for which connected-components methods and sliding-window methods have been proposed. These methods commonly explore low-level features to distinguish text candidates from the background. However, they are not robust, because they identify individual strokes or characters in isolation, without context information. These limitations lead to a large number of non-text components in character detection, causing major difficulties in the subsequent steps. Furthermore, these false detections accumulate easily through the sequential bottom-up pipeline.
New Approach
The new approach defined in this paper directly localizes text sequences in convolutional layers. This overcomes a number of major limitations of previous bottom-up approaches built on character detection.
Let’s now look at the key points of this approach:
- The problem of text detection is solved by localizing a sequence of fine-scale text proposals. An anchor regression mechanism jointly predicts the vertical location and the text/non-text score of each text proposal, resulting in excellent localization accuracy. This departs from the RPN's prediction of a whole object, which struggles to provide satisfactory localization accuracy for text (a sketch of the vertical regression follows this list).
- The approach proposes an in-network recurrence mechanism that elegantly connects sequential text proposals in the convolutional feature maps. This connection allows the detector to explore meaningful context information of a text line, making it powerful enough to detect extremely challenging text reliably.
- The method is able to handle multi-scale and multi-lingual text in a single process, avoiding further post-filtering or refinement.
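For the vertical anchor regression mentioned in the first point, the paper predicts relative vertical coordinates with respect to an anchor: v_c = (c_y - c_y_a) / h_a for the center and v_h = log(h / h_a) for the height. A minimal Python sketch of this encoding and decoding (illustrative only, not the repository's code):

import math

# Encode a ground-truth vertical center cy and height h relative to an
# anchor's center cy_a and height h_a (the paper's v_c and v_h).
def encode_vertical(cy, h, cy_a, h_a):
    return (cy - cy_a) / h_a, math.log(h / h_a)

# Invert the encoding to recover the predicted center and height.
def decode_vertical(v_c, v_h, cy_a, h_a):
    return v_c * h_a + cy_a, math.exp(v_h) * h_a

# Round trip with arbitrary numbers
print(decode_vertical(*encode_vertical(120.0, 48.0, 112.0, 45.0), 112.0, 45.0))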
From a computation perspective, this approach achieves new state-of-the-art results on a number of benchmarks, significantly improving recent results (e.g., 0.88 F-measure over 0.83 on ICDAR 2013, and 0.61 F-measure over 0.54 on ICDAR 2015). Furthermore, it is computationally efficient, with a 0.14 s/image running time (on ICDAR 2013) using the very deep VGG16 model.
Connectionist Text Proposal Network (CTPN)
CTPN is essentially a fully convolutional network that accepts an input image of arbitrary size. It detects a text line by densely sliding a small window over the convolutional feature maps, and outputs a sequence of fine-scale (e.g., fixed 16-pixel-width) text proposals.
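To make the fine-scale mechanism concrete: every proposal has a fixed width of 16 pixels (the stride of conv5 relative to the input image), and at each sliding-window position the network scores k vertical anchors of varying height. A small sketch of such an anchor set, assuming the paper's scheme of k = 10 heights from 11 to roughly 273 pixels, each about 1/0.7 of the previous (treat the exact values as an assumption):

# k vertical anchors per position: fixed 16-pixel width, varying heights.
def vertical_anchors(k=10, base_height=11, ratio=1 / 0.7):
    heights = [round(base_height * ratio ** i) for i in range(k)]
    return [(16, h) for h in heights]  # (width, height) pairs

print(vertical_anchors())
# [(16, 11), (16, 16), (16, 22), (16, 32), (16, 46), (16, 65), ...]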
Looking at the CTPN architecture diagram, the key highlights are (a minimal code sketch follows this list):
- We densely slide a 3×3 spatial window through the last convolutional maps (conv5) of the VGG16 model.
- The sequential windows in each row are recurrently connected by a Bi-directional LSTM (BLSTM), where the convolutional feature (3×3×C) of each window is used as input of the 256D BLSTM (including two 128D LSTMs).
- The RNN layer is connected to a 512D fully-connected layer, followed by the output layer, which jointly predicts text/non-text scores, y-axis coordinates and side-refinement offsets of k anchors.
- The CTPN outputs sequential fixed-width fine-scale text proposals. The color of each box indicates its text/non-text score; only boxes with positive scores are shown.
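A minimal sketch of this head in code, assuming TensorFlow 2 with eager execution (the repository targets TensorFlow 1.15 and structures this differently; the 3×3 window is approximated here by a 3×3 convolution, and all shapes are illustrative):

import tensorflow as tf

def ctpn_head(conv5, k=10):
    # conv5: VGG16 conv5 feature map of shape [N, H, W, 512]
    x = tf.keras.layers.Conv2D(512, 3, padding='same', activation='relu')(conv5)
    n, h, w = tf.shape(x)[0], tf.shape(x)[1], tf.shape(x)[2]
    # Fold height into the batch so the BLSTM runs along each feature-map row
    x = tf.reshape(x, [n * h, w, 512])
    x = tf.keras.layers.Bidirectional(
        tf.keras.layers.LSTM(128, return_sequences=True))(x)  # 256D BLSTM
    x = tf.reshape(x, [n, h, w, 256])
    fc = tf.keras.layers.Dense(512, activation='relu')(x)     # 512D FC layer
    scores = tf.keras.layers.Dense(2 * k)(fc)    # text/non-text per anchor
    vertical = tf.keras.layers.Dense(2 * k)(fc)  # y-center and height per anchor
    side = tf.keras.layers.Dense(k)(fc)          # side-refinement per anchor
    return scores, vertical, side

# Shape check with a dummy feature map
s, v, o = ctpn_head(tf.random.normal([1, 14, 40, 512]))
print(s.shape, v.shape, o.shape)  # (1, 14, 40, 20) (1, 14, 40, 20) (1, 14, 40, 10)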
If you want to go deeper into CTPN and also learn how to train it using the ICDAR SROIE dataset, you can explore it with the live demo.
CTPN Implementation
This is a scene text detection implementation based on CTPN (Connectionist Text Proposal Network), implemented in TensorFlow. The original paper can be found here, and the code can be downloaded from this GitHub link.
Steps for CTPN Implementation:
Step 1. Clone the repository using the following command in an Ubuntu environment:
git clone https://github.com/indiantechwarrior/text-detection-ctpn.git
Step 2. cd text-detection-ctpn
pip install -r requirements.txt (or pip3 install -r requirements.txt)
Make sure you have tensorflow==1.15.0 installed; the command for the same is:
pip install tensorflow==1.15.0 (or pip3 install tensorflow==1.15.0)
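You can verify the installed version with:

python -c "import tensorflow as tf; print(tf.__version__)"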
Step 3. cd utils/bbox
chmod +x make.sh
./make.sh
This will generate nms.so and bbox.so in the current folder.
Running CTPN with a pretrained model
Step 1. Download the checkpoints file. It can be downloaded from Google Drive:
https://drive.google.com/file/d/1HcZuB_MHqsKhKEKpfF1pEU85CYy4OlWO/view
Step 2. Put checkpoints_mlt/ in text-detection-ctpn/ and put your images in data/demo.
Step 3. Execute the demo Python file:
python ./main/demo.py
Step 4. The results will be saved in data/res.
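If you want to overlay the detected boxes yourself, a small helper like the one below can be used. It assumes each line of a result text file in data/res stores the box corners as comma-separated values x1,y1,x2,y2,x3,y3,x4,y4 (verify this against the files your run actually produces; the file names are hypothetical):

import cv2
import numpy as np

def draw_boxes(image_path, result_path, out_path):
    # Read the source image and draw each detected quadrilateral on it.
    img = cv2.imread(image_path)
    with open(result_path) as f:
        for line in f:
            vals = [int(float(v)) for v in line.strip().split(',')[:8]]
            pts = np.array(vals, dtype=np.int32).reshape((-1, 1, 2))
            cv2.polylines(img, [pts], isClosed=True, color=(0, 255, 0), thickness=2)
    cv2.imwrite(out_path, img)

draw_boxes('data/demo/sample.jpg', 'data/res/sample.txt', 'data/res/sample_boxes.jpg')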
Training your own CTPN model
Step 1. Download the pre-trained VGG16 model and put it at data/vgg_16.ckpt. You can download it from tensorflow/models.
Step 2. Download the dataset we prepared from Google Drive. Put the downloaded data in data/dataset/mlt, then start the training.
Step 3. Alternatively, you can prepare your own dataset according to the following steps. If you want to use the existing dataset, skip to Step 7.
Step 4. Modify DATA_FOLDER and OUTPUT in utils/prepare/split_label.py according to your dataset, then run split_label.py from the root directory:
python ./utils/prepare/split_label.py
It will generate the prepared data in data/dataset/.
Step 5. An example of the input file format for split_label.py can be found in gt_img_859.txt, and the corresponding output file is img_859.txt.
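For illustration only (verify against the bundled gt_img_859.txt), ICDAR-style ground-truth files store one quadrilateral plus a transcription per line, e.g.:

377,117,463,117,465,130,378,130,Genaxis

split_label.py then splits such boxes into the fixed-width 16-pixel proposals that the network is trained on.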
Step 6. Modify the path for DATA_FOLDER in utils/dataset/data_provider.py.
Step 7. To use the pre-trained checkpoints and train your dataset on top of them, increase max_steps (60000) in main/train.py to a higher number; adding 10000 is normally a good starting point.
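For example, if the flag currently reads 60000, raising it by 10000 would look like this (the exact flag definition in main/train.py may differ slightly):

tf.app.flags.DEFINE_integer('max_steps', 70000, '')  # was 60000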
Step 8. Execute the Python code:
python ./main/train.py
Step 9. After training completes, you will find updated checkpoints, which can then be utilized. Note that when using the new checkpoints, don't forget to update the filename in the 'checkpoint' file (originally this came bundled with the downloaded pre-trained checkpoints).
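The 'checkpoint' file is a small plain-text index that TensorFlow uses to locate the latest weights. After training, it should point at your newest checkpoint, for example (the step count here is hypothetical):

model_checkpoint_path: "ctpn_70000.ckpt"
all_model_checkpoint_paths: "ctpn_70000.ckpt"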
Now you can run python ./main/demo.py to validate the results with the updated checkpoints.