Detecting and recognising text in natural scene images is a challenging computer vision task. The text may appear against a cluttered background containing diverse image tokens, text in various font types and sizes, and character-like patterns that are not text at all. Text spotting is nonetheless an essential component of numerous AI and computer vision systems, such as autonomous robots, multilingual machine translation from image inputs, and assistive technology more generally.
In this case study, we leverage convolutional neural networks (CNNs) for robust text spotting. In particular, we use CNNs to recognise the words, and recurrent neural networks (RNNs) to decode the CNN output sequence into a word string.
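To make the final step concrete, here is a minimal sketch of turning a sequence of per-frame character scores into a word string. The case study does not state its decoding rule, so this sketch assumes a CTC-style greedy decoder: take the best-scoring character in each frame, merge consecutive repeats, and drop a blank symbol. The names `BLANK`, `ALPHABET`, `greedy_decode` and `one_hot` are hypothetical, and the one-hot frames are a toy stand-in for what a trained recurrent stage would emit.

```python
from itertools import groupby

BLANK = "-"  # hypothetical blank symbol; an assumption of this sketch
ALPHABET = BLANK + "abcdefghijklmnopqrstuvwxyz"


def greedy_decode(frame_scores, alphabet=ALPHABET, blank=BLANK):
    """Greedy CTC-style decoding: argmax per frame, merge repeats, drop blanks."""
    # Index of the best-scoring character in every frame.
    best = [max(range(len(scores)), key=scores.__getitem__)
            for scores in frame_scores]
    # Merge runs of identical labels into a single label.
    collapsed = [alphabet[index] for index, _ in groupby(best)]
    # Remove the blank symbol that separates genuine repeats.
    return "".join(ch for ch in collapsed if ch != blank)


def one_hot(ch, alphabet=ALPHABET):
    """Toy stand-in for a network output: all score mass on one character."""
    return [1.0 if c == ch else 0.0 for c in alphabet]


# Eight frames labelled "cc-aa-t-" collapse to the word "cat".
frames = [one_hot(ch) for ch in "cc-aa-t-"]
word = greedy_decode(frames)
```

Under this assumed scheme the blank is what lets a word contain a doubled letter: frames labelled `hel-lo` decode to `hello`, whereas `hello` without the blank would collapse to `helo`.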