Image captioning is both an exciting and challenging task as it combines two major fields in artificial intelligence: computer vision and natural language processing.
An image captioning system must therefore understand both image semantics and natural language. Put simply, the goal of image captioning is to describe the content of an image in natural language.
In this case study, we shall use a convolutional neural network (CNN) encoder paired with a recurrent neural network (RNN) decoder. In this framework, the CNN encodes the image into a fixed-length vector representation, and the RNN generates a textual description from that vector.
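To make the encoder–decoder data flow concrete, here is a minimal, untrained sketch in NumPy. All sizes, weight matrices, and the greedy decoding loop are illustrative assumptions, not the actual architecture used later: the "CNN" is stood in for by a single fixed projection, and the RNN is a plain one-layer recurrence.

```python
import numpy as np

rng = np.random.default_rng(0)

# Toy dimensions (assumptions for illustration only)
IMG_DIM, FEAT, HID, VOCAB, MAX_LEN = 64 * 64 * 3, 256, 128, 1000, 12
EOS = 0  # hypothetical end-of-sentence token id

# Encoder stand-in: maps an image to a fixed-length feature vector
# (a real system would use a trained CNN here)
W_enc = rng.standard_normal((FEAT, IMG_DIM)) * 0.01

def encode(image):
    return np.tanh(W_enc @ image.ravel())

# RNN decoder parameters (random, i.e. untrained)
W_xh = rng.standard_normal((HID, VOCAB)) * 0.01  # previous token -> hidden
W_hh = rng.standard_normal((HID, HID)) * 0.01    # hidden -> hidden
W_fh = rng.standard_normal((HID, FEAT)) * 0.01   # image feature -> initial hidden
W_hy = rng.standard_normal((VOCAB, HID)) * 0.01  # hidden -> vocabulary scores

def decode(feat):
    # Condition the RNN on the image vector via its initial hidden state
    h = np.tanh(W_fh @ feat)
    token, caption = EOS, []
    for _ in range(MAX_LEN):
        x = np.zeros(VOCAB)
        x[token] = 1.0                      # one-hot encoding of previous token
        h = np.tanh(W_xh @ x + W_hh @ h)    # recurrent state update
        token = int(np.argmax(W_hy @ h))    # greedy choice of next token id
        if token == EOS:
            break
        caption.append(token)
    return caption

image = rng.standard_normal((64, 64, 3))
feat = encode(image)        # fixed-length vector representation of the image
caption = decode(feat)      # sequence of (meaningless, untrained) token ids
```

With trained weights, the same loop would emit word ids that a vocabulary lookup turns into a sentence; here it only demonstrates how the fixed-length image vector seeds the decoder's state.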
Image classification, a component of image captioning, is a classic problem in computer vision in its own right. A typical image classification task is: given an image, assign it to one of a fixed set of classes.
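The task definition above can be sketched in a few lines. The label set, image size, and linear scoring function below are illustrative assumptions; a random, untrained weight matrix stands in for a trained model, so only the structure of the task (one score per fixed class, argmax to pick a label) is meaningful.

```python
import numpy as np

rng = np.random.default_rng(1)

CLASSES = ["cat", "dog", "car"]   # hypothetical fixed label set
NUM_PIXELS = 32 * 32 * 3

# Untrained linear classifier: one row of weights per class
W = rng.standard_normal((len(CLASSES), NUM_PIXELS)) * 0.01

def classify(image):
    scores = W @ image.ravel()            # one score per fixed class
    return CLASSES[int(np.argmax(scores))]  # highest-scoring class wins

image = rng.standard_normal((32, 32, 3))
label = classify(image)   # one of "cat", "dog", "car"
```

A trained classifier replaces the random weights with learned ones, but the interface is the same: image in, one of several fixed class labels out.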
Fast image classification is a critical primitive in numerous applications, including autonomous driving. Moreover, modern social media and photo-sharing and storage applications such as Google Photos make abundant use of image classification techniques to personalise and enhance the user experience.