Learning Structural Representations via Dynamic Object Landmarks Discovery for Sketch Recognition and Retrieval

IEEE Trans Image Process. 2019 Apr 19. doi: 10.1109/TIP.2019.2910398. Online ahead of print.


State-of-the-art methods for sketch classification and retrieval rely on deep convolutional neural networks to learn representations. Although deep neural networks can model images with hierarchical representations through convolution kernels, they cannot automatically extract the structural representations of object categories in a human-perceptible way. Furthermore, sketch images usually exhibit large-scale visual variations caused by drawing styles or viewpoints, which makes it difficult to develop generalized representations using the fixed computational mode of convolutional kernels. In this paper, our aim is to address the problem of the fixed computational mode in the feature extraction process without extra supervision. We propose a novel architecture that dynamically discovers object landmarks and learns discriminative structural representations. Our model is composed of two components: a representative landmark discovering module that localizes the key points on the object, and a category-aware representation learning module that develops category-specific features. Specifically, we develop a structure-aware offset layer to dynamically localize the representative landmarks, which is optimized based on the category labels without extra supervision. After that, a diversity branch is introduced to extract the global discriminative features for each category. Finally, we employ a multi-task loss function to develop an end-to-end trainable architecture. At testing time, we fuse the predictions obtained with different numbers of landmarks to produce the final results. Through extensive experiments, we compare our model with several state-of-the-art methods on two challenging datasets, TU-Berlin and Sketchy, for sketch classification and retrieval, and the experimental results demonstrate the effectiveness of our proposed model.
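The test-time fusion step can be illustrated with a minimal sketch. The abstract does not state the fusion rule, so the code below assumes a simple average of per-class probabilities across models that use different numbers of landmarks. The function names (`softmax`, `fuse_predictions`, `predict`) and the example logits are hypothetical and are not taken from the paper.

```python
import math


def softmax(logits):
    """Convert raw class scores into probabilities (numerically stable)."""
    m = max(logits)
    exps = [math.exp(x - m) for x in logits]
    total = sum(exps)
    return [e / total for e in exps]


def fuse_predictions(logits_per_model):
    """Average class probabilities over several models.

    Assumption: each inner list holds the class scores produced by one
    model variant (e.g. one landmark count). Plain averaging is an
    illustrative choice, not necessarily the rule used in the paper.
    """
    probs = [softmax(scores) for scores in logits_per_model]
    num_classes = len(probs[0])
    return [sum(p[c] for p in probs) / len(probs) for c in range(num_classes)]


def predict(logits_per_model):
    """Return the index of the class with the highest fused probability."""
    fused = fuse_predictions(logits_per_model)
    return max(range(len(fused)), key=fused.__getitem__)


# Hypothetical scores from three model variants over three classes.
example_logits = [
    [2.0, 1.0, 0.0],
    [0.5, 1.5, 0.0],
    [3.0, 0.0, 0.0],
]
fused_probs = fuse_predictions(example_logits)
best_class = predict(example_logits)
```

In this example the second variant alone would favour class 1, but the fused distribution favours class 0 because the other two variants agree on it. This is the kind of smoothing that combining several landmark configurations is meant to provide.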