DeepCompare: Visual and Interactive Comparison of Deep Learning Model Performance

IEEE Comput Graph Appl. 2019 Sep-Oct;39(5):47-59. doi: 10.1109/MCG.2019.2919033. Epub 2019 May 27.

Abstract

Deep learning models have become the state of the art for many tasks, from text sentiment analysis to facial image recognition. However, understanding why certain models perform better than others, or how one model learns differently from another, is often difficult yet critical for increasing their effectiveness, improving prediction accuracy, and enabling fairness. Traditional methods for comparing models' efficacy, such as accuracy, precision, and recall, provide a quantitative view of performance, but they hide the qualitative intricacies of why one model performs better than another. In this paper, we interview machine learning practitioners to understand their evaluation and comparison workflows. From there, we iteratively design a visual analytics approach, DeepCompare, to systematically compare the results of deep learning models, in order to provide insight into model behavior and to interactively assess the tradeoffs between two such models. The tool allows users to evaluate model results, identify and compare activation patterns for misclassifications, and link test results back to specific neurons. We conduct a preliminary evaluation through two real-world case studies, showing that experts can make more informed decisions about the effectiveness of different types of models, understand in more detail the strengths and weaknesses of the models, and holistically evaluate model behavior.
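
To make concrete the kind of comparison the abstract describes, the sketch below contrasts two classifiers on the same test set: it computes the standard aggregate metrics (accuracy, precision, recall) and then isolates the examples the models disagree on, which are the natural candidates for the per-example, qualitative inspection DeepCompare supports. This is a minimal sketch under our own assumptions, not the authors' implementation: the function name, toy labels, and scikit-learn calls are illustrative only.

# Minimal sketch (not the authors' implementation): compare two models'
# predictions on one test set, report aggregate metrics, and isolate the
# examples on which the models disagree. All names and data are hypothetical.
import numpy as np
from sklearn.metrics import accuracy_score, precision_score, recall_score

def compare_models(y_true, preds_a, preds_b):
    """Return aggregate metrics plus indices where only one model errs."""
    y_true, preds_a, preds_b = map(np.asarray, (y_true, preds_a, preds_b))

    def metrics(preds):
        return {
            "accuracy": accuracy_score(y_true, preds),
            "precision": precision_score(y_true, preds, average="macro", zero_division=0),
            "recall": recall_score(y_true, preds, average="macro", zero_division=0),
        }

    wrong_a = preds_a != y_true
    wrong_b = preds_b != y_true
    return {
        "model_a": metrics(preds_a),
        "model_b": metrics(preds_b),
        # Disagreement sets: starting points for qualitative inspection
        "only_a_wrong": np.flatnonzero(wrong_a & ~wrong_b),
        "only_b_wrong": np.flatnonzero(wrong_b & ~wrong_a),
        "both_wrong": np.flatnonzero(wrong_a & wrong_b),
    }

# Toy example: the two models err on different test items.
report = compare_models(
    y_true=[0, 1, 1, 0, 1, 0],
    preds_a=[0, 1, 0, 0, 1, 0],   # misclassifies index 2
    preds_b=[0, 1, 1, 1, 1, 0],   # misclassifies index 3
)
print(report["model_a"], report["only_a_wrong"], report["only_b_wrong"])

Aggregate numbers like these summarize performance, while the disagreement indices point to the specific misclassifications whose activation patterns a tool such as DeepCompare would then let an analyst examine and trace back to individual neurons.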