Methodology for Human-Centric Evaluation of Part-Prototype Models

17 May 2024

Authors:

(1) Omid Davoodi, Carleton University, School of Computer Science;

(2) Shayan Mohammadizadehsamakosh, Sharif University of Technology, Department of Computer Engineering;

(3) Majid Komeili, Carleton University, School of Computer Science.

Abstract and Intro

Background Information

Methodology

Prototype Interpretability

Prototype-query Similarity

Interpretability of the Decision-Making Process

The Effects of Low Prototype Counts

Discussions

Methodology

The interpretability of a prototype-based machine learning method as a whole is too broad a notion to evaluate effectively. As a result, our experiments were designed to measure human opinion of the individual properties required for an interpretable method. This focus on single aspects allowed us to gain fine-grained insight into how well human users understand and agree with the explanations of different methods.

Human annotators were recruited via Amazon Mechanical Turk. To ensure that workers had a good understanding of the task, they were required to complete a qualification test before participating. Furthermore, to ensure the quality of the responses and reduce noise, we inserted a few validation samples into our tasks and excluded workers who did not pass them.
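As a rough illustration of this filtering step, the sketch below drops all responses from workers who miss any validation question. The field names (worker_id, question_id, answer) and the answer key are hypothetical, not taken from the original study.

```python
# Minimal sketch of validation-based response filtering (assumed data format).
import csv

VALIDATION_ANSWERS = {   # question_id -> expected answer (hypothetical key)
    "val_01": "B",
    "val_02": "D",
}

def load_responses(path):
    with open(path, newline="") as f:
        return list(csv.DictReader(f))

def passes_validation(worker_rows):
    """A worker passes only if every validation question is answered correctly."""
    for row in worker_rows:
        expected = VALIDATION_ANSWERS.get(row["question_id"])
        if expected is not None and row["answer"] != expected:
            return False
    return True

def filter_workers(rows):
    by_worker = {}
    for row in rows:
        by_worker.setdefault(row["worker_id"], []).append(row)
    kept = {w: r for w, r in by_worker.items() if passes_validation(r)}
    # Keep only the non-validation answers from workers who passed.
    return [row for r in kept.values() for row in r
            if row["question_id"] not in VALIDATION_ANSWERS]
```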

We trained seven methods on three different datasets. These methods include six part-prototype-based classification methods (ProtoPNet[9], Deformable ProtoPNet[18], SPARROW[17], ProtoTree[19], TesNet[20], and ProtoPool[21]) and one unsupervised method (ACE[22]). ACE is normally used to extract concepts from a dataset. Each concept is a part of a training sample and is therefore analogous to a prototype in part-prototype-based methods. As a result, we included it in our experiments both as a representative of unsupervised approaches to this problem and as a baseline.

The datasets on which we trained were CUB-200[23], Stanford Cars[24], and a subset of the ImageNet dataset[25] consisting of 7 classes of different felines and canines. The details of our ImageNet subset are included in the appendix. All datasets were augmented using the same process described in the work by Chen et al.[9]. We used the public code repositories released by the authors of these methods in all cases except SPARROW, whose authors kindly agreed to give us access to their non-public codebase. We used the hyperparameters provided by these codebases, changing values only when needed for different datasets. In particular, we used 10 prototypes per class when training on datasets for which the original codebase did not provide hyperparameters, and in the special case of ProtoPool, we used 12 prototypes in total for the ImageNet dataset. We trained all classification methods until they reached at least 70% top-1 classification accuracy on the dataset. For ImageNet, our criterion was higher, at 90%, due to the lower number of classes and the fact that the backbone networks were originally pretrained on ImageNet. All methods used ResNet-18 as their backbone network. Note that ProtoPool failed to reach the 70% accuracy level on the Cars dataset, reaching a top-1 accuracy of only 49.53%.
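To make the accuracy-based stopping criterion concrete, the sketch below assumes a standard PyTorch setup; `model`, `train_one_epoch`, and the data loaders are placeholders rather than the authors' actual training code, and the thresholds (0.70, or 0.90 for ImageNet) come from the paragraph above.

```python
# Minimal sketch: train each method until it reaches a top-1 accuracy threshold.
import torch

def top1_accuracy(model, loader, device="cuda"):
    model.eval()
    correct = total = 0
    with torch.no_grad():
        for images, labels in loader:
            images, labels = images.to(device), labels.to(device)
            # Assumes the model returns class logits; part-prototype models that
            # return extra outputs would need the logits extracted first.
            preds = model(images).argmax(dim=1)
            correct += (preds == labels).sum().item()
            total += labels.numel()
    return correct / total

def train_until_threshold(model, train_loader, test_loader,
                          train_one_epoch, threshold=0.70, max_epochs=200):
    acc = 0.0
    for epoch in range(max_epochs):
        train_one_epoch(model, train_loader)   # method-specific training step
        acc = top1_accuracy(model, test_loader)
        if acc >= threshold:                   # 0.70 for CUB/Cars, 0.90 for ImageNet
            return epoch, acc
    return max_epochs, acc
```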

The next step was to gather a suite of query samples from the datasets and generate the explanations produced by the methods for classifying those samples. For CUB and Cars, we used the first image found in the directory of each class. For ImageNet, we used the first 15 images in order to obtain enough samples for later use.
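A small sketch of this selection step is shown below, assuming the common layout of one sub-folder per class; the helper name and the interpretation of "first 15 images" as a per-class count for the ImageNet subset are our assumptions, not taken from the authors' code.

```python
# Minimal sketch of query-sample selection (one sub-folder per class assumed).
import os

IMAGE_EXTS = (".jpg", ".jpeg", ".png")

def select_queries(dataset_root, n_per_class=1):
    # n_per_class = 1 for CUB/Cars; 15 for the ImageNet subset (assumed per class).
    queries = {}
    for class_dir in sorted(os.listdir(dataset_root)):
        class_path = os.path.join(dataset_root, class_dir)
        if not os.path.isdir(class_path):
            continue
        images = sorted(f for f in os.listdir(class_path)
                        if f.lower().endswith(IMAGE_EXTS))
        queries[class_dir] = [os.path.join(class_path, f)
                              for f in images[:n_per_class]]
    return queries
```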

At this stage, we noticed that the heatmaps typically provided by many of these methods to denote the location of a prototype or its activation on a query image tended to occlude a large portion of the image itself. Some methods, such as ProtoPNet, instead draw a rectangle to denote the general boundary of the prototype or its activation. We found this approach quite misleading because, in some cases, the rectangle covered considerably more area than the actual activation site. To address this, we opted for a more fine-grained approach in which we drew boundaries around the regions of the activation heatmap with activation values above 70%. Figure 3 illustrates examples of complex activation sites, failures of the original ProtoPNet region-selection approach, and our approach to addressing the problem.
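The sketch below illustrates one way such boundaries can be drawn, assuming the method provides an activation heatmap resized to the query image; reading the "above 70%" criterion as relative to the heatmap's maximum is our assumption, and the function name and color choices are illustrative only.

```python
# Minimal sketch: trace contours around high-activation regions of a heatmap
# instead of drawing a single bounding rectangle.
import cv2
import numpy as np

def draw_activation_boundary(image_bgr, heatmap, rel_threshold=0.70,
                             color=(0, 255, 255), thickness=2):
    # Normalize the heatmap to [0, 1]; keep only strongly activated pixels.
    hm = (heatmap - heatmap.min()) / (heatmap.max() - heatmap.min() + 1e-8)
    mask = (hm >= rel_threshold).astype(np.uint8)

    # Contours follow the actual shape of the activation site, so the overlay
    # does not cover more of the image than the activated region itself.
    contours, _ = cv2.findContours(mask, cv2.RETR_EXTERNAL,
                                   cv2.CHAIN_APPROX_SIMPLE)
    out = image_bgr.copy()
    cv2.drawContours(out, contours, -1, color, thickness)
    return out
```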

This paper is available on arxiv under CC 4.0 license.