Explaining deep models has become easier with the introduction of path attribution methods, a family of gradient-based techniques. These methods require a hyperparameter called the baseline input. What does the baseline input mean, and why does it matter? We will discuss this in detail. As a case study, we will use image classification networks and compare different ways of choosing a baseline input. Each baseline implies its own hypotheses about the data. The discussion of baselines also has a close connection with the concept of missingness in the feature space, a recurring idea in interpretability research.
If you have worked with trained neural networks, you may have used the integrated gradients method. It computes which features of a particular data point were important for a specific prediction, and it has been applied to many data types, from retinal fundus images to ECG recordings.
This makes a data scientist wonder: how sensitive are integrated gradients to the choice of this hyperparameter? Is a constant black image really a “natural baseline” for image data? What alternative choices exist? To answer these questions, we will explore several notions of missingness.
Plotting Integrated Gradients Attributions for Image Classification
We start with images because integrated gradients attributions can be plotted and compared against our visual intuition about which pixels should matter. As a case study, we use the Inception V4 architecture, trained on the ImageNet dataset to determine which class an image belongs to, with pre-trained weights downloaded from TensorFlow-Slim, and we visualize attributions on images from the validation set. When plotting, we clip attributions at the 99th percentile so that a few high-magnitude attributions do not dominate the colour scheme.
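As a minimal sketch of that clipping step (assuming `attributions` is a NumPy array of per-pixel scores; the function name is ours, not from the original article):

```python
import numpy as np

def clip_for_display(attributions, percentile=99):
    """Clip attribution magnitudes at a percentile so a few extreme
    values do not dominate the colour scheme when plotted."""
    magnitudes = np.abs(attributions)
    vmax = np.percentile(magnitudes, percentile)
    # Scale into [0, 1], saturating everything above the percentile.
    return np.clip(magnitudes / (vmax + 1e-12), 0.0, 1.0)
```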
A Better Understanding of Integrated Gradients
Assessing the attribution maps by eye alone can be unintuitive, so it helps to understand how the feature attributions are generated. Integrated gradients defines an importance value for each feature by accumulating local gradients over interpolated images that range between the baseline input and the current input.
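Concretely, for a model f, an input x, and a baseline x′, the integrated gradients attribution of the i-th feature is the standard path integral from the integrated gradients paper:

```latex
\mathrm{IG}_i(x) = (x_i - x'_i) \int_0^1
  \frac{\partial f\big(x' + \alpha (x - x')\big)}{\partial x_i} \, d\alpha
```

The multiplicative factor (x_i − x′_i) is worth noticing: wherever a pixel equals the baseline, its attribution is forced to zero.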
Why not use plain gradients? Unfortunately, several problems arise when using raw gradients to interpret deep neural networks (DNNs). One specific issue is saturation: the gradients of the input features have small magnitudes around a sample because the network function flattens once the features pass a certain magnitude. Integrating gradients along the interpolation path avoids this, since the path also passes through the region where the function is still changing.
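A toy sketch makes the saturation problem concrete. Assume a one-dimensional function f(x) = min(x, 1) that flattens at 1, with a zero baseline; the names and the finite-difference gradient are ours, for illustration only:

```python
import numpy as np

# A function that saturates: flat for x >= 1.
f = lambda x: np.minimum(x, 1.0)

def grad(x, eps=1e-4):
    # Central finite-difference approximation of df/dx.
    return (f(x + eps) - f(x - eps)) / (2 * eps)

def integrated_gradient(x, baseline=0.0, steps=200):
    # Riemann-sum approximation of the path integral of gradients
    # along the straight line from the baseline to the input.
    alphas = (np.arange(steps) + 0.5) / steps
    path_grads = grad(baseline + alphas * (x - baseline))
    return (x - baseline) * path_grads.mean()

x = 2.0
print(grad(x))                 # ~0.0: the local gradient has saturated
print(integrated_gradient(x))  # ~1.0: matches f(x) - f(baseline)
```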
Game Theory and Missingness
Integrated gradients is derived from cooperative game theory, in particular from the Aumann-Shapley value. In cooperative game theory, a non-atomic game is a construction used to model large-scale economic systems with a continuum of participants, and the Aumann-Shapley value is a principled way to determine how much different groups of participants contribute to the system. In this game-theoretic setting, the notion of missingness is well-defined.
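For reference, one standard way to write the Aumann-Shapley cost share of participant i, for a cost function v and participation profile x (notation varies across texts), is:

```latex
\phi_i(v, x) = x_i \int_0^1 \frac{\partial v}{\partial x_i}(t \, x) \, dt
```

Substituting the network f for v and allowing a non-zero starting point x′ recovers the integrated gradients formula above.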
Are There Any Alternatives to the Baseline Choice?
Any constant colour baseline faces a blindness problem: the interpolation path never changes pixels that already match the baseline colour, so those pixels receive no attribution. Are there possible substitutes? There are four, described below, each followed by a short sketch of how it might be constructed -
- The Maximum Distance Baseline
Constant baselines are unable to distinguish pixels of the baseline colour; they are blind to them. One solution is to construct the baseline by taking the farthest image from the input in L1 distance that still lies in the valid pixel range. This is known as the maximum distance baseline, sketched below.
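A minimal sketch, assuming float images scaled to [0, 1]: because the L1 distance decomposes per pixel, each pixel independently jumps to whichever end of the valid range is farther away.

```python
import numpy as np

def maximum_distance_baseline(image, low=0.0, high=1.0):
    """Return the image in [low, high] farthest from `image` in L1 distance."""
    midpoint = (low + high) / 2.0
    # Pixels above the midpoint are farthest from `low`, and vice versa.
    return np.where(image > midpoint, low, high).astype(image.dtype)
```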
- The Blurred Baseline
The maximum distance baseline suffers from an issue of its own: it doesn’t really represent missingness, because it contains a lot of information about the original image. An alternative is to make the prediction relative to a baseline from which the original information has been removed, presented as a blurred version of the image; a sketch follows.
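A sketch using SciPy's Gaussian filter, assuming an H × W × C float image; the kernel width sigma is a free parameter, and blurring is applied only to the spatial axes:

```python
from scipy.ndimage import gaussian_filter

def blurred_baseline(image, sigma=20.0):
    """Use a heavily blurred copy of the input as the baseline."""
    # Blur height and width, but not the colour channels.
    return gaussian_filter(image, sigma=(sigma, sigma, 0))
```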
- The Uniform Baseline
A blurred baseline is biased towards highlighting high-frequency information, since fine detail is exactly what the blur removes. To avoid this bias, we can define missingness another way: sample a random uniform image in the valid pixel range and refer to that as the baseline, as sketched below.
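A sketch of the uniform baseline, again assuming float images in [0, 1]:

```python
import numpy as np

def uniform_baseline(image, low=0.0, high=1.0, rng=None):
    """Sample a baseline uniformly at random from the valid pixel range."""
    rng = np.random.default_rng() if rng is None else rng
    return rng.uniform(low, high, size=image.shape).astype(image.dtype)
```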
- The Gaussian Baseline
The uniform distribution is not the only way to draw random noise for a baseline. You can also draw the baseline from a Gaussian distribution centred on the current image with standard deviation σ. Note that σ plays two different roles here: it sets the width of the smoothing kernel for the blurred baseline and the standard deviation of the noise for the Gaussian baseline. A sketch follows.
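A sketch of the Gaussian baseline; the clipping back into the valid pixel range is our assumption, to keep the sampled baseline a legal image:

```python
import numpy as np

def gaussian_baseline(image, sigma=0.3, low=0.0, high=1.0, rng=None):
    """Draw a baseline from a Gaussian centred on the input image."""
    rng = np.random.default_rng() if rng is None else rng
    noisy = image + rng.normal(0.0, sigma, size=image.shape)
    # Clip so the sampled baseline stays a valid image.
    return np.clip(noisy, low, high).astype(image.dtype)
```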
Averaging over multiple baselines and the dangers of qualitative assessment
Any single baseline, including one drawn at random, can still be blind to the particular values it happens to contain. To overcome this, it is better to average over multiple baselines: draw several samples from the chosen baseline distribution, compute attributions against each one, and average the resulting importance scores, as sketched below.
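A sketch of the averaging step. It assumes an `integrated_gradients(x, baseline)` function like the toy one earlier and a `sample_baseline()` callable that draws from whichever baseline distribution we chose; both are passed in, so any of the baselines above can be plugged in.

```python
import numpy as np

def averaged_attributions(x, sample_baseline, integrated_gradients,
                          n_samples=32):
    """Average attributions over several sampled baselines to reduce
    blindness to any single draw."""
    samples = [integrated_gradients(x, baseline=sample_baseline())
               for _ in range(n_samples)]
    return np.mean(samples, axis=0)
```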
Qualitative assessment is dangerous because it leans on human knowledge of the relationship between the data and the labels, and that knowledge can be naive or simply wrong about what an accurate model actually relies on. When we judge attribution maps by how they look, we are really evaluating which method best matches our expectations, not which method best explains the network's behaviour.
Conclusion
So what can be done? We have multiple baselines and no conclusive results. Still, comparing all of the feature attribution baselines mentioned above, both qualitatively and quantitatively, gives us a foundation for understanding them. Each baseline encodes an assumption about missingness in the model and in the distribution of the data. We need to define this quantity explicitly because most ML models cannot handle arbitrary patterns of missing inputs.
Reference Links
https://distill.pub/2020/attribution-baselines/
https://github.com/distillpub/post--attribution-baselines