Permutation Feature Importance (PFI)
PFI – sometimes called Model Reliance (Fisher, Rudin, and Dominici 2019) – quantifies the importance of a feature by measuring the change in predictive error incurred when permuting its values for a collection of instances (Breiman 2001).
It communicates global (with respect to the entire explained model) feature importance.
PFI was originally introduced for Random Forests (Breiman 2001) and later generalised to a model-agnostic technique under the name of Model Reliance (Fisher, Rudin, and Dominici 2019).
Property | Permutation Feature Importance |
---|---|
relation | post-hoc |
compatibility | model-agnostic |
modelling | regression, crisp and probabilistic classification |
scope | global (per data set; generalises to cohort) |
target | model (set of predictions) |
data | tabular |
features | numerical and categorical |
explanation | feature importance (numerical reporting, visualisation) |
caveats | feature correlation, model’s goodness of fit, access to data labels, robustness (randomness of permutation) |
\[ I_{\textit{PFI}}^{j} = \frac{1}{N} \sum_{i = 1}^N \frac{\overbrace{\mathcal{L}\left(f\left(X^{(j,\, i)}\right), Y\right)}^{\text{loss with feature } j \text{ permuted}}}{\underbrace{\mathcal{L}\left(f(X), Y\right)}_{\text{original loss}}} \]
where \(X^{(j,\, i)}\) denotes the data \(X\) with the values of feature \(j\) randomly permuted in the \(i\)th of \(N\) repetitions, \(Y\) are the true labels and \(\mathcal{L}\) is the chosen loss function.
Some choices are the ratio-based formulation given above and a difference-based variant, which subtracts the original loss from the permuted loss instead of dividing by it; the loss function \(\mathcal{L}\) itself is another choice, e.g., zero-one loss for classification or mean squared error for regression.
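A minimal sketch of this procedure is given below. It assumes a fitted scikit-learn-style model exposing `predict` and a `loss(y_true, y_pred)` function; the names `permutation_feature_importance` and `variant` are placeholders rather than part of any particular library.

```python
import numpy as np

def permutation_feature_importance(
        model, X, y, loss, n_repeats=10, variant="ratio", random_state=None):
    """Illustrative ratio- or difference-based PFI (placeholder implementation)."""
    rng = np.random.default_rng(random_state)
    X = np.asarray(X)
    baseline = loss(y, model.predict(X))      # L(f(X), Y)

    importance = np.zeros(X.shape[1])
    for j in range(X.shape[1]):               # one importance score per feature
        scores = []
        for _ in range(n_repeats):            # N permutation repetitions
            X_permuted = X.copy()
            X_permuted[:, j] = rng.permutation(X_permuted[:, j])
            permuted = loss(y, model.predict(X_permuted))  # L(f(X^(j, i)), Y)
            scores.append(permuted / baseline if variant == "ratio"
                          else permuted - baseline)
        importance[j] = np.mean(scores)
    return importance
```

With, e.g., `sklearn.metrics.zero_one_loss` for a classifier or `sklearn.metrics.mean_squared_error` for a regressor, ratios above 1 (or differences above 0) indicate features that the model relies upon.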
This explanation communicates how the model relies on data features during training, but not necessarily how the features influence its predictions for unseen instances. The model may learn a relationship between a feature and the target variable that is merely a quirk of the training data – a random pattern present only in this data sample – which, e.g., due to overfitting, adds some extra performance just for predicting the training data.
The spurious correlations between data features and the target that are found uniquely in the training data, or extracted due to overfitting, are absent from the test data (previously unseen by the model). Computing PFI on the test data therefore communicates how useful each feature is for predicting the target, and whether some of the data features merely contributed to overfitting.
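For example, scikit-learn's `permutation_importance` (listed among the implementations below) can compute PFI on a held-out test set; the data set, model and parameter values used here are illustrative assumptions rather than the example analysed in the text.

```python
from sklearn.datasets import load_wine
from sklearn.ensemble import RandomForestClassifier
from sklearn.inspection import permutation_importance
from sklearn.model_selection import train_test_split

X, y = load_wine(return_X_y=True)
X_train, X_test, y_train, y_test = train_test_split(X, y, random_state=42)

model = RandomForestClassifier(random_state=42).fit(X_train, y_train)

# PFI on the (previously unseen) test data shows which features genuinely
# help to predict the target...
pfi_test = permutation_importance(
    model, X_test, y_test, n_repeats=10, random_state=42)
# ...whereas PFI on the training data may also reward features that merely
# contributed to overfitting.
pfi_train = permutation_importance(
    model, X_train, y_train, n_repeats=10, random_state=42)

print(pfi_test.importances_mean)
print(pfi_train.importances_mean)
```

Note that scikit-learn reports the (difference-based) mean decrease of the model's score rather than the loss ratio given by the formula above.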
We can measure feature importance with alternative techniques such as Partial Dependence-based feature importance. However, this metric may not pick up the random feature’s lack of predictive power since Partial Dependence generates unrealistic instances that can follow the spurious pattern found in the training data.
Since the underlying predictive model (the one being explained) is a Decision Tree, we have access to its native estimate of feature importance. It conveys the overall decrease in the chosen impurity metric for all splits based on a given feature, by default calculated over the training data.
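For reference, scikit-learn exposes this native, impurity-based estimate through the `feature_importances_` attribute of its tree models; the data set below is merely an illustrative stand-in for the one analysed in the text.

```python
from sklearn.datasets import load_wine
from sklearn.tree import DecisionTreeClassifier

X, y = load_wine(return_X_y=True)
tree = DecisionTreeClassifier(random_state=42).fit(X, y)

# Normalised total decrease of the impurity metric attributed to each
# feature, computed from the data used to grow the tree.
print(tree.feature_importances_)
```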
Assumes feature independence, which is often unreasonable
May not reflect the true feature importance since it is based upon the predictive ability of the model for unrealistic instances
In the presence of feature interactions, the importance that one of the attributes would accumulate on its own may be distributed across all of them in an arbitrary fashion (pushing them down the order of importance)
Since it accounts for both individual and interaction importance, the latter component is counted multiple times, making the sum of the scores inconsistent with (larger than) the drop in predictive performance (for the difference-based variant)
PFI is parameterised by the number of permutation repetitions, the loss (or score) function used to measure the model's predictive performance, and the data set on which it is computed.
Generating PFI may be computationally expensive for large data sets and a high number of repetitions
Computational complexity: \(\mathcal{O} \left( n \times d \right)\), where \(n\) is the number of permutation repetitions and \(d\) is the number of explained features.
Many data-driven predictive models come equipped with some variant of feature importance. This includes Decision Trees and Linear Models among many others.
Partial Dependence captures the average response of a predictive model for a collection of instances when varying one of their features (Friedman 2001). By assessing the flatness of these curves we can derive a feature importance measurement (Greenwell, Boehmke, and McCarthy 2018).
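One way to operationalise this flatness measure, following Greenwell, Boehmke, and McCarthy (2018), is to take the standard deviation of each feature's Partial Dependence curve; the sketch below assumes a fitted scikit-learn regressor and an illustrative data set.

```python
import numpy as np
from sklearn.datasets import load_diabetes
from sklearn.ensemble import GradientBoostingRegressor
from sklearn.inspection import partial_dependence

X, y = load_diabetes(return_X_y=True)
model = GradientBoostingRegressor(random_state=42).fit(X, y)

# The flatter a feature's Partial Dependence curve, the less important the
# feature; flatness is quantified here as the curve's standard deviation.
pd_importance = [
    np.std(partial_dependence(model, X, features=[j])["average"])
    for j in range(X.shape[1])]
print(pd_importance)
```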
SHapley Additive exPlanations (SHAP) explains a prediction of a selected instance by using Shapley values to compute the contribution of each individual feature to this outcome (Lundberg and Lee 2017). It comes with various aggregation mechanisms that allow individual explanations to be transformed into global, model-based insights such as feature importance.
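One such aggregation is the mean absolute Shapley value per feature; the snippet below assumes that `shap_values` is an (instances × features) matrix of per-instance Shapley values, e.g., produced by the `shap` package, and only shows the aggregation step.

```python
import numpy as np

def shap_global_importance(shap_values):
    # Averaging the magnitude of per-instance feature contributions yields
    # a global feature importance ranking.
    return np.mean(np.abs(shap_values), axis=0)
```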
Local Interpretable Model-agnostic Explanations (LIME) is a surrogate explainer that fits a linear model to data (expressed in an interpretable representation) sampled in the neighbourhood of an instance selected to be explained (Ribeiro, Singh, and Guestrin 2016). This local, inherently transparent model simplifies the black-box decision boundary in the selected sub-space, making it human-comprehensible. Given that these explanations are based on the coefficients of the surrogate linear model, they can also be interpreted as (interpretable) feature importance.
Python | R |
---|---|
scikit-learn (>=0.24.0) | iml |
alibi | vip |
Skater | DALEX |
rfpimp | |