Floodgate: inference for model-free variable importance
Many modern applications seek to understand the relationship between an outcome variable $Y$ and a covariate $X$ in the presence of a (possibly high-dimensional) confounding variable $Z$. Although much attention has been paid to testing \emph{whether} $Y$ depends on $X$ given $Z$, in this paper we seek to go beyond testing by inferring the \emph{strength} of that dependence. We first define our estimand, the minimum mean squared error (mMSE) gap, which quantifies the conditional relationship between $Y$ and $X$ in a way that is deterministic, model-free, interpretable, and sensitive to nonlinearities and interactions. We then propose a new inferential approach called \emph{floodgate} that can leverage any working regression function chosen by the user (which may, e.g., be fitted by a state-of-the-art machine learning algorithm or derived from qualitative domain knowledge) to construct asymptotic confidence bounds, and we apply it to the mMSE gap. We additionally show that floodgate's accuracy (the distance from the confidence bound to the estimand) is adaptive to the error of the working regression function. We then show that the same floodgate principle can be applied to a different measure of variable importance when $Y$ is binary. Finally, we demonstrate floodgate's performance in a series of simulations and apply it to data from the UK Biobank to infer the strengths of dependence of platelet count on various groups of genetic mutations.
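To make the estimand concrete, the following is a minimal sketch, assuming the mMSE gap is read as the drop in the best achievable mean squared error for predicting $Y$ when $X$ is added to $Z$, i.e. $E[(Y - E[Y \mid Z])^2] - E[(Y - E[Y \mid X, Z])^2]$. The toy linear model, the function name `mmse_gap_toy`, and its parameters are hypothetical illustrations and are not taken from the paper; the sketch only evaluates the estimand by Monte Carlo and does not implement the floodgate confidence bound.

```python
import random


def mmse_gap_toy(beta=2.0, n=200_000, seed=0):
    """Monte Carlo estimate of the mMSE gap in a toy model (hypothetical).

    Assumed toy model, not from the paper: X, Z, eps are independent N(0, 1)
    and Y = beta * X + Z + eps. In this model the two conditional means are
    known in closed form:
        E[Y | Z]    = Z
        E[Y | X, Z] = beta * X + Z
    so the gap between the two minimum MSEs equals beta**2 * Var(X) = beta**2.
    """
    rng = random.Random(seed)
    sse_without_x = 0.0  # squared error of the best predictor using Z only
    sse_with_x = 0.0     # squared error of the best predictor using X and Z
    for _ in range(n):
        x = rng.gauss(0.0, 1.0)
        z = rng.gauss(0.0, 1.0)
        eps = rng.gauss(0.0, 1.0)
        y = beta * x + z + eps
        sse_without_x += (y - z) ** 2
        sse_with_x += (y - (beta * x + z)) ** 2
    # Difference of the two estimated minimum mean squared errors.
    return sse_without_x / n - sse_with_x / n


gap = mmse_gap_toy()
print(f"estimated mMSE gap: {gap:.3f} (theoretical value in this toy model: 4)")
```

In this toy setting the gap grows with `beta`, and it is zero when `beta` is zero, which matches the intended reading of the estimand as a measure of how strongly $Y$ depends on $X$ given $Z$.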