
Influence Functions

Recently I’ve been trying to wrap my head around the statistical machinery behind causal inference. I really like Edward Kennedy’s distinction: once we’ve drawn our DAG and applied the backdoor criterion, the causal inference part is over and the estimation part has begun. So really, I’ve been trying to wrap my mind around semi-parametric estimation theory.

There are many great mathematical tutorials out there, and while I am technically a math major, my days are spent coding in R and writing Bayesian models in numpyro, so let’s just say I’m pretty rusty.

Before we dive in, let’s establish some basic foundations.

Why?

Why do we care about influence functions? Well, traditional statistical inference is based on writing down a probability model for the data, \(P_\theta\), indexed by some parameter (or vector of parameters) \(\theta\). For instance, suppose we want to estimate the mean of a sample \(X_1, X_2, \ldots, X_n\) where \(X_i \sim P_\theta\).

Classical Approach

We say that \(P_\theta = N(\theta,1)\), for instance, making a parametric assumption. Then we can appeal to linearity of expectation to compute

\[\mathbb{E}(\frac{1}{n} \sum_iX_i) = \theta\]

and to the i.i.d. assumption to compute the variance,

\[\mathbb{V}(\frac{1}{n}\sum_iX_i) = \frac{1}{n}\]

Now we have an estimator for the mean that is unbiased and has variance that decreases at a rate of \(1/n\).
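
To make this concrete, here is a minimal simulation sketch in base R. The values \(\theta = 2\), \(n = 100\), and the number of replications are arbitrary choices of mine for illustration, not anything from the theory.

```r
# Simulation sketch: check that the sample mean of N(theta, 1) data is
# unbiased for theta and has variance close to 1/n.
# theta, n, and sims are arbitrary illustrative choices.
set.seed(1)
theta <- 2
n     <- 100
sims  <- 10000

# Draw `sims` datasets of size n and record each sample mean
means <- replicate(sims, mean(rnorm(n, mean = theta, sd = 1)))

mean(means)  # close to theta = 2 (unbiased)
var(means)   # close to 1/n = 0.01
```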

Non-parametric Approach

In order to compute the variance we had to appeal to the underlying distribution of the data, namely the normal distribution. If we had instead assumed, say, a Poisson distribution, the variance would have been different (although it would still shrink at a \(1/n\) rate). Let’s say we want to assume nothing about the underlying distribution. We will simply consider the mean as a function of the underlying distribution, given by

\[\psi(P) = \int y \, dP(y)\]

Notice that here we introduced \(\psi\) to refer to an arbitrary functional of the distribution. Since distributions are themselves functions, functions of distributions are referred to as functionals. This may look a little strange at first if you are used to parametric theory (where measures don’t often come up), but think of this as a function that takes a distribution (not a random variable) and computes the expectation of that distribution by integrating over the support of that distribution (weighting each element of the support by its probability).
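
As a toy illustration (entirely my own, using a made-up discrete distribution on the support \(\{1, 2, 3\}\)), here is what it looks like to feed a distribution, rather than a sample, into the mean functional:

```r
# The mean functional for a discrete distribution: it takes the
# distribution itself (support points plus their probabilities), not a
# random sample, and returns sum over y of y * P(y).
psi <- function(support, probs) sum(support * probs)

# A made-up distribution on {1, 2, 3}
psi(support = c(1, 2, 3), probs = c(0.2, 0.5, 0.3))  # 2.1
```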

The “truth” requires some reframing from classical parametric statistics. Instead of having a true parameter, we now have a true data-generating distribution. This is usually referred to as \(P_0\). Therefore our true functional is given by

\[\psi(P_0) = \int y \, dP_0(y)\] which represents the functional of the true underlying data distribution. This may look a little weird at first if you are coming from parametric land. You might be looking for some \(\theta\) that is the true parameter value, but remember: the truth is now a distribution, not a parameter.

So what is the equivalent of the parametric assumption that our observed data follow some distribution? Well, it’s simply that we treat \(P_n\), the empirical distribution of the data, as if it were \(P_0\). If the truth is \(\psi(P_0)\) then our estimate of the truth is \(\psi(P_n)\). This is where the term plug-in estimator comes from: we have plugged in the empirical distribution \(P_n\), hoping that it is close to the true distribution \(P_0\).
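
Here is a small sketch of the plug-in idea (the \(N(2,1)\) data are simulated purely for illustration): the empirical distribution \(P_n\) puts mass \(1/n\) on each observed value, so evaluating the same mean functional at \(P_n\) just returns the sample mean.

```r
# Plug-in estimator sketch: psi evaluated at the empirical distribution,
# which puts probability 1/n on each observation.
psi <- function(support, probs) sum(support * probs)

set.seed(1)
x <- rnorm(50, mean = 2, sd = 1)  # illustrative data

psi(support = x, probs = rep(1 / length(x), length(x)))  # psi(P_n)
mean(x)                                                  # identical
```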

Assessing the quality of \(\psi(P_n)\)

Now that we have our estimate \(\psi(P_n)\) of the truth \(\psi(P_0)\), we can ask things like “how far away is the estimate from the truth?” Normally, as shown above, we could take the expected value of our estimate under some parametric distribution (like \(N(\theta,1)\)) and show that it is unbiased for the parameter of interest. However, since we have made no parametric assumptions, we can’t follow this paradigm.

Instead we look at the following difference,

\[\psi(P_n) - \psi(P_0)\]

which we know from above is simply,

\[\frac{1}{n}\sum_i X_i - \int x \, dP_0(x)\]

Because \(P_n\) is empirical, based on a finite number of samples, we should never expect that \(P_n\) exactly equals \(P_0\). Instead of trying to compute the bias of some parameter of the distribution, we ask: “as \(n\) grows, does \(\psi(P_n)\) get closer to \(\psi(P_0)\), and at what rate?”

That is, we look at the limiting distribution (as \(n \rightarrow \infty\)) of

\[\sqrt{n}(\psi(P_n) - \psi(P_0))\]

In the case of the sample mean we have,

\[\sqrt{n}\left(\frac{1}{n}\sum_i X_i - \psi(P_0)\right)\]

\[= \sqrt{n}\left(\frac{1}{n}\sum_i [X_i - \psi(P_0)]\right)\] and since \(\mathbb{E}[X_i] = \psi(P_0)\), we can appeal to the classical CLT to show that \[\sqrt{n}\left(\frac{1}{n}\sum_i [X_i - \psi(P_0)]\right) \rightarrow N(0,\mathbb{V}(X_i))\]

Meaning that our plug-in estimator \(\psi(P_n)\) converges to the true functional \(\psi(P_0)\), and so is asymptotically unbiased with asymptotic variance given by \(\mathbb{V}(X_i)\).
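
As a quick sanity check (my own sketch, with an Exponential(1) data-generating distribution chosen deliberately so that nothing is normal to begin with, and for which \(\psi(P_0) = \mathbb{V}(X_i) = 1\)), we can simulate the scaled difference and watch it settle into a normal shape with the right variance:

```r
# Simulate sqrt(n) * (psi(P_n) - psi(P_0)) under Exponential(1) data,
# where psi(P_0) = 1 and V(X_i) = 1, and inspect its distribution.
set.seed(1)
n    <- 500
sims <- 10000

scaled_diffs <- replicate(sims, sqrt(n) * (mean(rexp(n, rate = 1)) - 1))

mean(scaled_diffs)               # close to 0
var(scaled_diffs)                # close to V(X_i) = 1
hist(scaled_diffs, breaks = 50)  # roughly bell-shaped
```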

This may seem extremely circular. We started with some non-parametric assumptions that led us right back to where any parametric person would start: the CLT. However, this was just an example of how to prove this for the mean using classical statistics. Let’s now show it using real semi-parametric machinery.

Influence Function

So far, we haven’t even mentioned influence functions. Everything has just been laying the groundwork for semi-parametric estimation. Now we can delve into how we characterize the distance between \(\psi(P_n)\) and \(\psi(P_0)\) using influence functions.

To start we do something a little weird. We imagine that we have “contaminated” our data distribution with some other distribution \(\tilde{P}\), mixed in with weight \(\epsilon\). That is,

\[P_{\epsilon} = (1-\epsilon)P + \epsilon \tilde{P}\] where \(\tilde{P}\) is the contaminating distribution.

Because the mean functional is linear in the distribution it is applied to, we can write \[\psi(P_{\epsilon}) = \psi((1-\epsilon)P) + \psi(\epsilon \tilde{P})\] \[\psi(P_{\epsilon}) = (1-\epsilon)\psi(P) + \epsilon\psi(\tilde{P})\]

\[\psi(P_{\epsilon}) = \psi(P)+ \epsilon[\psi(\tilde{P})- \psi(P) ]\]

and therefore

\[\frac{d}{d\epsilon} \psi(P_{\epsilon}) = \psi(\tilde{P}) - \psi(P)\] captures the distance between \(\tilde{P}\) and \(P\), as measured through the functional \(\psi\).

In particular, if we take \(P = P_0\) and let \(\tilde{P}\) be a point mass at a particular observation \(y\), written \(\textbf{1}(y)\), we recover the influence function of a single observation on the mean,

\[y - \psi(P_0).\]
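
A minimal numerical check of this (the simulated \(N(2,1)\) data, the contamination point \(y = 5\), and the step size \(\epsilon\) are all arbitrary choices of mine): contaminate the empirical distribution with a point mass at \(y\), look at the slope of \(\psi(P_\epsilon)\) in \(\epsilon\), and compare it to \(y - \psi(P)\).

```r
# Numerical influence function of the mean: mix a point mass at y into
# the (empirical) distribution with weight eps and differentiate psi
# with respect to eps.
set.seed(1)
x   <- rnorm(100, mean = 2, sd = 1)  # illustrative data
y   <- 5                             # contamination point
eps <- 1e-4

# For the mean, psi of the contaminated distribution is a weighted average
psi_eps <- (1 - eps) * mean(x) + eps * y

(psi_eps - mean(x)) / eps  # numerical derivative in eps
y - mean(x)                # influence function at y -- the same value
```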