Stein's Paradox and Empirical Bayes
In mathematical statistics, Stein’s paradox is an important example that shows that an intuitive estimator which is optimal in many senses (maximum likelihood, uniform minimum-variance unbiasedness, best linear unbiasedness, etc.) is not optimal in the most formal, decision-theoretic sense.
This paradox is typically presented from the perspective of frequentist statistics, and this is the perspective from which we present our initial analysis. After the initial discussion, we also present an empirical Bayesian derivation of this estimator. This derivation largely explains the odd form of the estimator and justifies the phenomenon of shrinkage estimators, which, at least to me, have always seemed awkward to justify from the frequentist perspective. I find the Bayesian perspective on this paradox quite compelling.
A Crash Course in Decision Theory
At its most basic level, statistical decision theory is concerned with quantifying and comparing the effectiveness of various estimators, hypothesis tests, etc. A central concept in this theory is that of a risk function, which is the expected value of the loss function that measures the estimator's error. The problem of measuring error appropriately (that is, the choice of an appropriate loss function) is both subtle and deep. In this post, we will only consider the most popular choice, mean squared error,
$$\mathrm{MSE}(\theta, \hat{\theta}) = E_\theta \left\| \theta - \hat{\theta} \right\|^2.$$

Here $\theta$ is the parameter we are estimating by $\hat{\theta}$, and $\| \cdot \|$ is the Euclidean norm,

$$\| \vec{x} \| = \sqrt{x_1^2 + \cdots + x_n^2},$$

for $\vec{x} = (x_1, \ldots, x_n)$. Mean squared error is the most widely used risk function because of its simple geometric interpretation and convenient algebraic properties.
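As a concrete illustration, a risk of this form can be approximated by simulation: draw many data sets, apply the estimator to each, and average the squared errors. The following is a minimal sketch in Python; the function names, the choice of estimator (the sample mean), and the particular parameter values are arbitrary choices of mine for illustration.

```python
import numpy as np

def estimate_mse(estimator, theta, sample_data, n_reps=10_000, seed=42):
    """Approximate MSE(theta, estimator) by averaging squared errors over simulated data sets."""
    rng = np.random.default_rng(seed)
    sq_errors = np.empty(n_reps)
    for i in range(n_reps):
        data = sample_data(rng)                              # one simulated data set
        sq_errors[i] = np.sum((estimator(data) - theta) ** 2)  # squared Euclidean error
    return sq_errors.mean()

# Example: the sample mean of ten draws from N(theta, I) with theta = (1, -2, 0.5)
theta = np.array([1.0, -2.0, 0.5])
sampler = lambda rng: rng.normal(loc=theta, scale=1.0, size=(10, 3))
print(estimate_mse(lambda X: X.mean(axis=0), theta, sampler))
```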
While a choice of risk function quantifies the average error of a given estimator, the concept of admissibility provides one framework for comparing different estimators of the same quantity. If $\Theta$ is the parameter space, we say that the estimator $\hat{\theta}$ dominates the estimator $\hat{\eta}$ if

$$\mathrm{MSE}(\theta, \hat{\theta}) \leq \mathrm{MSE}(\theta, \hat{\eta})$$

for all $\theta \in \Theta$, and

$$\mathrm{MSE}(\theta_0, \hat{\theta}) < \mathrm{MSE}(\theta_0, \hat{\eta})$$

for some $\theta_0 \in \Theta$. An estimator is admissible if it is not dominated by any other estimator.
While this definition may feel a bit awkward at first, consider the following example. Suppose that there are only three estimators of $\theta$, and their mean squared errors are plotted below.
[Figure: mean squared error curves for the three estimators, plotted against $\theta$]
In this diagram, the red estimator dominates both of the other estimators and is admissible.
The James-Stein Estimator
The James-Stein estimator seeks to estimate the mean, $\theta$, of a multivariate normal distribution, $N(\theta, \sigma^2 I)$. Here $I$ is the $d \times d$ identity matrix, $\theta$ is a $d$-dimensional vector, and $\sigma^2$ is the known common variance of each component.

If $X_1, \ldots, X_n \sim N(\theta, \sigma^2 I)$, the obvious estimator of $\theta$ is the sample mean, $\bar{X} = \frac{1}{n} \sum_{i = 1}^n X_i$. This estimator has many nice properties: it is the maximum likelihood estimator of $\theta$, the uniformly minimum-variance unbiased estimator of $\theta$, the best linear unbiased estimator of $\theta$, and an efficient estimator of $\theta$. The James-Stein estimator, however, will show that despite all of these useful properties, when $d \geq 3$, the sample mean is an inadmissible estimator of $\theta$.
The James-Stein estimator of $\theta$ for the same observations is defined as

$$\hat{\theta}_{JS} = \left( 1 - \frac{(d - 2) \sigma^2}{n \| \bar{X} \|^2} \right) \bar{X}.$$

While the definition of this estimator appears quite strange, it essentially operates by shrinking the sample mean towards zero. The qualifier “essentially” is necessary here, because it is possible, when $n \| \bar{X} \|^2$ is small relative to $(d - 2) \sigma^2$, that the coefficient on $\bar{X}$ may be negative, in which case the estimator is no longer simply shrinking $\bar{X}$ towards zero. At the end of our discussion, we will exploit this caveat to show that the James-Stein estimator itself is inadmissible.
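For concreteness, here is a minimal sketch of this estimator as a Python function. The function name and the convention that the rows of `X` are the observations are my own; `sigma2` is the known common variance.

```python
import numpy as np

def james_stein(X, sigma2):
    """James-Stein estimate of the mean from an (n, d) array of i.i.d. observations."""
    n, d = X.shape
    X_bar = X.mean(axis=0)                                          # sample mean
    shrinkage = 1.0 - (d - 2) * sigma2 / (n * np.sum(X_bar ** 2))   # 1 - (d - 2) sigma^2 / (n ||X_bar||^2)
    return shrinkage * X_bar
```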
We will now prove that the sample mean is inadmissible by calculating the mean squared error of each of these estimators. Using the bias-variance decomposition, we may write the mean squared error of an estimator as
$$\mathrm{MSE}(\theta, \hat{\theta}) = \| E_\theta(\hat{\theta}) - \theta \|^2 + \operatorname{tr}(\mathrm{Var}(\hat{\theta})).$$
We first work with the sample mean. Since this estimator is unbiased, the first term in the decomposition vanishes. It is well known that $\bar{X} \sim N \left( \theta, \frac{\sigma^2}{n} I \right)$, so $\operatorname{tr}(\mathrm{Var}(\bar{X})) = \frac{d \sigma^2}{n}$. Therefore, the mean squared error for the sample mean is

$$\mathrm{MSE}(\theta, \bar{X}) = \frac{d \sigma^2}{n}.$$
The mean squared error of the James-Stein estimator is given by
$$\mathrm{MSE}(\theta, \hat{\theta}_{JS}) = \frac{d \sigma^2}{n} - \frac{(d - 2)^2 \sigma^4}{n^2} E_\theta \left( \frac{1}{\| \bar{X} \|^2} \right).$$
Unfortunately, the derivation of this expression is too involved to reproduce here. For details of this derivation, consult Lehmann and Casella¹.
We see immediately that the first term of this expression is the mean squared error of the sample mean. Therefore, as long as $E_\theta \left( \| \bar{X} \|^{-2} \right)$ is finite, the James-Stein estimator will dominate the sample mean. Note that since $\theta = 0$ leads to the smallest values of $\| \bar{X} \|$ on average, $E_\theta \left( \| \bar{X} \|^{-2} \right) \leq E_0 \left( \| \bar{X} \|^{-2} \right)$. When $\theta = 0$, $\frac{\sigma^2}{n \| \bar{X} \|^2}$ has an inverse chi-squared distribution with $d$ degrees of freedom, and the mean of an inverse chi-squared random variable is finite if and only if it has at least three degrees of freedom. Therefore, for $d \geq 3$, $E_\theta \left( \| \bar{X} \|^{-2} \right)$ is finite and

$$\mathrm{MSE}(\theta, \hat{\theta}_{JS}) = \frac{d \sigma^2}{n} - \frac{(d - 2)^2 \sigma^4}{n^2} E_\theta \left( \frac{1}{\| \bar{X} \|^2} \right) < \frac{d \sigma^2}{n} = \mathrm{MSE}(\theta, \bar{X}),$$

so the James-Stein estimator dominates the sample mean, and the sample mean is therefore inadmissible. The improvement is largest at $\theta = 0$, where $E_0 \left( \| \bar{X} \|^{-2} \right) = \frac{n}{(d - 2) \sigma^2}$ and the mean squared error of the James-Stein estimator is $\frac{d \sigma^2}{n} - \frac{(d - 2) \sigma^2}{n} = \frac{2 \sigma^2}{n}$.
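The dominance result is easy to check numerically. Below is a minimal Monte Carlo sketch; the function name, dimension, sample size, and the particular value of $\theta$ are arbitrary choices of mine for illustration. For $d \geq 3$, the estimated risk of the James-Stein estimator should consistently come out below $\frac{d \sigma^2}{n}$.

```python
import numpy as np

def simulate_mse(theta, sigma2=1.0, n=10, n_reps=20_000, seed=0):
    """Monte Carlo estimates of the MSE of the sample mean and the James-Stein estimator."""
    rng = np.random.default_rng(seed)
    d = len(theta)
    se_mean = np.empty(n_reps)
    se_js = np.empty(n_reps)
    for i in range(n_reps):
        X = rng.normal(loc=theta, scale=np.sqrt(sigma2), size=(n, d))
        X_bar = X.mean(axis=0)
        coef = 1.0 - (d - 2) * sigma2 / (n * np.sum(X_bar ** 2))
        se_mean[i] = np.sum((X_bar - theta) ** 2)
        se_js[i] = np.sum((coef * X_bar - theta) ** 2)
    return se_mean.mean(), se_js.mean()

theta = np.array([0.5, -0.25, 1.0, 0.0, 2.0])   # d = 5
mse_mean, mse_js = simulate_mse(theta)
print(mse_mean, mse_js)   # the first should be close to d * sigma2 / n = 0.5
```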
The natural question now is whether or not the James-Stein estimator is itself admissible; it is not. As we previously observed, when $\| \bar{X} \|$ is small enough, the coefficient in the James-Stein estimator may be negative, and, in this case, it is not shrinking $\bar{X}$ towards zero. We may remedy this problem by defining a modified James-Stein estimator,

$$\hat{\theta}_{JS'} = \max \left\{ 0, 1 - \frac{(d - 2) \sigma^2}{n \| \bar{X} \|^2} \right\} \cdot \bar{X}.$$

It can be shown that this estimator has smaller mean squared error than the James-Stein estimator. This modification amounts to estimating the mean as zero when $\| \bar{X} \|$ is small enough to cause a negative coefficient, which is reminiscent of Hodges' estimator. This modified James-Stein estimator is also inadmissible, but we will not discuss why here.
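In code, the modification is a single clamp on the shrinkage coefficient; the sketch below reuses the layout of the hypothetical `james_stein` function above.

```python
import numpy as np

def james_stein_plus(X, sigma2):
    """Positive-part James-Stein estimate: the shrinkage coefficient is clamped at zero."""
    n, d = X.shape
    X_bar = X.mean(axis=0)
    shrinkage = max(0.0, 1.0 - (d - 2) * sigma2 / (n * np.sum(X_bar ** 2)))
    return shrinkage * X_bar
```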
Empirical Bayes and the James-Stein Estimator
The benefits of shrinkage are an interesting topic, but not immediately obvious. To me, the derivation of the James-Stein estimator using the empirical Bayes method illuminates this topic nicely and relates it to a fundamental tenet of Bayesian statistics.
As before, we are attempting to estimate the mean of the distribution $N(\theta, \sigma^2 I)$ with known variance $\sigma^2$ from samples $X_1, \ldots, X_n$. To do so, we place a $N(0, \tau^2 I)$ prior distribution on $\theta$. Combining these prior and sampling distributions gives the posterior distribution

$$\theta | X_1, \ldots, X_n \sim N \left( \frac{\tau^2}{\frac{\sigma^2}{n} + \tau^2} \cdot \bar{X}, \left( \frac{1}{\tau^2} + \frac{n}{\sigma^2} \right)^{-1} I \right).$$

So the Bayes estimator of $\theta$ is

$$\hat{\theta}_{Bayes} = E(\theta | X_1, \ldots, X_n) = \frac{\tau^2}{\frac{\sigma^2}{n} + \tau^2} \cdot \bar{X}.$$
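As a quick sketch (treating $\tau^2$ as known for the moment, and with names of my own choosing), the Bayes estimator is just a deterministic rescaling of the sample mean:

```python
import numpy as np

def bayes_posterior_mean(X, sigma2, tau2):
    """Posterior mean of theta under a N(0, tau^2 I) prior, with sigma^2 and tau^2 known."""
    n, _ = X.shape
    X_bar = X.mean(axis=0)
    return tau2 / (sigma2 / n + tau2) * X_bar   # tau^2 / (sigma^2 / n + tau^2) * X_bar
```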
The value of $\sigma^2$ is known, but, in general, we do not know the value of $\tau^2$. We will now estimate $\tau^2$ from the data $X_1, \ldots, X_n$. This estimation of the hyperparameter $\tau^2$ from the data is what makes this approach empirical Bayesian rather than fully Bayesian. The difference between the fully Bayesian and empirical Bayesian approaches is interesting both philosophically and decision-theoretically. The practical advantage of empirical Bayes here is that it often allows us to more easily produce estimators that approximate fully Bayesian estimators and have similar (though slightly worse) properties.
There are many ways to approach this problem within the empirical Bayes framework. The James-Stein estimator arises when we find an unbiased estimator of the coefficient in the definition of $\hat{\theta}_{Bayes}$. First, we note that the marginal distribution of $\bar{X}$ is $N \left( 0, \left( \frac{\sigma^2}{n} + \tau^2 \right) I \right)$. We can use this fact to show that

$$\frac{\frac{\sigma^2}{n} + \tau^2}{\| \bar{X} \|^2} \sim \text{Inv-}\chi^2(d).$$

Since the mean of an inverse chi-squared distributed random variable with $d \geq 3$ degrees of freedom is $\frac{1}{d - 2}$, we get that

$$E \left( 1 - \frac{(d - 2) \sigma^2}{n \| \bar{X} \|^2} \right) = \frac{\tau^2}{\frac{\sigma^2}{n} + \tau^2}.$$
We therefore see that the empirical Bayes method, combined with unbiased estimation of this coefficient, yields the James-Stein estimator.
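This unbiasedness claim is also easy to check by simulation: draw $\theta$ from the prior, draw data given $\theta$, and compare the average James-Stein coefficient to the Bayes coefficient. The sketch below does this; all of the specific parameter values are arbitrary choices of mine.

```python
import numpy as np

rng = np.random.default_rng(1)
d, n, sigma2, tau2, n_reps = 5, 10, 1.0, 2.0, 50_000

coefs = np.empty(n_reps)
for i in range(n_reps):
    theta = rng.normal(scale=np.sqrt(tau2), size=d)                # theta ~ N(0, tau^2 I)
    X = rng.normal(loc=theta, scale=np.sqrt(sigma2), size=(n, d))  # X_i | theta ~ N(theta, sigma^2 I)
    X_bar = X.mean(axis=0)
    coefs[i] = 1.0 - (d - 2) * sigma2 / (n * np.sum(X_bar ** 2))

print(coefs.mean())                 # Monte Carlo average of the James-Stein coefficient
print(tau2 / (sigma2 / n + tau2))   # the Bayes coefficient tau^2 / (sigma^2 / n + tau^2)
```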
To me, this derivation explains the phenomenon of shrinkage much more clearly. Bayes estimators may often be seen as a weighted sum of the prior information (in this case, that the mean was likely to be close to zero) and the evidence (the observed values of $X$). In this context, an estimator that shrinks its estimate toward zero seems quite well justified.
1. Lehmann, E. L.; Casella, G. (1998), Theory of Point Estimation (2nd ed.), Springer.