A procedure for choosing between competing models that balances model complexity against the quality of the model's fit to the given data.
For a multiple regression model, one approach makes use of the Mallows $C_p$ statistic, introduced by Mallows in 1964. With $n$ observations and $k$ explanatory variables (see regression), define $s^2$ as the estimate of the experimental error variance obtained from the full model (all $k$ variables). Then, for a model using just $p$ of the $k$ variables,

$$C_p = \frac{1}{s^2}\sum_{j=1}^{n}\left(y_j - \hat{y}_j\right)^2 - (n - 2p),$$

where $y_j$ is an observation and $\hat{y}_j$ is the corresponding fitted value. A model that fits well should have a $C_p$ value close to $p$. An acceptable fit is provided by a model for which

$$C_p \le p + a\left\{F_{a,b}(\alpha) - 1\right\},$$

where $a = k - p + 1$, $b = n - k - 1$, and $F_{a,b}(\alpha)$ is the value exceeded by chance on $100\alpha\%$ of occasions by a random variable having an F-distribution with $a$ and $b$ degrees of freedom. Typically, $\alpha = 0.05$ or $0.01$.
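As an illustration, the following is a minimal sketch, not part of the entry itself: it evaluates $C_p$ over every subset of the explanatory variables and applies the F-based acceptability rule quoted above. The arrays X ($n \times k$) and y and the function names are illustrative assumptions; an intercept is included in every fit, and $p$ counts the variables in the subset, following the entry's formulas as stated.

```python
import itertools

import numpy as np
from scipy.stats import f as f_dist


def rss(X, y):
    """Residual sum of squares of a least-squares fit with an intercept."""
    A = np.column_stack([np.ones(len(y)), X])
    beta, *_ = np.linalg.lstsq(A, y, rcond=None)
    resid = y - A @ beta
    return float(resid @ resid)


def cp_table(X, y, alpha=0.05):
    """Mallows' Cp and the F-based acceptability flag for every subset."""
    n, k = X.shape
    b = n - k - 1
    s2 = rss(X, y) / b                       # s^2 from the full k-variable model
    rows = []
    for p in range(1, k + 1):
        a = k - p + 1
        f_crit = f_dist.ppf(1 - alpha, a, b)  # F_{a,b}(alpha)
        for cols in itertools.combinations(range(k), p):
            cp = rss(X[:, list(cols)], y) / s2 - (n - 2 * p)
            # Acceptable fit: Cp <= p + a*(F_{a,b}(alpha) - 1)
            acceptable = cp <= p + a * (f_crit - 1)
            rows.append((cols, cp, acceptable))
    return rows
```

Subsets flagged acceptable are those whose lack of fit, relative to the full model, is not significant at level $\alpha$.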
A more generally applicable alternative is based on AIC (Akaike's information criterion), proposed by Akaike in 1969. For categorical data this amounts to choosing the model that minimizes $G^2 - 2\nu$, where $G^2$ is the likelihood-ratio goodness-of-fit statistic and $\nu$ is the number of degrees of freedom associated with the model; since simpler models have larger $\nu$, the subtracted term rewards parsimony. If the Bayesian information criterion (BIC) (also called the Schwarz criterion) is used, then the quantity minimized is $G^2 - \nu \ln n$, where $\ln$ denotes the natural logarithm and $n$ is the sample size. Since $\ln n > 2$ for $n \ge 8$, the penalty is more severe, and this usually results in the selection of a simpler model. A third alternative of this type is the Hannan-Quinn criterion, for which the quantity to be minimized is $G^2 - 2\nu \ln(\ln n)$.
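A minimal sketch of this comparison, assuming each candidate model has already been fitted and summarized by its $G^2$ statistic and residual degrees of freedom $\nu$ (the candidate dictionary, its values, and the function name are illustrative assumptions):

```python
import math


def best_model(candidates, n, criterion="AIC"):
    """candidates: {name: (G2, nu)}; n: sample size.

    Returns the candidate name minimizing the chosen penalized statistic."""
    def score(G2, nu):
        if criterion == "AIC":
            return G2 - 2 * nu
        if criterion == "BIC":                        # Schwarz criterion
            return G2 - nu * math.log(n)
        if criterion == "HQ":                         # Hannan-Quinn criterion
            return G2 - 2 * nu * math.log(math.log(n))
        raise ValueError(f"unknown criterion: {criterion}")
    return min(candidates, key=lambda name: score(*candidates[name]))


# Hypothetical two-way table: the saturated model fits perfectly (G2 = 0 on
# nu = 0 df); the independence model has G2 = 3.2 on nu = 4 df.
models = {"saturated": (0.0, 0), "independence": (3.2, 4)}
print(best_model(models, n=100, criterion="AIC"))  # -> independence
print(best_model(models, n=100, criterion="BIC"))  # -> independence
```

Here the saturated model scores $0$ under every criterion, while the independence model scores $3.2 - 8 = -4.8$ under AIC and $3.2 - 4\ln 100 \approx -15.2$ under BIC, so the simpler model is selected in both cases.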
Whatever procedure is used for model selection, it is usually the case that the model fits less well (as measured by $R^2$, the coefficient of determination, see ANOVA) when it is applied to new data. The reduction in fit is described as shrinkage. See also stepwise procedure.
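A minimal sketch of shrinkage on synthetic data (the data-generating model, sizes, and names are all illustrative assumptions): coefficients fitted to one sample typically give a lower $R^2$ when applied to a fresh sample from the same process.

```python
import numpy as np

rng = np.random.default_rng(1)
n, k = 40, 10
X = rng.normal(size=(n, k))
y = X[:, 0] + rng.normal(size=n)            # only the first variable matters

# Fit the full model (intercept plus all k variables) on the original data.
A = np.column_stack([np.ones(n), X])
beta, *_ = np.linalg.lstsq(A, y, rcond=None)


def r_squared(X, y, beta):
    """R^2 of predictions made with the already-fitted coefficients."""
    pred = np.column_stack([np.ones(len(y)), X]) @ beta
    ss_res = np.sum((y - pred) ** 2)
    ss_tot = np.sum((y - y.mean()) ** 2)
    return 1 - ss_res / ss_tot


# Fresh data from the same process: the R^2 typically shrinks.
X_new = rng.normal(size=(n, k))
y_new = X_new[:, 0] + rng.normal(size=n)
print(r_squared(X, y, beta), r_squared(X_new, y_new, beta))
```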