Two variables are associated if they are not independent, i.e. if the value of one variable is informative about the value, or the distribution of the values, of the other. Thus, for a human population, height and weight are associated, and so are skin colour and ethnicity. In the case of numerical variables an appropriate measure is the correlation coefficient. In the case of ordinal variables and categorical variables, an alternative measure of association is required.
Yule used the term ‘association’ in his 1900 paper that proposed a measure suitable for the case of two variables each having two categories.
For two categorical variables (A and B having, respectively, J and K categories) a measure with a probabilistic interpretation is Goodman and Kruskal's lambda ($\lambda$), which they suggested in 1954. Let $f_{jk}$ be the frequency of the $(j, k)$ category combination, with $f_{j0}=\sum_k f_{jk}$, $f_{0k}=\sum_j f_{jk}$ and $f_{00}=\sum_j \sum_k f_{jk}$. Suppose that we are asked to guess the category of B for the next observation. An intelligent guess would be the category that was the commonest so far (with marginal total equal to $f_{0m}$, say). Judging by the past data, the probability that our guess will be correct is estimated as $P$, given by
$$P = \frac{f_{0m}}{f_{00}}.$$
Suppose now that the next observation belongs to category $j$ of variable A. Our best guess for the category of B now corresponds to the maximum of $f_{j1}, f_{j2}, \ldots, f_{jK}$. The revised probability of our being correct will be estimated by
$$\frac{\max_k f_{jk}}{f_{j0}}.$$
Since the probability of the next observation belonging to category $j$ of A is estimated as $f_{j0}/f_{00}$, the estimated probability of being correct taking into account the category of A for the next observation is $P_A$, given by
$$P_A = \sum_j \frac{f_{j0}}{f_{00}} \cdot \frac{\max_k f_{jk}}{f_{j0}} = \frac{1}{f_{00}} \sum_j \max_k f_{jk}.$$
The statistic $\lambda$, given by
$$\lambda = \frac{P_A - P}{1 - P} = \frac{\sum_j \max_k f_{jk} - f_{0m}}{f_{00} - f_{0m}},$$
is a measure of the proportional reduction in error resulting from knowledge about A.
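As an illustration of the calculation just described, the following is a minimal Python sketch that computes lambda from a table of frequencies, with rows as the categories of A and columns as the categories of B. The function name `goodman_kruskal_lambda` and the example frequencies are invented for illustration and are not taken from any source.

```python
def goodman_kruskal_lambda(table):
    """Lambda for guessing B (columns) with knowledge of A (rows).

    table[j][k] holds the frequency of category combination (j, k).
    """
    f00 = sum(sum(row) for row in table)             # grand total
    col_totals = [sum(col) for col in zip(*table)]   # marginal totals of B
    f0m = max(col_totals)                            # commonest category of B
    sum_row_max = sum(max(row) for row in table)     # sum over j of max_k f_jk
    if f00 == f0m:
        # Every observation lies in one category of B: no error to reduce.
        raise ValueError("lambda is undefined for this table")
    return (sum_row_max - f0m) / (f00 - f0m)


# Hypothetical 2 x 3 table (illustrative values only).
example_table = [[30, 10, 10],
                 [10, 20, 20]]
example_lambda = goodman_kruskal_lambda(example_table)
```

For this made-up table the guess ignoring A is correct with estimated probability 40/100, and the guess using A with estimated probability 50/100, so lambda is (0.5 - 0.4)/(1 - 0.4), or one-sixth.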
With ordinal variables there are more appropriate measures that take account of the ordering. Let $j$ and $j'$ be two categories of one ordinal variable, A, and let $k$ and $k'$ be two categories of a second ordinal variable, B. The quantities $S$, $D$ and $T_B$ are defined in terms of pairs of observations, one belonging to cell $(j, k)$ and the other to cell $(j', k')$:
$S$ = the total number of pairs for which, when $j > j'$, $k > k'$;
$D$ = the total number of pairs for which, when $j > j'$, $k < k'$;
$T_B$ = the total number of pairs for which, when $j > j'$, $k = k'$.
These quantities are calculated using every pair of observations. Two measures based on these statistics are
Goodman and Kruskal's gamma ($\gamma$), proposed by Goodman and Kruskal in 1954, which is given by
$$\gamma = \frac{S - D}{S + D},$$
and Somers's $d_{BA}$, proposed by the sociologist Robert H. Somers in 1962, which is the preferred measure when B is believed to depend on A. The formula is
$$d_{BA} = \frac{S - D}{S + D + T_B}.$$
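The pair counts and the two ordinal measures can be sketched in Python as follows. This is an illustrative brute-force calculation, assuming a frequency table whose rows are the ordered categories of A and whose columns are the ordered categories of B. The function names and the example table are hypothetical.

```python
def pair_counts(table):
    """Return (S, D, T_B) for a table with ordered rows (A) and columns (B)."""
    n_rows, n_cols = len(table), len(table[0])
    s = d = t_b = 0
    for j in range(n_rows):
        for k in range(n_cols):
            for j2 in range(j):              # only cells with j > j'
                for k2 in range(n_cols):
                    pairs = table[j][k] * table[j2][k2]
                    if k > k2:
                        s += pairs           # same ordering on A and B
                    elif k < k2:
                        d += pairs           # opposite ordering on A and B
                    else:
                        t_b += pairs         # tied on B but not on A
    return s, d, t_b


def gamma_and_somers_d(table):
    """Return (gamma, d_BA); raises ZeroDivisionError if S + D is zero."""
    s, d, t_b = pair_counts(table)
    return (s - d) / (s + d), (s - d) / (s + d + t_b)


# Hypothetical 2 x 2 table (illustrative values only).
example_counts = pair_counts([[2, 1], [1, 2]])
example_gamma, example_d = gamma_and_somers_d([[2, 1], [1, 2]])
```

For this made-up table S = 4, D = 1 and the number of pairs tied on B only is 4, giving gamma = 3/5 and a Somers's d of 3/9. Each pair of cells is visited once, so the sketch suits small tables; no attempt is made at efficiency.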
See also two-by-two table.