Conditional Expectation. Best Linear Prediction

An important tool in the study of the relationships that exist between two jointly distributed random variables, X and Y, is provided by the notion of conditional expectation. In section 3 of Chapter 7 the notion of the conditional distribution function of the random variable Y, given the random variable X, is defined. We now define the conditional mean of Y, given X = x, by

E[Y \mid X = x] = \int_{-\infty}^{\infty} y \, f_{Y \mid X}(y \mid x) \, dy, \qquad E[Y \mid X = x] = \sum_{y} y \, p_{Y \mid X}(y \mid x);

the last two equations hold, respectively, in the cases in which Y is continuous or discrete. From a knowledge of the conditional mean of Y, given X, the value of the mean E[Y] may be obtained:

E[Y] = \int_{-\infty}^{\infty} E[Y \mid X = x] \, dF_X(x) = E\big[\, E[Y \mid X] \,\big].
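
The last identity may be checked numerically. The Python sketch below uses a small hypothetical two-stage discrete distribution (the particular values are illustrative assumptions, not taken from the text) and computes E[Y] both directly from the joint probability mass function and by averaging the conditional means E[Y | X = x].

    # Hypothetical example: X is 1 or 2 with equal probability, and,
    # given X = x, Y is uniform on {0, 1, ..., x}.
    p_x = {1: 0.5, 2: 0.5}
    p_y_given_x = {x: {y: 1.0 / (x + 1) for y in range(x + 1)} for x in p_x}

    # E[Y] computed directly from the joint probability mass function.
    e_y_direct = sum(p_x[x] * p * y
                     for x in p_x
                     for y, p in p_y_given_x[x].items())

    # E[Y] computed by first forming the conditional means E[Y | X = x].
    cond_mean = {x: sum(y * p for y, p in p_y_given_x[x].items()) for x in p_x}
    e_y_iterated = sum(p_x[x] * cond_mean[x] for x in p_x)

    print(e_y_direct, e_y_iterated)   # both print 0.75
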
Example 7A. Sampling from an urn of random composition. Let a random sample of size n be drawn without replacement from an urn containing N balls. Suppose that the number W of white balls in the urn is a random variable. Let X be the number of white balls contained in the sample. The conditional distribution of X, given W, is discrete, with probability mass function for x = 0, 1, \dots, n and w = 0, 1, \dots, N given by

P[X = x \mid W = w] = \binom{w}{x} \binom{N - w}{n - x} \Big/ \binom{N}{n},

since the conditional probability law of X, given W, is hypergeometric. The conditional mean of X, given W, can be readily obtained from a knowledge of the mean of a hypergeometric random variable;

E[X \mid W = w] = n \, \frac{w}{N}.

The mean number of white balls in the sample drawn is then equal to

(7.5)        E[X] = E\big[\, E[X \mid W] \,\big] = n \, \frac{E[W]}{N}.

Now E[W]/N is the mean proportion of white balls in the urn. Consequently, (7.5) is analogous to the formulas for the mean of a binomial or hypergeometric random variable. Note that the probability law of X is hypergeometric if W is hypergeometric and is binomial if W is binomial. (See theoretical exercise 4.1 of Chapter 4.)
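
Formula (7.5) can be verified by direct enumeration. The following Python sketch assumes a particular (purely illustrative) distribution for W and illustrative values of n and N; the letters n, N, W, and X match the notation used above.

    from math import comb

    N, n = 6, 3                              # urn size and sample size (illustrative)
    p_w = {0: 0.1, 2: 0.4, 4: 0.3, 6: 0.2}   # assumed distribution of W

    def hypergeom_pmf(x, w):
        # P[X = x | W = w] when n balls are drawn without replacement from N.
        return comb(w, x) * comb(N - w, n - x) / comb(N, n)

    # E[X] computed from the joint law of (W, X).
    e_x = sum(p * hypergeom_pmf(x, w) * x
              for w, p in p_w.items()
              for x in range(n + 1))

    # E[X] computed from formula (7.5).
    e_w = sum(w * p for w, p in p_w.items())
    print(e_x, n * e_w / N)                  # the two values agree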

Example 7B. The conditional mean of jointly normal random variables. Two random variables, X and Y, are jointly normally distributed if they possess a joint probability density function given by (2.18), with means m_X and m_Y, standard deviations \sigma_X and \sigma_Y, and correlation coefficient \rho. Then the conditional probability density function of Y, given X = x, is itself normal:

f_{Y \mid X}(y \mid x) = \frac{1}{\sigma_Y \sqrt{2\pi (1 - \rho^2)}} \exp\left\{ - \frac{\left[\, y - m_Y - \rho \frac{\sigma_Y}{\sigma_X}(x - m_X) \,\right]^2}{2 \sigma_Y^2 (1 - \rho^2)} \right\}.

Consequently, the conditional mean of Y, given X = x, is given by

(7.7)        E[Y \mid X = x] = m_Y + \rho \frac{\sigma_Y}{\sigma_X} (x - m_X) = \alpha + \beta x,

in which we define the constants \alpha and \beta by

\beta = \rho \frac{\sigma_Y}{\sigma_X}, \qquad \alpha = m_Y - \beta m_X.

Similarly,

E[X \mid Y = y] = m_X + \rho \frac{\sigma_X}{\sigma_Y} (y - m_Y).

From (7.7) it is seen that the conditional mean of a random variable Y, given the value x of a random variable X with which Y is jointly normally distributed, is a linear function of x. Except in the case in which the two random variables, X and Y, are jointly normally distributed, it is generally to be expected that E[Y \mid X = x] is a nonlinear function of x.
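
The linearity in (7.7) can be seen in a simulation. The Python sketch below (using NumPy, with illustrative parameter values that are assumptions of this example) estimates E[Y | X ≈ x0] from bivariate normal samples and compares it with the right-hand side of (7.7).

    import numpy as np

    rng = np.random.default_rng(0)
    m_x, m_y, s_x, s_y, rho = 1.0, 2.0, 1.5, 0.5, 0.8   # illustrative parameters

    cov = [[s_x**2, rho * s_x * s_y], [rho * s_x * s_y, s_y**2]]
    x, y = rng.multivariate_normal([m_x, m_y], cov, size=200_000).T

    x0 = 2.0                                        # condition on X falling near x0
    empirical = y[np.abs(x - x0) < 0.05].mean()     # estimate of E[Y | X = x0]
    formula = m_y + rho * (s_y / s_x) * (x0 - m_x)  # equation (7.7)
    print(empirical, formula)                       # approximately equal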

The conditional mean of one random variable, given another random variable, represents one possible answer to the problem of prediction. Suppose that a prospective father of height x wishes to predict the height of his unborn son. If the height of the son is regarded as a random variable Y and the height x of the father is regarded as an observed value of a random variable X, then as the prediction of the son's height we take the conditional mean E[Y \mid X = x]. The justification of this procedure is that the conditional mean may be shown to have the property that

(7.10)        E\big[ (Y - E[Y \mid X])^2 \big] \le E\big[ (Y - g(X))^2 \big] = \int_{-\infty}^{\infty} \int_{-\infty}^{\infty} \big( y - g(x) \big)^2 f_{X,Y}(x, y) \, dx \, dy

for any function g for which the last written integral exists. In words, (7.10) is interpreted to mean that if Y is to be predicted by a function g(X) of the random variable X, then the conditional mean E[Y \mid X] has the smallest mean square error among all possible predictors g(X).
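
One standard way to see why (7.10) holds, sketched here in outline: for any function g for which the expectations exist,

E\big[ (Y - g(X))^2 \big] = E\big[ (Y - E[Y \mid X])^2 \big] + E\big[ (E[Y \mid X] - g(X))^2 \big] \ge E\big[ (Y - E[Y \mid X])^2 \big],

since the cross term 2 E\big[ (Y - E[Y \mid X]) (E[Y \mid X] - g(X)) \big] vanishes; upon first taking the expectation conditional on X, the factor E[\, Y - E[Y \mid X] \mid X \,] is equal to 0.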

From (7.7) it is seen that, in the case in which the random variables X and Y are jointly normally distributed, the problem of computing the conditional mean E[Y \mid X = x] may be reduced to that of computing the constants \alpha and \beta, for which one requires a knowledge only of the means, variances, and correlation coefficient of X and Y. If these moments are not known, they must be estimated from observed data. The part of statistics concerned with the estimation of the parameters \alpha and \beta is called regression analysis.

It may happen that the joint probability law of the random variables X and Y is unknown, or is known but is such that the calculation of the conditional mean E[Y \mid X = x] is intractable. Suppose, however, that one knows the means, variances (assumed to be positive), and correlation coefficient of X and Y. Then the prediction problem may be solved by forming the best linear predictor of Y, given X, denoted by E^*[Y \mid X]. The best linear predictor of Y, given X, is defined as that linear function a + bX of the random variable X that minimizes the mean square error of prediction E[(Y - a - bX)^2] involved in the use of a + bX as a predictor of Y. Now

\frac{\partial}{\partial a} E\big[ (Y - a - bX)^2 \big] = -2 \, E[\, Y - a - bX \,], \qquad \frac{\partial}{\partial b} E\big[ (Y - a - bX)^2 \big] = -2 \, E\big[ X (Y - a - bX) \big].

Solving for the values of a and b, denoted by a^* and b^*, at which these derivatives are equal to 0, one sees that a^* and b^* satisfy the equations

E[Y] = a^* + b^* E[X], \qquad E[XY] = a^* E[X] + b^* E[X^2].

Therefore, E^*[Y \mid X] = a^* + b^* X, in which

(7.13)        b^* = \frac{\operatorname{Cov}[X, Y]}{\operatorname{Var}[X]} = \rho \frac{\sigma_Y}{\sigma_X}, \qquad a^* = E[Y] - b^* E[X].

Comparing (7.7) and (7.13), one sees that the best linear predictor E^*[Y \mid X] coincides with the best predictor, or conditional mean, E[Y \mid X], in the case in which the random variables X and Y are jointly normally distributed.
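
As a numerical illustration (with assumed first and second moments chosen only for the example), the following Python sketch solves the two equations above for a* and b* and confirms that they agree with the closed form (7.13).

    import numpy as np

    # Assumed illustrative moments of X and Y.
    m_x, m_y = 1.0, 3.0
    s_x, s_y, rho = 2.0, 1.5, -0.6
    e_x2 = s_x**2 + m_x**2                  # E[X^2]
    e_xy = rho * s_x * s_y + m_x * m_y      # E[XY]

    # Normal equations:  E[Y] = a + b E[X],  E[XY] = a E[X] + b E[X^2].
    A = np.array([[1.0, m_x], [m_x, e_x2]])
    a_star, b_star = np.linalg.solve(A, np.array([m_y, e_xy]))

    # Closed form (7.13).
    b_closed = rho * s_y / s_x
    a_closed = m_y - b_closed * m_x
    print((a_star, b_star), (a_closed, b_closed))   # the two pairs agree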

We can readily compute the mean square error of prediction achieved with the use of the best linear predictor. We have

(7.14)        E\big[ (Y - E^*[Y \mid X])^2 \big] = \sigma_Y^2 (1 - \rho^2).

From (7.14) one obtains the important conclusion that the closer the correlation coefficient \rho between two random variables is to 1 in absolute value, the smaller is the mean square error of prediction involved in predicting the value of one of the random variables from the value of the other.
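
The effect described by (7.14) is easy to observe by simulation. The Python sketch below (illustrative parameters, zero means, Var[X] = 1) computes the mean square error of the best linear predictor for several values of the correlation coefficient and compares it with the right-hand side of (7.14).

    import numpy as np

    rng = np.random.default_rng(1)
    s_y = 2.0
    for rho in (0.0, 0.5, 0.9, 0.99):
        cov = [[1.0, rho * s_y], [rho * s_y, s_y**2]]
        x, y = rng.multivariate_normal([0.0, 0.0], cov, size=100_000).T
        b = rho * s_y                            # best linear predictor b*X (zero means)
        mse = np.mean((y - b * x) ** 2)
        print(rho, mse, s_y**2 * (1 - rho**2))   # simulated vs. formula (7.14)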

The Phenomenon of “Spurious” Correlation. Given three random variables X, Y, and Z, let U and V be defined by

U = \frac{X}{Z}, \qquad V = \frac{Y}{Z}

(or in some similar way) as functions of X, Y, and Z. The reader should be careful not to infer the existence of a correlation between X and Y from the existence of a correlation between U and V.

Example 7C. Do storks bring babies? Let Z be the number of women of child-bearing age in a certain geographical area, X the number of storks in the area, and Y the number of babies born in the area during a specified period of time. The random variables U and V, defined by

U = \frac{X}{Z}, \qquad V = \frac{Y}{Z},

then represent, respectively, the number of storks per woman and the number of babies born per woman in the area. If the correlation coefficient between U and V is close to 1, does that not prove that storks bring babies? Indeed, even if it is proved only that the correlation coefficient is positive, would that not prove that the presence of storks in an area has a beneficial influence on the birth rate there? The reader interested in a discussion of these delightful questions would be well advised to consult J. Neyman, Lectures and Conferences on Mathematical Statistics and Probability, Washington, D.C., 1952, pp. 143–154.
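
The phenomenon is easy to reproduce by simulation. In the Python sketch below (all numbers are invented for illustration) the stork counts X and baby counts Y are generated independently of one another; dividing both by the common count Z of women nevertheless produces rates U and V with a large positive sample correlation.

    import numpy as np

    rng = np.random.default_rng(2)
    n_areas = 10_000
    x = rng.poisson(50, size=n_areas)            # storks per area
    y = rng.poisson(200, size=n_areas)           # babies per area
    z = rng.uniform(100, 10_000, size=n_areas)   # women of child-bearing age per area

    u, v = x / z, y / z                          # storks and babies per woman
    print(np.corrcoef(x, y)[0, 1])               # near 0: the counts are unrelated
    print(np.corrcoef(u, v)[0, 1])               # strongly positive: common divisor Z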

Theoretical Exercises

In the following exercises let X_1, X_2, and X_3 be jointly distributed random variables whose first and second moments are assumed known and whose variances are positive.

7.1. The best linear predictor, denoted by E^*[X_1 \mid X_2, X_3], of X_1, given X_2 and X_3, is defined as the linear function a + b_2 X_2 + b_3 X_3, which minimizes E[(X_1 - a - b_2 X_2 - b_3 X_3)^2]. Show that

E^*[X_1 \mid X_2, X_3] = E[X_1] + b_2 \big( X_2 - E[X_2] \big) + b_3 \big( X_3 - E[X_3] \big),

where

b_2 = \frac{\sigma_1}{\sigma_2} \, \frac{\rho_{12} - \rho_{13}\rho_{23}}{1 - \rho_{23}^2}, \qquad b_3 = \frac{\sigma_1}{\sigma_3} \, \frac{\rho_{13} - \rho_{12}\rho_{23}}{1 - \rho_{23}^2},

in which we define

\sigma_j = \sqrt{\operatorname{Var}[X_j]}, \qquad \rho_{jk} = \rho[X_j, X_k], \qquad j, k = 1, 2, 3.
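
A numerical check of the asserted form of b_2 and b_3 (using assumed illustrative standard deviations and correlations for X_1, X_2, X_3) may be carried out as follows; the normal equations are solved directly and compared with the correlation form quoted above.

    import numpy as np

    s = np.array([1.0, 2.0, 1.5])               # assumed standard deviations
    r = np.array([[1.0, 0.5, 0.3],              # assumed correlation matrix
                  [0.5, 1.0, 0.4],
                  [0.3, 0.4, 1.0]])
    cov = r * np.outer(s, s)                    # covariance matrix

    # Normal equations for predicting X1 from (X2, X3):
    # Cov[(X2, X3)] @ (b2, b3) = (Cov[X1, X2], Cov[X1, X3]).
    b = np.linalg.solve(cov[1:, 1:], cov[0, 1:])

    # Correlation form of the coefficients.
    b2 = (s[0] / s[1]) * (r[0, 1] - r[0, 2] * r[1, 2]) / (1 - r[1, 2] ** 2)
    b3 = (s[0] / s[2]) * (r[0, 2] - r[0, 1] * r[1, 2]) / (1 - r[1, 2] ** 2)
    print(b, (b2, b3))                          # the two computations agree
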
7.2. The residual of X_1 with respect to X_2 and X_3, denoted by X_{1 \cdot 23}, is defined by

X_{1 \cdot 23} = X_1 - E^*[X_1 \mid X_2, X_3].

Show that X_{1 \cdot 23} is uncorrelated with X_2 and X_3. Consequently, conclude that the mean square error of prediction, called the residual variance of X_1, given X_2 and X_3, is given by

E\big[ X_{1 \cdot 23}^2 \big] = \sigma_1^2 \left( 1 - \frac{\rho_{12}^2 + \rho_{13}^2 - 2\rho_{12}\rho_{13}\rho_{23}}{1 - \rho_{23}^2} \right).

Next show that the variance of the predictor E^*[X_1 \mid X_2, X_3] is given by

\operatorname{Var}\big[ E^*[X_1 \mid X_2, X_3] \big] = \sigma_1^2 \, \frac{\rho_{12}^2 + \rho_{13}^2 - 2\rho_{12}\rho_{13}\rho_{23}}{1 - \rho_{23}^2}.

The positive quantity \rho_{1(23)}, defined by

\rho_{1(23)} = \sqrt{ \frac{\operatorname{Var}\big[ E^*[X_1 \mid X_2, X_3] \big]}{\operatorname{Var}[X_1]} },

is called the multiple correlation coefficient between X_1 and the random vector (X_2, X_3). To understand the meaning of the multiple correlation coefficient, express in terms of it the residual variance of X_1, given X_2 and X_3.

7.3. The partial correlation coefficient of X_1 and X_2 with respect to X_3 is defined by

\rho_{12 \cdot 3} = \frac{\rho_{12} - \rho_{13}\rho_{23}}{\sqrt{(1 - \rho_{13}^2)(1 - \rho_{23}^2)}},

in which \rho_{jk} = \rho[X_j, X_k] for j, k = 1, 2, 3. Show that

\rho_{12 \cdot 3} = \rho\big[\, X_1 - E^*[X_1 \mid X_3], \; X_2 - E^*[X_2 \mid X_3] \,\big];

that is, \rho_{12 \cdot 3} is the correlation coefficient between the residuals of X_1 and X_2 with respect to X_3.

7.4. (Continuation of example 7A). Show that

Exercises

7.1. Let be jointly distributed random variables with zero means, unit variances, and covariances , . Find (i) the best linear predictor of , given , (ii) the best linear predictor of , given , (iii) the partial correlation between and , given , (iv) the best linear predictor of , given and , (v) the residual variance of , given and , (vi) the residual variance of , given .

 

Answer

(i) ; (ii) ; (iii) ; (iv) ; (v) 0.35; (vi) 0.36.

 

7.2. Find the conditional mean of Y, given X = x, if X and Y are jointly continuous random variables with a joint probability density function vanishing except for , and in the case in which given by

7.3. Let , in which is uniformly distributed on 0 to 1. Show that for

Find the mean square error of prediction achieved by the use of (i) the best linear predictor, (ii) the best predictor.

7.4. Let X_1, X_2, and X_3 be uncorrelated random variables with equal variances. Let . Show that

 

Answer

(i) ; (ii) 0.