{
"cells": [
{
"cell_type": "markdown",
"metadata": {},
"source": [
"## Exercises ##"
]
},
{
"cell_type": "markdown",
"metadata": {},
"source": [
"**1. Expected minimum and maximum of i.i.d. uniform variables** \n",
"\n",
"Let $U_1, U_2, \\ldots, U_n$ be i.i.d. uniform on the interval $(0, 1)$, and let $L_n = \\max(U_1, U_2, \\ldots, U_n)$. That is, let $L_n$ be the largest of $U_1, U_2, \\ldots, U_n$. \n",
"\n",
"[It is a fact about independent continuous random variables that the chance that they are equal is $0$. So you don't have to worry about \"ties\". That is, you can assume that $U_1, U_2, \\ldots, U_n$ are $n$ distinct values.]\n",
"\n",
"**a)** For $0 < x < 1$, find $P(U_1 \\le x)$. Hence find $P(L_n \\le x)$.\n",
"\n",
"**b)** Find the density of $L_n$.\n",
"\n",
"**c)** Find $E(L_n)$. \n",
"\n",
"**d)** To interpret the answer to Part **c**, let $n=2$ for a start. Imagine marking the two values $U_1$ and $U_2$ on the unit interval. These two random values split the unit interval $(0, 1)$ into 3 pieces of random lengths. It is a fact (and makes intuitive sense) that the lengths of the 3 pieces are identically distributed. Use this to interpret your answer to $E(L_2)$, and then generalize the interpretation to $E(L_n)$.\n",
"\n",
"**e)** Now let $M_n = \\min(U_1, U_2, \\ldots, U_n)$ be the smallest of $U_1, U_2, \\ldots, U_n$. Use the idea in Part **d** to find $E(M_n)$."
]
},
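{
"cell_type": "markdown",
"metadata": {},
"source": [
"*A quick check by simulation.* Not part of the exercise, but if you want to check your formula from Part **c** numerically, a short simulation (this sketch assumes NumPy is available) estimates $E(L_n)$; compare the printed value with your answer."
]
},
{
"cell_type": "code",
"execution_count": null,
"metadata": {},
"outputs": [],
"source": [
"import numpy as np\n",
"\n",
"rng = np.random.default_rng(0)\n",
"n = 10          # sample size; try a few values\n",
"reps = 100_000  # number of simulated samples\n",
"\n",
"# Each row is one sample of n i.i.d. uniform (0, 1) values.\n",
"samples = rng.random((reps, n))\n",
"\n",
"# Empirical mean of L_n = max(U_1, ..., U_n); compare with your formula.\n",
"print(samples.max(axis=1).mean())"
]
},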
{
"cell_type": "markdown",
"metadata": {},
"source": [
"**2. Range of a uniform sample** \n",
"\n",
"[This problem will go faster if you have done the previous one.]\n",
"\n",
"Let $\\theta_1 < \\theta_2$ and suppose $X_1, X_2, \\ldots, X_n$ are i.i.d. uniform on the interval $(\\theta_1, \\theta_2)$. Let $\\theta = \\theta_2 - \\theta_1$ be the length of the interval.\n",
"\n",
"**a)** Let $M_n = \\min(X_1, X_2, \\ldots, X_n)$ be the sample minimum and $L_n = \\max(X_1, X_2, \\ldots, X_n)$ the sample maximum. The statistic $R_n = L_n - M_n$ is called the *range* of the sample and is a natural estimator of $\\theta$. Without calculation, explain why $R_n$ is biased, and say whether it underestimates or overestimates $\\theta$.\n",
"\n",
"**b)** Find the bias of $R_n$ and confirm that its sign is consistent with your answer to Part **a**. For large $n$, is the size of the bias large or small?\n",
"\n",
"**c)** Use $R_n$ to construct $T_n$, an unbiased estimator of $\\theta$. \n",
"\n",
"**d)** Compare $SD(R_n)$ and $SD(T_n)$. Which one is bigger? For large $n$, is it a lot bigger or just a bit bigger?"
]
},
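{
"cell_type": "markdown",
"metadata": {},
"source": [
"*Optional simulation check.* This sketch (assuming NumPy, and taking $\\theta_1 = 0$ and $\\theta_2 = 1$ for concreteness, so $\\theta = 1$) estimates $E(R_n) - \\theta$; its sign shows the direction of the bias."
]
},
{
"cell_type": "code",
"execution_count": null,
"metadata": {},
"outputs": [],
"source": [
"import numpy as np\n",
"\n",
"rng = np.random.default_rng(0)\n",
"theta_1, theta_2 = 0.0, 1.0  # assumed interval, so theta = 1\n",
"n = 10\n",
"reps = 100_000\n",
"\n",
"samples = rng.uniform(theta_1, theta_2, (reps, n))\n",
"ranges = samples.max(axis=1) - samples.min(axis=1)\n",
"\n",
"# Empirical E(R_n) minus theta estimates the bias of R_n.\n",
"print(ranges.mean() - (theta_2 - theta_1))"
]
},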
{
"cell_type": "markdown",
"metadata": {},
"source": [
"**3. Regression estimates**\n",
"\n",
"For a random mother-daughter pair, let $X$ be the height of the mother and $Y$ the height of the daughter. In the notation of [Section 11.3](http://stat88.org/textbook/content/Chapter_11/03_Least_Squares_Linear_Regression.html#) suppose $\\mu_X = 63.5$, $\\mu_Y = 63.7$, $\\sigma_X = \\sigma_Y = 2$, and $r(X, Y) = 0.6$.\n",
"\n",
"**a)** Find the equation of the regression line for estimating $Y$ based on $X$.\n",
"\n",
"**b)** Find the regression estimate of $Y$ given that $X = 62$ inches.\n",
"\n",
"**c)** Find the regression estimate of $Y$ given that $X$ is $2$ standard deviations above $\\mu_X$. You should be able to do this without finding the value of $X$ in inches."
]
},
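{
"cell_type": "markdown",
"metadata": {},
"source": [
"*Checking by computation.* After you have worked Parts **a** and **b** by hand, you can confirm the arithmetic with a few lines of plain Python (no libraries needed)."
]
},
{
"cell_type": "code",
"execution_count": null,
"metadata": {},
"outputs": [],
"source": [
"mu_X, mu_Y = 63.5, 63.7\n",
"sigma_X = sigma_Y = 2\n",
"r = 0.6\n",
"\n",
"slope = r * sigma_Y / sigma_X\n",
"intercept = mu_Y - slope * mu_X\n",
"print(slope, intercept)        # coefficients of the regression line\n",
"print(slope * 62 + intercept)  # estimate of Y when X = 62 inches"
]
},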
{
"cell_type": "markdown",
"metadata": {},
"source": [
"**4. Estimating percentile ranks** \n",
"\n",
"It can be shown that for football shaped scatter plots it is OK to assume that each of the two variables is normally distributed. \n",
"\n",
"Suppose that a large number of students take two tests (like the Math and Verbal SAT), and suppose that the scatter plot of the two scores is football shaped with a correlation of 0.6. \n",
"\n",
"**a)** Let $(X, Y)$ be the scores of a randomly picked student, and suppose $X$ is on the the 90th percentile. Estimate the percentile rank of $Y$.\n",
"\n",
"**b)** Let $(X, Y)$ be the score of a randomly picked student, and suppose $Y$ is on the 78th percentile. Estimate the percentile rank of $X$."
]
},
{
"cell_type": "markdown",
"metadata": {},
"source": [
"**5. Least squares constant predictor**\n",
"\n",
"Let $X$ be a random variable with expectation $\\mu_X$ and SD $\\sigma_X$. Suppose you are going to use a constant $c$ as your predictor of $X$.\n",
"\n",
"**a)** Let $MSE(c)$ be the mean squared error of the predictor $c$. Write a formula for $MSE(c)$.\n",
"\n",
"**b)** Guess the value of $\\hat{c}$, the least squares constant predictor. Then prove that it is the least squares constant predictor.\n",
"\n",
"**c)** Find $MSE(\\hat{c})$."
]
},
{
"cell_type": "markdown",
"metadata": {},
"source": [
"**6. No-intercept regression**\n",
"\n",
"Sometimes data scientists want to fit a linear model that has no intercept term. For example, this might be the case when the data are from a scientific experiement in which the attribute $X$ can have values near $0$ and there is a physical reason why the response $Y$ must be $0$ when $X=0$.\n",
"\n",
"So let $(X, Y)$ be a random point and suppose you want to predict $Y$ by an estimator of the form $aX$ for some $a$. Find the least squares predictor $\\hat{Y}$ among all predictors of this form."
]
},
{
"cell_type": "markdown",
"metadata": {},
"source": [
"**7. Uncorrelated versus independent**\n",
"\n",
"Let $X$ have the uniform distribution on the three points $-1$, $0$, and $1$. Let $Y = X^2$.\n",
"\n",
"**a)** Show that $X$ and $Y$ are uncorrelated.\n",
"\n",
"**b)** Are $X$ and $Y$ independent?"
]
},
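{
"cell_type": "markdown",
"metadata": {},
"source": [
"*Optional check.* Because $X$ takes just three equally likely values, the expectations needed in Part **a** can be computed exactly by averaging over the three points. The sketch below (plain Python) prints $Cov(X, Y)$."
]
},
{
"cell_type": "code",
"execution_count": null,
"metadata": {},
"outputs": [],
"source": [
"# X is uniform on {-1, 0, 1} and Y = X^2.\n",
"xs = [-1, 0, 1]\n",
"E_X = sum(xs) / 3\n",
"E_Y = sum(x**2 for x in xs) / 3\n",
"E_XY = sum(x * x**2 for x in xs) / 3\n",
"\n",
"# Cov(X, Y) = E(XY) - E(X)E(Y); zero means uncorrelated.\n",
"print(E_XY - E_X * E_Y)"
]
},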
{
"cell_type": "markdown",
"metadata": {},
"source": [
"**8. Regression equation**\n",
"\n",
"The regression equation can be written in multiple forms. For any particular purpose, one of the forms might be more convenient than the others. So it is a good idea to recognize them.\n",
"\n",
"For $a^* = r\\frac{\\sigma_Y}{\\sigma_X}$, which of the following is the equation of the regression line for estimating $Y$ based on $X$? More than one is correct.\n",
"\n",
"(i) $Y = a^*X + (\\mu_Y - a^*\\mu_X)$\n",
"\n",
"(ii) $\\hat{Y} = a^*X + (\\mu_Y - a^*\\mu_X)$\n",
"\n",
"(iii) $\\hat{Y} = a^*(X - \\mu_X) + \\mu_Y$\n",
"\n",
"(iv) $\\displaystyle{\\hat{Y} = r\\frac{X - \\mu_X}{\\sigma_X}}$\n",
"\n",
"(v) $\\displaystyle{\\frac{\\hat{Y} - \\mu_Y}{\\sigma_Y} = r\\frac{X - \\mu_X}{\\sigma_X}}$"
]
},
{
"cell_type": "markdown",
"metadata": {},
"source": [
"**9. Average of the residuals**\n",
"\n",
"**a)** In Data 8 we say that the regression line passes through the point of averages. Show this by setting $X = \\mu_X$ and finding the corresponding value of $\\hat{Y}$.\n",
"\n",
"**b)** Find $E(\\hat{Y})$. In Data 8 language, this is the average of the fitted values.\n",
"\n",
"**c)** Let $D = Y - \\hat{Y}$ be the residual as in [Section 11.5](http://stat88.org/textbook/content/Chapter_11/05_The_Error_in_Regression.html) Find the expectation of the residual and confirm that the answer justifies the following statement [from Data 8](https://inferentialthinking.com/chapters/15/6/Numerical_Diagnostics.html#average-of-residuals):\n",
"\n",
"\"No matter what the shape of the scatter diagram, the average of the residuals is 0.\""
]
},
{
"cell_type": "markdown",
"metadata": {},
"source": [
"**10. Variance decomposition**\n",
"\n",
"In this exercise you will find the relation between the variances of $Y$, its regression estimate $\\hat{Y}$, and the residual $D = Y - \\hat{Y}$.\n",
"\n",
"**a)** Find $Var(\\hat{Y})$.\n",
"\n",
"**b)** Show that the answer to Part **a** justifies the following statement [from Data 8](https://inferentialthinking.com/chapters/15/6/Numerical_Diagnostics.html#another-way-to-interpret-r):\n",
"\n",
"$$\n",
"\\frac{\\text{SD of fitted values}}{\\text{SD of } y} ~ = ~ \\vert r \\vert\n",
"$$\n",
"\n",
"Note: Usually, the result above is stated in terms of variances instead of SDs, and hence $r^2$ is sometimes called \"the proportion of variability explained by the linear model\".\n",
"\n",
"**c)** Justify the *decomposition of variance* formula $Var(Y) = Var(\\hat{Y}) + Var(D)$. "
]
},
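{
"cell_type": "markdown",
"metadata": {},
"source": [
"*Optional simulation check.* The decomposition in Part **c** can be checked numerically. The sketch below (assuming NumPy, and taking $r = 0.6$ as an example) simulates standardized pairs with correlation $r$, forms $\\hat{Y} = rX$ in standard units, and compares $Var(Y)$ with $Var(\\hat{Y}) + Var(D)$."
]
},
{
"cell_type": "code",
"execution_count": null,
"metadata": {},
"outputs": [],
"source": [
"import numpy as np\n",
"\n",
"rng = np.random.default_rng(0)\n",
"r = 0.6  # assumed correlation for the check\n",
"\n",
"# Standardized pairs (X, Y) with correlation r.\n",
"Z1 = rng.standard_normal(100_000)\n",
"Z2 = rng.standard_normal(100_000)\n",
"X = Z1\n",
"Y = r * Z1 + np.sqrt(1 - r**2) * Z2\n",
"\n",
"Y_hat = r * X  # regression estimate in standard units\n",
"D = Y - Y_hat\n",
"\n",
"# The two printed values should be close.\n",
"print(Y.var(), Y_hat.var() + D.var())"
]
},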
{
"cell_type": "markdown",
"metadata": {},
"source": [
"**11. Regression accuracy**\n",
"\n",
"For a random mother-daughter pair, let $X$ be the height of the mother and $Y$ the height of the daughter. Suppose the correlation is $r(X, Y) = 0.6$ and let $\\sigma_Y = 2$ inches.\n",
"\n",
"Let $\\hat{Y}$ be the regression estimate of the daughter's height $Y$ based on the mother's height $X$, and let $D = Y - \\hat{Y}$ be the residual or error in the regression estimate.\n",
"\n",
"**a)** Find $\\sigma_D$.\n",
"\n",
"**b)** Fill in the blank with a percentage: There is at least $\\underline{~~~~~~~~~~}$ chance that the estimate $\\hat{Y}$ is correct to within $3.2$ inches. \n",
"\n",
"Find the best bound you can, and justify your answer."
]
}
],
"metadata": {
"kernelspec": {
"display_name": "Python 3",
"language": "python",
"name": "python3"
},
"language_info": {
"codemirror_mode": {
"name": "ipython",
"version": 3
},
"file_extension": ".py",
"mimetype": "text/x-python",
"name": "python",
"nbconvert_exporter": "python",
"pygments_lexer": "ipython3",
"version": "3.7.11"
}
},
"nbformat": 4,
"nbformat_minor": 4
}