In this post I will describe the setup of linear regression and its main results.
I will be looking at the fixed design setting, in which we assume that the covariates $x_i^0$ are fixed, and model the outcome variable $y_i^0$ as a linear function of the covariates plus some random error $\epsilon_i$.
Throughout I will denote the uncentred response vector by $Y^0$ and the uncentred design matrix by $\mathbf{X}^0$. The centred versions will be denoted by $Y$ and $\mathbf{X}$ respectively, where $\mathbf{X}$ has been centred column-wise.
References:
- Lecture notes - Chapter 3
Model Setup
\[\begin{align*} y_i^0 = \alpha + \beta^T x_i^0 + \epsilon_i, \quad i = 1, \dots, n \end{align*}\]For now I will not make any assumptions about the distribution of the errors $\epsilon_i$, but I will make these clear when making statements about estimation and inference.
I emphasise that the error terms $\epsilon_i$ are the only random part on the RHS of the equation. The covariates $x_i^0$ are fixed, and the parameters $\alpha$ and $\beta$ are unknown.
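As a concrete illustration, here is a minimal simulation of this setup; the particular values of $n$, $p$, $\alpha$, $\beta$ and $\sigma$ below are arbitrary choices made purely for illustration:

```python
import numpy as np

rng = np.random.default_rng(0)

n, p = 100, 3                                   # number of observations and covariates
alpha = 1.5
beta = np.array([2.0, -1.0, 0.5])               # "true" parameters, unknown in practice
sigma = 0.8                                     # error standard deviation

X0 = rng.normal(size=(n, p))                    # design matrix, generated once then treated as fixed
eps = rng.normal(scale=sigma, size=n)           # the only random part of the model
y0 = alpha + X0 @ beta + eps                    # y_i^0 = alpha + beta^T x_i^0 + eps_i
```

The simulated `X0` and `y0` are reused in the later sketches in this post.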
The objective is to estimate the parameters $\alpha$ and $\beta$, and the standard approach is to use the least squares estimator:
\[\begin{align*} (\hat{\alpha}, \hat{\beta}) := \underset{\alpha, \beta}{\argmin}||Y^0 - (\alpha \mathbf{1} + \mathbf{X}^0 \beta)||_2^2 \end{align*}\]Here we are using the usual convention that adding a real number to a vector means adding it to each element of the vector.
This least squares loss function can be motivated by assuming that the outcome variables $y_i^0$, conditional on the predictors $x_i^0$, are independent samples from the normal distribution $N(\alpha + \beta^T x_i^0, \sigma^2)$. Equivalently, this amounts to assuming that the errors $\epsilon_i$ are i.i.d. samples from the normal distribution $N(0, \sigma^2)$.
What I mean by motivated is that the MLE of $\alpha$ and $\beta$ under this probabilistic assumption is equal to the least squares estimator.
Showing that Least Squares is Equivalent to MLE under Probabilistic Assumptions
Assume $\{(\mathbf{x}_i, y_i) \mid i = 1, \dots, n\}$ are independent samples of the random variables $(\mathbf{X}, Y)$, with $Y \mid \mathbf{X} = \mathbf{x} \sim N(f(\mathbf{x};\mathbf{w}), \sigma^2)$.
We then can write that:
\[\begin{align} p(y_1, \dots, y_n \mid \mathbf{x}_1, \dots, \mathbf{x}_n; \bw, \sigma) &= \frac{\prod_{i=1}^{n}{p(\mathbf{x}_i,y_i;\bw, \sigma)}}{\prod_{i=1}^{n}{p(\mathbf{x}_i;\bw, \sigma)}} \\ &= \prod_{i=1}^{n}{p(y_i \mid \mathbf{x}_i;\bw, \sigma)} \end{align}\]And therefore under the distributional assumption the log-likelihood is given by:
\[\begin{align} \log(p(y_1, \dots, y_n \mid \mathbf{x}_1, \dots, \mathbf{x}_n; \bw, \sigma)) = C - \frac{\sum_{i=1}^{n}{(y_i - f(\mathbf{x}_i;\bw))^2}}{2\sigma^2} \end{align}\]where $C$ does not depend on $\bw$. Hence maximising the log-likelihood over $\bw$ is equivalent to minimising the sum of squares, i.e. the MLE of $\bw$ is given by solving the least squares problem.
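As a quick numerical sanity check of this equivalence, the sketch below (reusing the simulated `X0`, `y0` from above, and taking $f(\mathbf{x};\mathbf{w})$ to be linear with an intercept) maximises the Gaussian log-likelihood directly and compares the result with the least squares solution:

```python
import numpy as np
from scipy.optimize import minimize

# design matrix with an intercept column, reusing X0 and y0 from the simulation above
A = np.column_stack([np.ones(len(y0)), X0])

# least squares solution
w_ls, *_ = np.linalg.lstsq(A, y0, rcond=None)

# negative Gaussian log-likelihood (dropping additive constants), jointly in (w, log sigma)
def neg_log_lik(params):
    w, log_sigma = params[:-1], params[-1]
    sigma2 = np.exp(2 * log_sigma)
    resid = y0 - A @ w
    return 0.5 * len(y0) * np.log(sigma2) + 0.5 * resid @ resid / sigma2

w_mle = minimize(neg_log_lik, x0=np.zeros(A.shape[1] + 1), method="BFGS").x[:-1]

print(np.allclose(w_ls, w_mle, atol=1e-3))      # should print True, up to optimisation tolerance
```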
Alternatively, it can sometimes be easier to think in the fixed design setting, where the covariates $x_i^0$ are fixed, and the $y_i$'s are then sampled independently from the normal distribution $N(\alpha + \beta^T x_i^0, \sigma^2)$.
In that case we would just write the density as a function of the constants $\mathbf{x}_i$:
\[\begin{align} p(y_1, \dots, y_n; \bw, \sigma) = \prod_{i=1}^{n}{p(y_i; \mathbf{x}_i, \bw, \sigma)} \end{align}\]and the same argument as above shows that the MLE solves the least squares problem.
Estimation
From the objective: \(\begin{align*} (\hat{\alpha}, \hat{\beta}) := \underset{\alpha, \beta}{\argmin}||Y^0 - (\alpha \mathbf{1} + \mathbf{X}^0 \beta)||_2^2 \end{align*}\)
We have the OLS solutions:
\[\begin{align*} \hat{\alpha} &= \bar{Y}^0 - \bar{X}^0 \hat{\beta} \\ \hat{\beta} &= (\mathbf{X}^T\mathbf{X})^{-1}\mathbf{X}^TY \end{align*}\]We can see that $\hat{\beta}$ is a linear function of the response variable $Y$. We can also see that if an intercept term is included in the model then the objective is equivalent to solving:
\[\begin{align*} \hat{\beta} = \argmin_{\beta}||Y - \mathbf{X} \beta||_2^2 \end{align*}\]where $Y$ and $\mathbf{X}$ are the centered response variable and design matrix respectively.
Proof:
- Solution for intercept $\alpha$:
Minimising w.r.t. the constant $\alpha$, we only need the terms of the objective that depend on $\alpha$:
\[\begin{align*} \frac{\partial}{\partial \alpha} \left[ - 2\alpha(Y^0)^T \mathbf{1} + \alpha^2 (\mathbf{1}^T \mathbf{1}) + 2\alpha\beta^T (\mathbf{X}^0)^T \mathbf{1} \right] = -2(Y^0)^T \mathbf{1} + 2\alpha n + 2\beta^T (\mathbf{X}^0)^T \mathbf{1} \end{align*}\]Setting this to zero gives:
\[\begin{align*} \hat{\alpha} &= \frac{1}{n}\left[\left(\sum_{i = 1}^{n}y_i^0 - \left(\sum_{i = 1}^{n}(x_i^0)^T \beta \right) \right)\right] \\ &= \bar{Y}^0 - \bar{X}^0 \beta \end{align*}\]Where $\bar{X}^0$ is the mean row vector of the design matrix.
Plugging $\hat{\alpha}$ back into the objective gives:
\[\begin{align*} Y^0 - (\hat{\alpha} \mathbf{1} + \mathbf{X}^0 \beta) = (Y^0 - (\bar{Y}^0 \mathbf{1} - \mathbf{1}(\bar{X}^0 \beta) + \mathbf{X}^0 \beta)) \\ = (Y^0 - \bar{Y}^0 \mathbf{1}) - (\mathbf{X}^0 - \mathbf{1} \bar{X}^0 )\beta \\ = (Y -\mathbf{X} \beta) \end{align*}\]Where $Y$ and $\mathbf{X}$ are the centered response variable and design matrix respectively.
- Solution for $\hat{\beta}$:
Substituting $\hat{\alpha}$ back in, the objective becomes $||Y - \mathbf{X} \beta||_2^2$. Differentiating with respect to $\beta$ and setting to zero gives the normal equations:
\[\begin{align*} \mathbf{X}^T\mathbf{X}\hat{\beta} = \mathbf{X}^TY \end{align*}\]and hence, provided $\mathbf{X}^T\mathbf{X}$ is invertible, $\hat{\beta} = (\mathbf{X}^T\mathbf{X})^{-1}\mathbf{X}^TY$.
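A small numerical check of these solutions and of the centred-data equivalence, reusing the simulated `X0`, `y0` from above:

```python
import numpy as np

# fit with an explicit intercept column
A = np.column_stack([np.ones(len(y0)), X0])
coef, *_ = np.linalg.lstsq(A, y0, rcond=None)
alpha_hat, beta_hat = coef[0], coef[1:]

# fit on the column-wise centred data, with no intercept
Xc = X0 - X0.mean(axis=0)
yc = y0 - y0.mean()
beta_centred, *_ = np.linalg.lstsq(Xc, yc, rcond=None)

print(np.allclose(beta_hat, beta_centred))                              # same slope estimates
print(np.isclose(alpha_hat, y0.mean() - X0.mean(axis=0) @ beta_hat))    # alpha_hat = Ybar - Xbar beta_hat
```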
Geometric Interpretation
Least squares looks for the $\beta$ that minimises the Euclidean distance from $Y$ to a point in the column span of the design matrix $\mathbf{X}$. Therefore $\hat{\beta}$ is chosen such that $\mathbf{X}\hat{\beta}$ is the orthogonal projection of $Y$ onto the column space of $\mathbf{X}$. Recalling that the matrix that orthogonally projects vectors onto the column space of a matrix $\mathbf{X}$ is the projection matrix:
\[\mathbf{X}(\mathbf{X}^T\mathbf{X})^{-1}\mathbf{X}^T\]Unbiased
Under the model setup, and the additional assumption that $\mathbb{E}(\mathbf{\epsilon}) = 0$,
we have that:
\[\begin{align*} \mathbb{E}(Y) = \mathbf{X}\beta \, . \end{align*}\]
Using this we can show that $\hat{\beta}$ is an unbiased estimator of $\beta$:
\[\begin{align*} \mathbb{E}(\hat{\beta}) &= \mathbb{E}((\mathbf{X}^T\mathbf{X})^{-1}\mathbf{X}^TY) \\ &= (\mathbf{X}^T\mathbf{X})^{-1}\mathbf{X}^T\mathbb{E}(Y) \\ &= (\mathbf{X}^T\mathbf{X})^{-1}\mathbf{X}^T\mathbf{X}\beta \\ &= \beta \end{align*}\]Gauss-Markov Theorem
We have shown that $\hat{\beta}$ is unbiased, under the assumption that $\mathbb{E}(\mathbf{\epsilon}) = 0$.
If we further assume that $Var(\mathbf{\epsilon}) = \sigma^2 I$, then the Gauss-Markov Theorem holds.
The Gauss-Markov Theorem
For all constant vectors $\mathbf{a} \in \mathbb{R}^p$, and all unbiased estimators $\tilde{\beta}$ of $\beta$ that are linear in $Y$ (i.e. each component is of the form $t^TY$ for some constant vector $t$), we have:
\[\begin{align*} Var(\mathbf{a}^T\hat{\beta}) \le Var(\mathbf{a}^T\tilde{\beta}) \, . \end{align*}\]
In words, under the assumptions:
- $\mathbb{E}(\mathbf{\epsilon}) = 0$
- $Var(\mathbf{\epsilon}) = \sigma^2 I$, uncorrelated and homoscedastic errors.
The OLS estimator $\hat{\beta}$ is the best linear unbiased estimator (BLUE) of $\beta$, where by “best” we mean it has the lowest variance for every linear combination $\mathbf{a}^T\beta$.
It is important to note that the Gauss-Markov theorem does not require the errors to be normally distributed, nor does it require them to be independent: it only requires uncorrelated, homoscedastic errors with mean zero.
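A small simulation can illustrate the theorem. The alternative estimator below is a weighted least squares estimator with arbitrary, deliberately mis-specified weights; it is still linear in $Y$ and unbiased, so by Gauss-Markov its variance for any linear combination $\mathbf{a}^T\beta$ should be at least that of OLS. The particular dimensions, weights and the vector $\mathbf{a}$ are all arbitrary choices for illustration:

```python
import numpy as np

rng = np.random.default_rng(1)
n, p = 50, 2
X = rng.normal(size=(n, p))                      # fixed design (centred model, no intercept)
X = X - X.mean(axis=0)
beta = np.array([1.0, -2.0])
sigma = 1.0

# an alternative linear unbiased estimator: weighted least squares with arbitrary weights
w = rng.uniform(0.2, 5.0, size=n)                # deliberately "wrong" weights
W = np.diag(w)
H_ols = np.linalg.solve(X.T @ X, X.T)            # OLS:  beta_hat   = H_ols @ Y
H_wls = np.linalg.solve(X.T @ W @ X, X.T @ W)    # WLS:  beta_tilde = H_wls @ Y, still linear and unbiased

a = np.array([1.0, 1.0])                         # an arbitrary linear combination a^T beta

ols_est, wls_est = [], []
for _ in range(20000):
    eps = rng.normal(scale=sigma, size=n)        # uncorrelated, homoscedastic errors
    Y = X @ beta + eps
    ols_est.append(a @ (H_ols @ Y))
    wls_est.append(a @ (H_wls @ Y))

print(np.var(ols_est), np.var(wls_est))          # the OLS variance should be the smaller of the two
```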
Scaling
A quick note on scaling in linear regression.
If you scale the $i\text{th}$ variable, i.e. the $i\text{th}$ column of $\mathbf{X}$, by a factor $a$,
then the inverse scaling is applied to the $i\text{th}$ coefficient $\hat{\beta}_i$:
\[\begin{align*} \frac{\hat{\beta}_i}{a} \end{align*}\]If you scale the response variable $Y$ by a factor $a$, then the coefficients are scaled by the same factor.
\[\begin{align*} a \, \hat{\beta}_i \end{align*}\]These can both be shown using the OLS solution $\hat{\beta} = (\mathbf{X}^T\mathbf{X})^{-1}\mathbf{X}^TY$.
Scaling predictor:
Denote the scaled predictor by \(X^*_{(i)} = a \cdot X_{(i)}\).
This can be written using a diagonal scale matrix (assume $i$ = 2):
\[\begin{align*} W = \begin{bmatrix} 1 & 0 & \dots & 0 \\ 0 & a & \dots & 0 \\ \vdots & \vdots & \ddots & \vdots \\ 0 & 0 & \dots & 1 \end{bmatrix} \end{align*}\]then we can write $X^* = XW$.
\[\begin{align*} \hat{\beta}^* &= ((\mathbf{X}W)^T\mathbf{X}W)^{-1}(\mathbf{X}W)^T Y \\ &= (W\mathbf{X}^T\mathbf{X}W)^{-1}W\mathbf{X}^TY \\ &= W^{-1}(\mathbf{X}^T\mathbf{X})^{-1}W^{-1}W\mathbf{X}^TY \\ &= W^{-1}\hat{\beta} \end{align*}\]using that $W$ is symmetric and invertible, and therefore $\hat{\beta}^*_i = \hat{\beta}_i / a \, $.
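Both scaling facts are easy to verify numerically, again reusing the simulated `X0`, `y0`; the factor `a` and the choice of column are arbitrary:

```python
import numpy as np

A = np.column_stack([np.ones(len(y0)), X0])
beta_hat, *_ = np.linalg.lstsq(A, y0, rcond=None)

a = 10.0
A_scaled = A.copy()
A_scaled[:, 2] *= a                                  # scale the 2nd covariate (column 2 of A, after the intercept)
beta_scaled, *_ = np.linalg.lstsq(A_scaled, y0, rcond=None)
print(np.isclose(beta_scaled[2], beta_hat[2] / a))   # that coefficient is divided by a

beta_y, *_ = np.linalg.lstsq(A, a * y0, rcond=None)  # scale the response instead
print(np.allclose(beta_y, a * beta_hat))             # all coefficients are multiplied by a
```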
Intervals and Testing
Model parameters
In order to carry out testing and construct confidence intervals, we additionally assume that:
\[\begin{align*} \epsilon \sim N(0, \sigma^2 I) \, . \end{align*}\]Under this assumption we then have that:
\[\begin{align*} \hat{\beta} \sim N(\beta, \sigma^2 (\mathbf{X}^T\mathbf{X})^{-1}) \end{align*}\]therefore in order to have a pivotal quantity we also require an estimate of the variance of the errors $\sigma^2$.
An unbiased estimator of $\sigma^2$:
\[\begin{align*} \hat{\sigma}^2 = \frac{\|Y - X\hat{\beta}\|^2}{n - p} \end{align*}\]and therefore an unbiased estimator of the variance of $\hat{\beta}$:
\[\begin{align*} \hat{\sigma}^2 (\mathbf{X}^T\mathbf{X})^{-1} \end{align*}\]it can then be shown that:
\[\begin{align*} \frac{\hat{\beta}_i - \beta_i}{\hat{\sigma}_{\hat{\beta}_i}} \sim t_{n-p} \end{align*}\]where $\hat{\sigma}_{\hat{\beta}_i} = \hat{\sigma}\sqrt{[(\mathbf{X}^T\mathbf{X})^{-1}]_{ii}}$ is the estimated standard error of $\hat{\beta}_i$,
and this pivotal quantity can then be used to construct confidence intervals and carry out hypothesis tests (e.g. a two-tailed test of $H_0: \beta_i = 0$).
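A minimal sketch of these computations, again on the simulated data, working directly from the formulas above rather than through a regression library (the 95% level is an arbitrary choice):

```python
import numpy as np
from scipy import stats

A = np.column_stack([np.ones(len(y0)), X0])      # design matrix including the intercept column
n, p = A.shape

beta_hat = np.linalg.solve(A.T @ A, A.T @ y0)
resid = y0 - A @ beta_hat
sigma2_hat = resid @ resid / (n - p)             # unbiased estimate of sigma^2
cov_hat = sigma2_hat * np.linalg.inv(A.T @ A)    # estimated covariance of beta_hat
se = np.sqrt(np.diag(cov_hat))                   # standard errors

t_stats = beta_hat / se                          # t-statistics for H0: beta_i = 0
p_values = 2 * stats.t.sf(np.abs(t_stats), df=n - p)

t_crit = stats.t.ppf(0.975, df=n - p)            # 95% confidence intervals
ci = np.column_stack([beta_hat - t_crit * se, beta_hat + t_crit * se])

print(np.column_stack([beta_hat, se, t_stats, p_values]))
print(ci)
```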
An important thing to note is that when carrying out linear regression in practice, in R for example, the estimators listed above are computed using the QR decomposition of the design matrix, which is more numerically stable. The lower level functions used by R are described in this blog post. In addition, the distributions of the estimators presented above can be derived using the QR decomposition (as is done in the referenced lecture notes).
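For reference, here is a rough sketch of the QR approach; this is only the spirit of what `lm()` does internally, as the actual R routines are more careful (pivoting, rank checks, etc.):

```python
import numpy as np
from scipy.linalg import solve_triangular

A = np.column_stack([np.ones(len(y0)), X0])
Q, R = np.linalg.qr(A)                           # thin QR: A = QR with R upper triangular
beta_qr = solve_triangular(R, Q.T @ y0)          # solve R beta = Q^T y; avoids forming (A^T A)^{-1}
print(np.allclose(beta_qr, np.linalg.lstsq(A, y0, rcond=None)[0]))
```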
Prediction intervals
There are two types of intervals one may want to estimate: a prediction interval for a new observation given a feature vector $x$, and a confidence interval for the mean response given a feature vector $x$.
This stack exchange answer explains the difference.
Predicting a new response:
When predicting a new response we need to take into account both the variance from estimating $\beta$ and the variance of the error $\epsilon$ in our model. Our point prediction is the estimated expected response:
\[\begin{align*} x^T\hat{\beta} \end{align*}\]and the response that we do not observe will be:
\[\begin{align*} \tilde{y} = x^T\beta + \epsilon \end{align*}\]for some new error $\epsilon$, which we assume is independent of $\hat{\beta}$.
We therefore have that:
\[\begin{align*} \tilde{y} - x^T\hat{\beta} \sim N(0 \, , \, \sigma^2(1 + x^T(\mathbf{X}^T\mathbf{X})^{-1}x) \, ) \end{align*}\]and similarly to the t-statistic for the coefficients, we have the following t-statistic for the prediction of a new response:
\[\begin{align*} \frac{\tilde{y} - \hat{\beta}^T x}{\hat{\sigma}\sqrt{1 + x^T(\mathbf{X}^T\mathbf{X})^{-1}x}} \sim t_{n-p} \end{align*}\]hence we can construct the following prediction interval:
\[\begin{align*} x^T\hat{\beta} \pm t_{n-p}(\alpha) \hat{\sigma}\sqrt{1 + x^T(\mathbf{X}^T\mathbf{X})^{-1}x} \end{align*}\]Then for the mean response confidence interval we just remove the plus 1 that appears in the variance.
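A sketch of both intervals for a new feature vector $x$; the particular `x_new` is an arbitrary choice, and `A`, `beta_hat`, `sigma2_hat`, `n`, `p` are as computed in the earlier interval sketch:

```python
import numpy as np
from scipy import stats

x_new = np.array([1.0, 0.5, -1.0, 2.0])          # intercept term followed by the covariate values
pred = x_new @ beta_hat                          # point prediction x^T beta_hat

leverage = x_new @ np.linalg.inv(A.T @ A) @ x_new
t_crit = stats.t.ppf(0.975, df=n - p)

# prediction interval for a new response (the "+ 1" accounts for the new error term)
pi = pred + np.array([-1.0, 1.0]) * t_crit * np.sqrt(sigma2_hat * (1 + leverage))
# confidence interval for the mean response (no "+ 1")
ci_mean = pred + np.array([-1.0, 1.0]) * t_crit * np.sqrt(sigma2_hat * leverage)

print(pi)
print(ci_mean)
```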