12  Orthogonal vectors and vector spaces

\[ \newcommand {\mat}[1] { \boldsymbol{#1} } \newcommand {\yields} { \; \Rightarrow \; } \newcommand {\rank}[1] { \text{rank}\left(#1\right) } \newcommand {\nul}[1] { \mathcal{N}\left(#1\right) } \newcommand {\csp}[1] { \mathcal{C}\left(#1\right) } \newcommand {\rsp}[1] { \mathcal{R}\left(#1\right) } \newcommand {\lsp}[1] { \mathcal{L}\left(#1\right) } \renewcommand {\vec}[1] {\mathbf{#1}} \] \[ \newcommand {\hair} { \hspace{0.5pt} } \newcommand {\dd}[2] { \frac{\mathrm{d} \hair #1}{ \mathrm{d} \hair #2} } \newcommand {\tdd}[2] { \tfrac{\mathrm{d} #1}{\mathrm{d} #2} } \newcommand {\pp}[2] { \frac{\partial #1}{\partial #2} } \newcommand {\indint}[2] { \int \! {#1} \, \mathrm{d}{#2} } \newcommand {\defint}[4] { \int_{#1}^{#2} \! {#3} \, \mathrm{d}{#4} } \]

\[ \newcommand{\unit}[1]{\,\mathrm{#1}} \newcommand{\degC}{^\circ{\mathrm C}} \newcommand{\tothe}[1]{\times 10^{#1}} \]

12.1 Independence, norm and orthogonality

Perhaps you have already encountered the concept of independence or dependence intuitively when performing Gaussian elimination and concluding that some rows or columns in a matrix can be expressed as linear combinations of other rows or columns (dependence). Here we formally introduce the important concept of independence of a pair of vectors or of a vector and a set of vectors, and implicitly that of dependence as the lack of independence.

Definition 12.1 (Independence of vectors) Two non-zero vectors \(\vec{x}\) and \(\vec{y}\) are said to be independent if one cannot be expressed as a scalar multiple of the other, i.e. there exists no scalar \(a\) such that \(\vec{y} = a \vec{x}\).

A (non-zero) vector \(\vec{y}\) is independent of a set of vectors \(\left\{ \vec{x}_i | i=1 \ldots n \right\}\) if there exist no scalars \(a_i\) such that \(\vec{y}\) can be expressed as the linear combination \(\sum_{i=1}^n a_i \vec{x}_i\). In other words, \(\vec{y}\) is independent of the set \(\vec{x}_i\) if it is not a member of the vector space spanned by the set \(\left\{ \vec{x}_i | i=1 \ldots n \right\}\).

Being or not being a member of a vector space is an important property, for example for the solvability of the equation \(\mat{A} \vec{x} = \vec{b}\). If \(\vec{b}\) is not in the column space \(\csp{\mat{A}}\), i.e. if \(\vec{b}\) is independent of the vectors that are the columns of \(\mat{A}\), then the equation has no solution.

Independence of vectors is closely related to, but not the same as, orthogonality: a pair of orthogonal vectors is independent, but not every pair of independent vectors is orthogonal. Orthogonality and independence of a pair of vectors have a geometric counterpart: a pair of vectors is orthogonal if they are oriented at a right angle (perpendicular) relative to each other, whereas two independent vectors can in general be oriented at any angle other than 0 or 180 degrees relative to each other.

To discuss the algebraic properties of orthogonality and independence we need to introduce the notion of the Euclidean norm, or simply norm, of a vector. The norm of a vector is a non-negative number that we also call its size or its length.

Definition 12.2 (The norm of a vector) The Euclidean norm of a vector \(\vec{x}\) with \(n\) components is written as \(||\vec{x}||\), and equals

\[ ||\vec{x}|| = \sqrt{\sum_{i=1}^n x_i^2} \]

This can also be written in terms of a matrix multiplication as

\[ ||\vec{x}||^2 = \begin{bmatrix} x_1 & x_2 & \cdots & x_n\end{bmatrix} \begin{bmatrix} x_1 \\ x_2 \\ \vdots \\ x_n\end{bmatrix} = \vec{x}^T \vec{x} \]

where \(\vec{x}^T\) is the transpose of the vector \(\vec{x}\).
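
As a quick numerical check, here is a minimal sketch in Python with NumPy (the vector is an arbitrary example): the norm computed from the sum of squared components agrees with the matrix-product form \(\vec{x}^T \vec{x}\) and with NumPy's built-in norm function.

```python
import numpy as np

x = np.array([3.0, 4.0, 12.0])          # an arbitrary example vector

norm_from_sum = np.sqrt(np.sum(x**2))   # square root of the sum of squared components
norm_from_dot = np.sqrt(x @ x)          # equivalently, sqrt(x^T x)

print(norm_from_sum, norm_from_dot, np.linalg.norm(x))   # all three give 13.0
```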

We are now in a position to define orthogonality of two vectors \(\vec{x}\) and \(\vec{y}\) in terms of algebraic properties. We will look at their norms \(||\vec{x}||\) and \(||\vec{y}||\) and the norm of their difference \(||\vec{x} - \vec{y}||\). These vectors and their difference form a triangle. When \(\vec{x}\) and \(\vec{y}\) are perpendicular we say that they are orthogonal. Pythagoras’ theorem for the sides of a right triangle (Figure 12.1) then says that the vectors \(\vec{x}\) and \(\vec{y}\) are orthogonal if

\[ ||\vec{x} - \vec{y}||^2 = ||\vec{x}||^2 + ||\vec{y}||^2 \tag{12.1}\]

Figure 12.1: When two vectors \(\vec{x}\) and \(\vec{y}\) are at a right angle then the theorem of Pythagoras applies to the relation between their lengths and that of their difference \(\vec{x} - \vec{y}\) (in red).

The consequence of Equation 12.1 can be worked out in terms of the components \(x_i\) and \(y_i\) of these vectors:

\[ \begin{align*} \sum_{i=1}^n {\left( x_i - y_i \right)}^2 & = \sum_{i=1}^n x_i^2 + \sum_{i=1}^n y_i^2 \yields \\ \sum_{i=1}^n x_i^2 - 2 \sum_{i=1}^n x_i y_i + \sum_{i=1}^n y_i^2 & = \sum_{i=1}^n x_i^2 + \sum_{i=1}^n y_i^2 \end{align*} \]

Hence, when the two vectors \(\vec{x}\) and \(\vec{y}\) are orthogonal it must be true that

\[ \sum_{i=1}^n x_i y_i = 0 \]

or, written as a matrix multiplication: when the two vectors \(\vec{x}\) and \(\vec{y}\) are orthogonal then

\[ \vec{x}^T \vec{y} = 0 \]

The expression \(\vec{x}^T \vec{y}\) is also known as the inner product or dot product of \(\vec{x}\) and \(\vec{y}\), and is written as \(\vec{x} \cdot \vec{y}\) as well. The fact that two vectors \(\vec{x}\) and \(\vec{y}\) are orthogonal is often expressed as \(\vec{x} \perp \vec{y}\).

Is the result of the inner product a vector or a scalar?

Answer A scalar: the product of a \(1 \times n\) row vector and an \(n \times 1\) column vector is a \(1 \times 1\) matrix, i.e. a single number.
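
The orthogonality test is a one-line computation. Below is a small NumPy sketch with made-up vectors: the inner product of two orthogonal vectors is zero, whereas independent but non-orthogonal vectors give a non-zero inner product.

```python
import numpy as np

x = np.array([1.0, 2.0, 2.0])
y = np.array([2.0, -1.0, 0.0])   # chosen so that x . y = 0
z = np.array([1.0, 1.0, 0.0])    # independent of x, but not orthogonal to it

print(x @ y)   # 0.0 -> x and y are orthogonal
print(x @ z)   # 3.0 -> x and z are not orthogonal, though still independent
```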

It is easy to show that two orthogonal vectors are independent.

Proof. Suppose \(\vec{x} \perp \vec{y}\), or \(\vec{x}^T \vec{y} = 0\). Dependence of \(\vec{x}\) and \(\vec{y}\) would mean that there would exist a scalar \(a\) such that \(\vec{y} = a \vec{x}\). Left-multiplying both sides of the latter equation by \(\vec{x}^T\), this would mean

\[ \vec{x}^T\vec{y} = a \vec{x}^T\vec{x} \] or

\[ 0 = a \vec{x}^T\vec{x} = a ||\vec{x}||^2 \]

The only way this can be true is if \(a=0\), because \(\vec{x} \neq \vec{0}\) implies \(||\vec{x}||^2 > 0\). But \(a = 0\) would mean that \(\vec{y} = \vec{0}\), which contradicts the requirement in Definition 12.1 that both vectors are non-zero. Hence \(\vec{x}\) and \(\vec{y}\) cannot be dependent; they are independent.

In general, if we have a pair of independent vectors \(\vec{x}\) and \(\vec{y}\) then, if \(\vec{y}\) is not orthogonal to \(\vec{x}\) it can at least be written as the sum of an orthogonal component \(\vec{y}_o\) and a parallel component \(\vec{y}_p\) (see Figure 12.2). The parallel component \(\vec{y}_p\) is constructed as the orthogonal projection of \(\vec{y}\) onto \(\vec{x}\). How do we calculate such a projection? We know that it is some factor \(a\) times \(\vec{x}\), or \(\vec{y}_p = a \vec{x}\). Also, the vector difference between \(\vec{y}\) and \(a \vec{x}\) is, by definition of the construction, orthogonal to \(\vec{x}\), or

\[ (\vec{y} - a \vec{x}) \perp \vec{x} \]

which implies that their inner product must equal zero

\[ \begin{align} \vec{x}^T (\vec{y} - a \vec{x}) & = 0 \\ a \vec{x}^T \vec{x} & = \vec{x}^T \vec{y} \\ a & = \frac{\vec{x}^T \vec{y}}{\vec{x}^T \vec{x}} = \frac{\vec{x} \cdot \vec{y}}{||\vec{x}||^2} \end{align} \tag{12.2}\]

Therefore, the perpendicular projection of \(\vec{y}\) onto \(\vec{x}\) equals

\[ \vec{y}_p = a \vec{x} = \frac{\vec{x}\cdot\vec{y}}{||\vec{x}||^2} \vec{x} \]

It is now also easy to calculate the orthogonal component \(\vec{y}_o\) of \(\vec{y}\), because \(\vec{y} = \vec{y}_o + \vec{y}_p\), hence \(\vec{y}_o = \vec{y} - \vec{y}_p\).

From Figure 12.2 we can also see that there is a relation between the cosine of the angle \(\theta\) between \(\vec{y}\) and \(\vec{x}\) and the norm of the orthogonal projection, namely

\[ \cos{(\theta)} = \frac{||\vec{y}_p||}{||\vec{y}||} = \frac{a ||\vec{x}||}{||\vec{y}||} \tag{12.3}\]

In case \(\vec{y} \perp \vec{x}\) then \(\cos{(\theta)} = 0\) and the projection \(\vec{y}_p\) would disappear because \(||\vec{y}_p||=0\).

Figure 12.2: Two independent vectors, \(\vec{x}\) and \(\vec{y}\), where \(\vec{y}\) is decomposed as the sum \(\vec{y} = \vec{y}_o + \vec{y}_p\) of an orthogonal, independent component \(\vec{y}_o\) and a parallel, dependent \(\vec{y}_p\) component (both in red) of \(\vec{x}\). The parallel vector \(\vec{y}_p\) is constructed as the orthogonal projection of \(\vec{y}\) onto \(\vec{x}\).

The angle between two vectors

From Equation 12.3 and Equation 12.2 we can calculate the angle between two arbitrary vectors \(\vec{x}\) and \(\vec{y}\). In words: the cosine of the angle between \(\vec{x}\) and \(\vec{y}\) equals their inner product divided by the product of their lengths.

\[ \cos(\theta) = \frac{\vec{x} \cdot \vec{y} \, ||\vec{x}||}{||\vec{y}|| \, ||\vec{x}||^2} = \frac{\vec{x} \cdot \vec{y}}{||\vec{y}|| \, ||\vec{x}||} \]
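
The projection and angle formulas translate directly into code. The sketch below (with arbitrary example vectors) computes the factor \(a\) from Equation 12.2, the components \(\vec{y}_p\) and \(\vec{y}_o\), and the angle \(\theta\), and checks that \(\vec{y}_o \perp \vec{x}\).

```python
import numpy as np

x = np.array([2.0, 0.0, 0.0])
y = np.array([1.0, 1.0, 0.0])

a   = (x @ y) / (x @ x)          # Equation 12.2
y_p = a * x                      # parallel component: projection of y onto x
y_o = y - y_p                    # orthogonal component

cos_theta = (x @ y) / (np.linalg.norm(x) * np.linalg.norm(y))
theta_deg = np.degrees(np.arccos(cos_theta))

print(y_p, y_o)     # [1. 0. 0.] and [0. 1. 0.]
print(x @ y_o)      # 0.0: the orthogonal component is indeed perpendicular to x
print(theta_deg)    # 45.0 degrees
```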

Orthogonality of vector spaces, row space and left null space

There is a natural extension from the orthogonality of a pair of vectors to that of a vector to a vector space:

Definition 12.3 (Orthogonality of a vector and a vector space) A vector \(\vec{y}\) is said to be orthogonal to a vector space \(X\) if it is orthogonal to every vector \(\vec{x}\) in that vector space.

Since a vector space is defined as the span of a set of vectors \(\{\vec{x}_i|i=1\ldots n\}\), it is sufficient to require that \(\vec{y}\) is orthogonal to every vector in that set.

From the orthogonality of a vector to a vector space it is only one step to the concept of orthogonality of vector spaces themselves.

Definition 12.4 (Orthogonality of vector spaces) Two vector spaces \(X\) and \(Y\) are orthogonal if any pair of vectors \(\vec{x}\) from \(X\) and \(\vec{y}\) from \(Y\) are orthogonal.

Example 12.1 A plane in 3-dimensional space containing the origin and a line through the origin perpendicular to that plane are equivalent to orthogonal vector spaces.

The first case of orthogonal spaces that we treat is the vector space that is orthogonal to the null space \(\nul{\mat{A}}\) of a matrix \(\mat{A}\). The null space is defined as the space consisting of all vectors \(\vec{x}\) for which \(\mat{A}\vec{x} = \vec{0}\). Let’s have a closer look at what this means when \(\mat{A}\) is an \(m\) by \(n\) matrix:

\[ \begin{bmatrix} a_{11} & a_{12} & \cdots & a_{1n} \\ a_{21} & a_{22} & \cdots & a_{2n} \\ \vdots & \vdots & & \vdots \\ a_{m1} & a_{m2} & \cdots & a_{mn} \end{bmatrix} \begin{bmatrix} x_1 \\ x_2 \\ \vdots \\ x_n \end{bmatrix} = \begin{bmatrix} 0 \\ 0 \\ \vdots \\ 0 \end{bmatrix} \]

But this must mean that for every row vector \(\vec{a}_j\) (\(j = 1 \ldots m\)) in \(\mat{A}\) the inner product \(\vec{a}_j \cdot \vec{x} = 0\), i.e.

\[ \begin{align*} \begin{bmatrix} a_{11} & a_{12} & \cdots & a_{1n} \end{bmatrix} \begin{bmatrix} x_1 \\ x_2 \\ \vdots \\ x_n \end{bmatrix} & = 0 \\ \begin{bmatrix} a_{21} & a_{22} & \cdots & a_{2n} \end{bmatrix} \begin{bmatrix} x_1 \\ x_2 \\ \vdots \\ x_n \end{bmatrix} & = 0 \\ \vdots & \vdots \\ \begin{bmatrix} a_{m1} & a_{m2} & \cdots & a_{mn} \end{bmatrix} \begin{bmatrix} x_1 \\ x_2 \\ \vdots \\ x_n \end{bmatrix} & = 0 \\ \end{align*} \]

In other words, if we define all linear combinations of these row vectors as the row space (just like all linear combinations of the column vectors is the column space), then the null space of \(\mat{A}\) must be orthogonal to the row space of \(\mat{A}\). The row space of \(\mat{A}\) is actually the column space of its transpose \(\mat{A}^T\). We designate the row space of \(\mat{A}\) as \(\rsp{\mat{A}}\). Hence, the statements that we just made can be written symbolically as \(\mathcal{N}(\mat{A}) \perp \rsp{\mat{A}}\), and \(\rsp{\mat{A}} = \csp{\mat{A}^T}\).

We have, until now, defined three vector spaces related to a matrix \(\mat{A}\).

  • The null space \(\mathcal{N}(\mat{A})\) is the set of all solutions \(\vec{x}\) to \(\mat{A}\vec{x}=\vec{0}\)
  • The column space \(\csp{\mat{A}}\) is the space spanned by the column vectors of \(\mat{A}\)
  • The row space \(\rsp{\mat{A}}\) is the space spanned by the column vectors of \(\mat{A}^T\)

The last of the so-called fundamental vector spaces is the left null space of \(\mat{A}\), which (you may have guessed it) is defined as

  • The left null space \(\lsp{\mat{A}}\) is the set of all solutions \(\vec{y}\) to \(\mat{A}^T\vec{y}=\vec{0}\)

Vectors from the null space have \(n\) entries, whereas vectors \(\vec{y}\) from the left null space have \(m\) entries, the number of rows in \(\mat{A}\). In addition, and analogous to the orthogonality of the null space and the row space \(\nul{\mat{A}} \perp \rsp{\mat{A}}\), we can also prove orthogonality of the left null space and the column space: \(\lsp{\mat{A}} \perp \csp{\mat{A}}\).

The two pairs of orthogonal spaces, \(\nul{\mat{A}}\) and \(\rsp{\mat{A}}\) on the one hand and \(\lsp{\mat{A}}\) and \(\csp{\mat{A}}\) on the other, are orthogonal complements of each other within \(\mathbb{R}^n\) and \(\mathbb{R}^m\), respectively. This means that each pair of spaces together describes all possible vectors in \(\mathbb{R}^n\) or \(\mathbb{R}^m\): any vector in \(\mathbb{R}^n\) can be written as a linear combination of vectors from \(\nul{\mat{A}}\) and \(\rsp{\mat{A}}\), and any vector in \(\mathbb{R}^m\) can be written as a linear combination of vectors from \(\lsp{\mat{A}}\) and \(\csp{\mat{A}}\).
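
These orthogonality statements can be checked numerically. The sketch below uses a small rank-1 matrix, chosen only for illustration, and takes bases for the row space and null space from the singular value decomposition (used here purely as a convenient way to obtain orthonormal bases):

```python
import numpy as np

A = np.array([[1.0, 2.0, -1.0],
              [2.0, 4.0, -2.0]])        # rank 1: the second row is 2 times the first

# Rows of Vt belonging to (near-)zero singular values span the null space;
# the remaining rows span the row space.
U, s, Vt = np.linalg.svd(A)
rank = np.sum(s > 1e-10)

row_space_basis  = Vt[:rank]            # basis of R(A) = C(A^T)
null_space_basis = Vt[rank:]            # basis of N(A)

# every null-space basis vector is orthogonal to every row-space basis vector
print(np.allclose(null_space_basis @ row_space_basis.T, 0))   # True
# and A x = 0 for every null-space basis vector
print(np.allclose(A @ null_space_basis.T, 0))                 # True
```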

12.2 Data vectors, data matrix

As scientists we may wonder whether there exist relations between measurements made on objects or people or samples. For example, when measuring altitude and average temperature of towns we might expect a relation: towns located at higher altitudes probably have lower average temperature, in general. The relation will not be perfect, some towns will have geographical conditions that cause them to deviate from the general trend. Yet, how can we numerically characterize the global trend?

Suppose we have \(n\) towns and call the altitude of a town with index number \(i\) \(x_i\) and its average temperature \(y_i\) (see Figure 12.3). We can then have two views of these measurements.

Figure 12.3: Example of a data set with two variables and 349 samples. The rows can be regarded as 349 vectors in 2-dimensional space, or the columns can be regarded as two vectors in 349-dimensional space.
  1. We can regard the tuples \((x_i, y_i)^T\) as (column) vectors \(\vec{v}_i\) and display them in a 2-dimensional scatterplot with an \(x\) and \(y\)-axis as in Figure 12.4. This is the usual way in which we display such pairs of measurements. In the figure we see that there is a negative correlation between the variables \(x\) and \(y\), because the points are scattered around a line with a negative slope.
Figure 12.4: Variables altitude (\(x\), in m) and temperature (\(y\), in degrees C). The dots in this graph show the endpoints of the vectors \(\mathbf{v}_i = (x_i, y_i)^T\), as demonstrated for two data pairs at the red dots. Data from Mooij et al. (2016).
  2. An alternative way is to regard the column vectors \(\vec{x} = (x_1, x_2, \ldots, x_n)^T\) and \(\vec{y} = (y_1, y_2, \ldots, y_n)^T\) as two vectors in \(\mathbb{R}^n\). These are called the data vectors.

Clearly, since the space in which data vectors live has the same number of dimensions \(n\) as the number of objects that were measured, we cannot usually display them unless \(n=2\) or \(3\). Therefore, in most cases these data vectors require a bit more imagination and are more abstract than the set of vectors \((x_i,y_i)^T\). Yet, data vectors are a useful concept, as we will see.

You could ask how a positive (or negative) correlation as displayed in Figure 12.4 is reflected in the relative positions of the data vectors \(\vec{x}\) and \(\vec{y}\) in that \(n\)-dimensional space. Let’s take an extreme case: suppose we have a perfect proportional dependency between the values \(x_i\) and \(y_i\), for example \(y_i = \beta x_i\) for all \(i= 1 \ldots n\), where \(\beta\) is a scalar. Then \(\vec{x}\) and \(\vec{y} = \beta \vec{x}\) are two perfectly aligned vectors. The angle \(\theta\) between these vectors equals 0 or 180 degrees, depending on the sign of \(\beta\) (\(\text{sgn}(\beta)\)) and the cosine of this angle equals \(\pm1\) since

\[ \cos(\theta) = \frac{\vec{x} \cdot \vec{y}}{||\vec{x}|| \, ||\vec{y}||} = \frac{\vec{x} \cdot \beta \, \vec{x}}{||\vec{x}||\, ||\beta \,\vec{x}||} = \frac{\beta \, \vec{x}\cdot\vec{x}}{||\vec{x}||\, |\beta| \,||\vec{x}||} = \frac{\beta}{|\beta|} \frac{||\vec{x}||^2}{||\vec{x}||^2} = \text{sgn}({\beta})\cdot 1 \]

However, for a general linear relation like \(\vec{y} = \alpha \vec{1} + \beta \vec{x}\), similar to the one that we see in Figure 12.4, this is not the case. In that case we first need to transform \(\vec{y}\) and \(\vec{x}\) before a similar relation for the angle between them holds.

Centering and normalizing

The transformation that we need to perform is centering of the vectors \(\vec{x}\) and \(\vec{y}\) around their means. This means that we have to subtract the averages of the \(x_i\) or \(y_i\) from every \(x_i\) and \(y_i\), respectively. Hence, we calculate

\[ \begin{align*} x_i' &= x_i - \overline{x} \\ y_i' &= y_i - \overline{y} \end{align*} \]

In terms of the data vectors this boils down to

\[ \begin{align*} \vec{x}' &= \vec{x} - \overline{x} \vec{1} \\ \vec{y}' &= \vec{y} - \overline{y} \vec{1} \end{align*} \]

where \(\vec{1}\) is the vector of ones (every entry equal to 1) with the same dimension as the data vectors (note that the averages \(\overline{x}\) and \(\overline{y}\) are scalars). We call \(\vec{x}'\) and \(\vec{y}'\) the centered transformations of \(\vec{x}\) and \(\vec{y}\). If there is a perfect linear relation between \(x_i\) and \(y_i\), for example \(y_i = \alpha + \beta x_i\) or \(\vec{y} = \alpha \vec{1} + \beta \vec{x}\), then

\[ \overline{y} = \frac{1}{n} \sum y_i = \frac{1}{n} \sum \left( \alpha + \beta x_i \right) = \alpha + \beta \overline{x} \] and

\[ \begin{align*} \vec{x}' &= \vec{x} - \overline{x} \vec{1} \\ \vec{y}' &= \alpha \vec{1} + \beta \vec{x} - \alpha \vec{1} - \beta \overline{x} \vec{1} \\ &= \beta \left( \vec{x} - \overline{x} \vec{1}\right) \\ &= \beta \vec{x}' \end{align*} \] which means that, even though the original vectors are not proportional (but are linearly related!), the centered vectors \(\vec{x}'\) and \(\vec{y}'\) would be proportional. Hence, the angle between these vectors would be 0 or 180 degrees, depending on the sign of \(\beta\), and the cosine of the angle would be equal to \(\pm 1\).

Conversely, if \(\vec{x}\) and \(\vec{y}\) were vectors with random, statistically independent entries \(x_i\) and \(y_i\), then the cosine of the angle will on average be close to 0 and their angle will be close to 90 degrees. To see this, let’s express the angle between the centered vectors in terms of the original vectors:

\[ \begin{align*} \cos{(\theta)} &= \frac{\vec{x}' \cdot \vec{y}'}{||\vec{x}'||\,||\vec{y}'||} = \frac{(\vec{x} - \overline{x}\vec{1}) \cdot (\vec{y} - \overline{y} \vec{1})}{||\vec{x} - \overline{x}\vec{1}||\,||\vec{y} - \overline{y}\vec{1}||} \\ &= \frac{\sum (x_i - \overline{x})(y_i - \overline{y})}{\sqrt{\sum(x_i - \overline{x})^2} \sqrt{\sum(y_i - \overline{y})^2}} \end{align*} \tag{12.4}\]

If \(x_i\) and \(y_i\) are independent random numbers, possibly with different means and standard deviations, then the products \((x_i - \overline{x})(y_i - \overline{y})\) will on average be equally often negative as positive, and their sum will on average cancel out. So, on average the cosine of the angle between the vectors \(\vec{x}'\) and \(\vec{y}'\) will be equal to \(0\).

These formulas also suggest an extension of the centering transformation, namely centering and normalization

\[ z(\vec{x}) \equiv \frac{\vec{x}'}{||\vec{x}'||} = \frac{\vec{x} - \overline{x} \, \vec{1}}{||\vec{x} - \overline{x} \,\vec{1}||} \]

that is, we scale the centered data vector by its length. This is also called the \(z\)-transform. You may recognize the last expression in Equation 12.4 from statistics: it is the formula for Pearson’s correlation coefficient of the data sets \(x_i\) and \(y_i\). In other words, Pearson’s correlation coefficient equals the cosine of the angle between the centered data vectors. You may also recognize the expressions \(\sqrt{\sum(x_i - \overline{x})^2}\) and \(\sqrt{\sum(y_i - \overline{y})^2}\) in the denominator: up to a factor \(\sqrt{n-1}\) they are the standard deviations of the original data sets \(x_i\) and \(y_i\) (the factors cancel in the quotient). Finally, the term \(\sum (x_i - \overline{x})(y_i - \overline{y})\) (or \(\vec{x}' \cdot \vec{y}'\)) is, up to a factor \(n-1\), the covariance of the data vectors \(\vec{x}\) and \(\vec{y}\).

Figure 12.5: Centered and normalized transformation of the data in Figure 12.4. The corresponding vectors are also displayed.

By the way, the cosine of the angle between the centered data vectors \(\vec{x}'\) and \(\vec{y}'\), or the correlation coefficient of \(\vec{x}\) and \(\vec{y}\) of the altitude data set equals -0.866 or \(\theta \approx 150\) degrees: the centered data vectors \(\vec{x}'\) and \(\vec{y}'\) point in almost opposite directions.
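
A numerical check with synthetic data (the numbers below are random and purely illustrative): the cosine of the angle between the centered data vectors coincides with the Pearson correlation coefficient that NumPy reports.

```python
import numpy as np

rng = np.random.default_rng(1)
x = rng.normal(size=100)
y = -0.5 * x + rng.normal(scale=0.3, size=100)   # roughly linear relation, negative slope

xc = x - x.mean()                                # centered data vectors
yc = y - y.mean()

cos_theta = (xc @ yc) / (np.linalg.norm(xc) * np.linalg.norm(yc))
r = np.corrcoef(x, y)[0, 1]                      # Pearson correlation coefficient

print(cos_theta, r)                              # the two numbers agree
```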

From a mathematical point of view there is no fundamental difference between the two views of the data in Figure 12.3, or a similar data set with many more columns (variables). Statistically, however, there is such a fundamental difference, for example when considering centering the data. What is that fundamental difference?

Answer Taking the average of several measurements of temperature, for example, makes sense; it does not, however, make sense to average two numbers with different physical dimensions, like an altitude and a temperature. So it makes sense to center the columns, but not the rows. A statistician would say that altitude and temperature are not samples from the same distribution, so averaging them does not make sense.

The data matrix

Suppose that we have measured \(m\) variables in \(n\) samples. For example, we have measured the number of messenger RNAs of every gene in every sample. Each gene then has a corresponding data vector, a column vector of length \(n\). We could put the data column vectors for every gene in a matrix, in which every column corresponds to a data vector:

\[ \mat{X} = \begin{bmatrix} \vec{x}_1 & \vec{x}_2 & \ldots & \vec{x}_m\end{bmatrix} = \begin{bmatrix} x_{11} & x_{12} & \ldots & x_{1m} \\ x_{21} & x_{22} & \ldots & x_{2m}\\ \vdots & & \ldots & \vdots \\ x_{n1} & x_{n2} & \ldots & x_{nm} \end{bmatrix} \]

We call this matrix the data matrix of this data set. The data matrix is an important concept in multivariate data analysis. We can make the following observations about this matrix:

  • Every column \(x_{\bullet j}\) or \(\vec{x}_j\) in a data matrix represents values of a single variable measured in all samples
  • Every row \(x_{i \bullet}\) represents values of all variables measured in a single sample
  • Every entry \(x_{ij}\) in the data matrix is the value of a single measurement
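
In code, a data matrix is simply an \(n \times m\) array. The sketch below (filled with the four tail length/weight pairs that reappear in Example 12.2, but any numbers would do) shows how columns act as data vectors and rows as samples.

```python
import numpy as np

# n = 4 samples (rows), m = 2 variables (columns)
X = np.array([[2.5, 3.0],
              [3.2, 5.2],
              [2.9, 4.3],
              [3.0, 4.5]])

x1 = X[:, 0]    # data vector of the first variable (a column)
s3 = X[2, :]    # all variables measured on the third sample (a row)
print(X.shape, x1, s3)   # (4, 2), [2.5 3.2 2.9 3. ], [2.9 4.3]
```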

The variance-covariance matrix

We could calculate the mean-centered (but not normalized) version of every column data vector, \(\vec{x}_1'\), \(\vec{x}_2'\), etc. We would then obtain the matrix

\[ \mat{X}' = \begin{bmatrix} \vec{x}_1' & \vec{x}_2' & \ldots & \vec{x}_m'\end{bmatrix} \] Pre-multiplying this matrix by its transpose \(\mat{X}'^T\) we obtain the \(m \times m\) matrix

\[ \mat{X}'^T \mat{X}' = \begin{bmatrix} \vec{x}_1' \cdot \vec{x}_1' & \vec{x}_1' \cdot \vec{x}_2' & \ldots & \vec{x}_1' \cdot \vec{x}_m' \\ \vec{x}_2' \cdot \vec{x}_1' & \vec{x}_2' \cdot \vec{x}_2' & \ldots & \vec{x}_2' \cdot \vec{x}_m'\\ \vdots & & \ldots & \vdots \\ \vec{x}_m' \cdot \vec{x}_1' & \vec{x}_m' \cdot \vec{x}_2' & \ldots & \vec{x}_m' \cdot \vec{x}_m' \end{bmatrix} = \begin{bmatrix} ||\vec{x}_1'||^2 & \vec{x}_1' \cdot \vec{x}_2' & \ldots & \vec{x}_1' \cdot \vec{x}_m' \\ \vec{x}_2' \cdot \vec{x}_1' & ||\vec{x}_2'||^2 & \ldots & \vec{x}_2' \cdot \vec{x}_m'\\ \vdots & & \ldots & \vdots \\ \vec{x}_m' \cdot \vec{x}_1' & \vec{x}_m' \cdot \vec{x}_2' & \ldots & ||\vec{x}_m'||^2 \end{bmatrix} \]

Obviously, this is a symmetric matrix, since the inner product is a symmetric operator. This matrix is called the variance-covariance matrix. On the diagonal we find the squared norms \(||\vec{x}_i'||^2\) of the centered vectors, which are proportional to the variances of the data vectors \(\vec{x}_i\). The off-diagonal elements are the inner products of pairs of different centered vectors, which are proportional to the covariances of the original data vectors (the statistical definitions divide these sums by \(n-1\)).
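
As a sketch of the computation (reusing the small data matrix from the previous sketch): center every column, then multiply the transpose of the centered matrix by the centered matrix.

```python
import numpy as np

X = np.array([[2.5, 3.0],
              [3.2, 5.2],
              [2.9, 4.3],
              [3.0, 4.5]])

Xc = X - X.mean(axis=0)       # center every column (every data vector)
C  = Xc.T @ Xc                # m x m variance-covariance matrix X'^T X'

print(C)
print(np.allclose(C, C.T))    # True: the matrix is symmetric
```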

12.3 Regression

Often, we want to predict the value of a numerical variable \(y\) based on values of a set of other variables \(x_j\) (\(j = 1 \ldots m\)), i.e. we want to express \(y\) as a function of these predictor variables, \(y \approx f(x_1, \ldots , x_m)\). In statistics this is called a regression. Linear regressions are the simplest form of regression. In this case the function \(f\) has the form:

\[ f(x_1, \ldots , x_m) = b_0 + b_1 x_1 + \ldots + b_m x_m \]

where \(b_0\), \(b_1\) etc. are parameters whose values have to be suitably chosen (fitted) so that the predicted value of \(y\) is as close as possible to the measured value. The phrase “as close as possible” will be specified below. The way we fit these parameters is by measuring several values of \(y\) together with the corresponding values of the predictor variables. In general, if we have measured the variable \(y\) as well as the \(m\) variables \(x_{ij}\) in \(n\) samples, then we fit this whole data set to the same function such that:

\[ \begin{align} y_1 &\approx b_0 + b_1 x_{11} + \ldots + b_m x_{1m} \\ y_2 &\approx b_0 + b_1 x_{21} + \ldots + b_m x_{2m} \\ \vdots & \qquad \qquad \qquad \qquad \qquad \vdots \\ y_n &\approx b_0+ b_1 x_{n1} + \ldots + b_m x_{nm} \end{align} \]

or in matrix notation

\[ \vec{y} \approx \left[ \vec{1} | \mat{X} \right] \, \vec{b} \]

where \(\left[ \vec{1} | \mat{X} \right]\) is the data matrix augmented with a vector of ones \(\vec{1}\) as its first column. Instead of writing \(\left[ \vec{1} | \mat{X} \right]\) we will from now on simply use \(\mat{X}\) and understand that this is the data matrix with the column of ones augmented. If we represent the differences between the measured and predicted values by a vector \(\vec{e}\) of “error” terms then we can state the problem as

\[ \vec{y} = \mat{X} \vec{b} + \vec{e} \] We want the values in the error vector \(\vec{e}\) to be as small as possible. The criterion usually chosen is that the sum of squares of these differences between predictions and measurements should be minimized by the choice of parameters, i.e.

\[ \text{choose }\vec{b}\text{ such that }\sum e_i^2 = \vec{e}^T \vec{e} = ||\vec{e}||^2\text{ is minimized} \]

In other words, the squared norm of \(\vec{e}\), or equivalently the norm \(||\vec{e}||\), should be minimized. The question is, how do we determine good values for the elements \(b_i\), or, how do we determine the \(\vec{b}\) that minimizes \(||\vec{e}||\)? To answer this question we return to the fundamental vector spaces defined above.

The least squares solution to \(\vec{y} = \mat{X}\vec{b}\)

We are first going to look at this equation from the perspective of what we have learned before. In the equation \(\vec{y} = \mat{X}\vec{b}\) the vector \(\vec{y}\) and the matrix \(\mat{X}\) are known, and the vector of parameters \(\vec{b}\) is the unknown that has to be solved for.

When \(\mat{X}\) has many more rows than columns, which is the case in regression problems, there is in general no solution to this equation, because it is very unlikely that \(\vec{y}\) is a member of the column space \(\csp{\mat{X}}\) of \(\mat{X}\). A non-admissible vector \(\vec{y}\), i.e. one for which there is no solution, must lie at least partly in the left null space. A more detailed way of saying this is the following. If we write such a vector as the sum \(\vec{y} = \vec{y}_c + \vec{y}_l\) of vectors \(\vec{y}_c\) and \(\vec{y}_l\) that are members of the column space and the left null space, respectively, then for non-admissible vectors the part \(\vec{y}_l \neq \vec{0}\). The vectors \(\vec{y}_c\) and \(\vec{y}_l\) form orthogonal components of \(\vec{y}\), because they are members of orthogonal spaces.

To illustrate this geometrically, suppose \(\vec{y}\) is defined in \(\mathbb{R}^3\), and the column space of matrix \(\mat{X}\) is a 2-dimensional subspace of \(\mathbb{R}^3\), i.e. the equivalent of a plane in three dimensional space. The left null space \(\lsp{\mat{X}}\) must then be equivalent to the line that runs through the origin perpendicular to the plane (Figure 12.6). The equation \(\mat{X}\vec{b} = \vec{y}\) is not solvable for \(\vec{b}\) because \(\vec{y}\) is not a member of the column space of \(\mat{X}\), i.e. \(\vec{y}\) is not in \(\csp{\mat{X}}\). However, we could solve the equation \(\mat{X}\vec{b} = \vec{y}_c\), using the orthogonal projection \(\vec{y}_c\) of \(\vec{y}\) onto \(\csp{\mat{X}}\), which gives the shortest distance between \(\vec{y}\) and \(\csp{\mat{X}}\). Does that bring us any further? Yes, because this is exactly what we would otherwise call a least squares solution. To support this argument, we illustrate it by a concrete example.

Figure 12.6: Two orthogonal vector spaces, the plane (the column space \(\csp{\mat{X}}\) of a matrix \(\mat{X}\)) and the line (the left null space \(\lsp{\mat{X}}\) of that same matrix). These are complementary subspaces of \(\mathbb{R}^3\). The orthogonal projection of a vector \(\vec{y}\) in \(\mathbb{R}^3\) onto these spaces yields orthogonal vectors \(\vec{y}_c\) and \(\vec{y}_l\), such that \(\vec{y} = \vec{y}_c + \vec{y}_l\).

Example 12.2 (Least squares fitting) Suppose we have measurements of a response variable, weight (\(y\)) and a predictor variable tail length (\(x\)), of four mice. We observe that there is a relation between tail length and weight, and that we can probably predict the weight using the tail length. This is a significant finding which we intend to publish in the journal Nature. We want to know whether the relation between tail length and weight can be described by a linear equation:

\[ y = b_0 + b_1 x \]

We have measured the pairs of values \((y, x)\) in our four mice: \((3.0, 2.5)\), \((5.2, 3.2)\), \((4.3, 2.9)\) and \((4.5, 3.0)\). If the equation would describe the relation exactly then we could write for each of these pairs:

\[ \begin{align*} b_0 + b_1 2.5 & = 3.0 \\ b_0 + b_1 3.2 & = 5.2 \\ b_0 + b_1 2.9 & = 4.3 \\ b_0 + b_1 3.0 & = 4.5 \end{align*} \]

or in matrix notation:

\[ \begin{bmatrix} 1 & 2.5 \\ 1 & 3.2 \\ 1 & 2.9 \\ 1 & 3.0 \end{bmatrix} \begin{bmatrix} b_0 \\ b_1 \end{bmatrix} = \begin{bmatrix} 3.0 \\ 5.2 \\ 4.3 \\ 4.5 \end{bmatrix} \]

or in shorthand, \(\mat{X}\vec{b} = \vec{y}\). Here, the unknown vector to be solved is \(\vec{b}\), the vector of coefficients of the fitting equation. The matrix \(\mat{X}\) is the data matrix, here with only two columns. In general, \(\vec{y}\) will not be in the column space \(\csp{\mat{X}}\), and the equation is not solvable. However, it can be made solvable by substituting the orthogonal projection \(\vec{y}_c\) of \(\vec{y}\) onto \(\csp{\mat{X}}\). Does that help us any further? Yes, because the equation \(\mat{X}\vec{b} = \vec{y}_c\) solves the least squares problem. The sum of squared differences between the values predicted with the parameters that solve \(\mat{X}\vec{b} = \vec{y}_c\) and the measured values \(\vec{y}\) is the squared norm of the error \(\vec{e}\) (see Figure 12.6): it equals \(||\mat{X}\vec{b} - \vec{y}||^2 = ||\vec{e}||^2\). This error is smallest when we use the orthogonal projection of \(\vec{y}\) onto \(\csp{\mat{X}}\) to obtain the solution for \(\vec{b}\).

To calculate this solution, we use the fact that the error \(\mat{X} \vec{b} - \vec{y}\) lies in the left null space \(\lsp{\mat{X}}\), which is orthogonal to the column space \(\csp{\mat{X}}\). Hence, every column vector in \(\mat{X}\) must be orthogonal to the error, i.e. orthogonal to \(\mat{X} \vec{b} - \vec{y}\). This is the case if

\[ \begin{align*} \mat{X}^T (\mat{X} \vec{b} - \vec{y}) & = \vec{0} \\ \mat{X}^T \mat{X} \vec{b} - \mat{X}^T \vec{y} & = \vec{0} \\ \mat{X}^T \mat{X} \vec{b} & = \mat{X}^T \vec{y} \end{align*} \tag{12.5}\]

The last equation, known as the normal equations, can be solved by Gaussian elimination. Notice the similarity to Equation 12.2, where the orthogonal projection onto a single vector was calculated. Solving \(\mat{X}^T \mat{X} \vec{b} = \mat{X}^T \vec{y}\) is equivalent to solving the equation \(\mat{X} \vec{b} = \vec{y}_c\), which is guaranteed to have a solution because \(\vec{y}_c\) is in the column space of \(\mat{X}\).
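
Here is a minimal sketch of this calculation for the data of Example 12.2, using NumPy's linear solver in place of Gaussian elimination by hand; the resulting \(\vec{b}\) contains the intercept \(b_0\) and the slope \(b_1\).

```python
import numpy as np

X = np.array([[1, 2.5],
              [1, 3.2],
              [1, 2.9],
              [1, 3.0]])                # column of ones plus the tail lengths
y = np.array([3.0, 5.2, 4.3, 4.5])      # the weights

# normal equations: X^T X b = X^T y  (Equation 12.5)
b = np.linalg.solve(X.T @ X, X.T @ y)

e = y - X @ b                           # error vector
print(b)                                # [b0, b1], approximately [-4.78, 3.12]
print(X.T @ e)                          # ~0: the error is orthogonal to C(X)
```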

Example 12.3 (Continuation from Example 12.2) Similarly, we could also fit a parabola \(y = b_0 + b_1 x + b_2 x^2\). We would then have to find the least-squares solution of the matrix equation \[ \begin{bmatrix} 1 & 2.5 & 2.5^2 \\ 1 & 3.2 & 3.2^2 \\ 1 & 2.9 & 2.9^2 \\ 1 & 3.0 & 3.0^2 \end{bmatrix} \begin{bmatrix} b_0 \\ b_1 \\ b_2 \end{bmatrix} = \begin{bmatrix} 3.0 \\ 5.2 \\ 4.3 \\ 4.5 \end{bmatrix} \] In fact, using this approach we could solve the least squares problem for any function which is linear in the coefficients \(b_0\), \(b_1\), etc..
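
A sketch of the quadratic fit with the same four measurements: add the squared column to the data matrix and solve the least squares problem, here with NumPy's least-squares routine, which for this full-rank matrix gives the same result as the normal equations.

```python
import numpy as np

x = np.array([2.5, 3.2, 2.9, 3.0])
y = np.array([3.0, 5.2, 4.3, 4.5])

X = np.column_stack([np.ones_like(x), x, x**2])   # columns: 1, x, x^2

# least squares solution of X b = y
b, *_ = np.linalg.lstsq(X, y, rcond=None)
print(b)                                          # fitted coefficients b0, b1, b2
```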

Summary

  • Data sets consisting of measurements of \(m\) different variables on \(n\) objects can be regarded as vectors \((x_{i1}, x_{i2}, \ldots,x_{im})\) per object \(i\) or as data vectors \(\vec{x}_j\) per variable. The matrix \(\mat{X} = (x_{ij})\) whose columns are the data vectors is called a data matrix.
  • The centered transform \(\vec{x}_p'\) of a data vector \(\vec{x}_p\) equals \(\vec{x}_p - \overline{x_p}\vec{1}\) where \(\overline{x_p}\) is the average of the elements of \(\vec{x}_p\).
  • The centered and normalized transform of a data vector \(\vec{x}_p\) equals \(z(\vec{x}_p) = \frac{\vec{x}_p'}{||\vec{x}_p'||}\), i.e. the centered transform of \(\vec{x}_p\) divided by its length.
  • The cosine of the angle between two centered data vectors \(\vec{x}_p'\) and \(\vec{x}_q'\) equals the Pearson correlation coefficient of \(\vec{x}_p\) and \(\vec{x}_q\). It also equals the inner product of the \(z\)-transforms of these vectors.
  • The squared length \(||\vec{x}_p'||^2\) of the centered transform of \(\vec{x}_p\) is proportional to the variance of \(\vec{x}_p\) (it equals \(n-1\) times the sample variance), and the length \(||\vec{x}_p'||\) is proportional to its standard deviation.

12.4 Exercises

Inner product

Exercise 54.

Prove that the inner product is commutative and distributive, i.e. \(\vec{x} \cdot \vec{y} = \vec{y} \cdot \vec{x}\) and \(\vec{x} \cdot (\vec{a} + \vec{b}) = \vec{x} \cdot \vec{a} + \vec{x} \cdot \vec{b}\).

Norms

Exercise 55.

Prove that \(||a\vec{x}|| = |a| ||\vec{x}||\), where \(a\) is a (positive or negative) scalar.

Exercise 56.

What is the norm \(||z(\vec{x})||\) of a vector

\[ z(\vec{x}) = \frac{\vec{x}'}{||\vec{x}'||} \]

Exercise 57.

Give an expression purely in terms of vector arithmetic of \(\overline{x} = \frac{1}{n}\sum_{i=1}^n x_i\). Then use this expression to rewrite the \(z\)-transform.

Least squares

Exercise 58.

Discuss why the last line \(\mat{X}^T \mat{X} \vec{b} = \mat{X}^T \vec{y}\) of Equation 12.5 can be solved by Gaussian elimination. What are the dimensions of the products \(\mat{X}^T \mat{X}\) and \(\mat{X}^T \vec{y}\) when we have \(n\) samples and \(m\) variables? What else, besides compatibility of the dimensions of the matrix products, do we need for a solution to exist?

Applications

Exercise 59. Deconvolution using single cell RNA signatures

Deconvolution is a technique whereby amounts of individual components in a mixed sample with unknown composition are inferred from previously measured signatures on pure components. An example of an application of deconvolution is deducing the fractions of different cell types in a mixture of cell types from RNA-seq data. In this question the technique is demonstrated using a simple example.

Suppose we know that there are three types of cells, \(a\), \(b\) and \(c\), and we measure four different mRNAs in these cells, \(R_1 \ldots R_4\). We know that these cells have the following average numbers of each of the four RNA molecules (their “RNA-signatures”):

  • \(a\): \(5, 395, 200, 400\)
  • \(b\): \(300, 100, 540, 60\)
  • \(c\): \(500, 190, 10, 300\)

We measure the following numbers of molecules of these RNA’s in a sample:

\[ \begin{bmatrix} R_1 \\ R_2 \\ R_3 \\ R_4 \end{bmatrix} = \begin{bmatrix} 321000 \\ 145000 \\ 362000 \\ 171000 \end{bmatrix} \]

a. We want to estimate the unknown numbers of cells of types \(a\), \(b\) and \(c\) in this sample. First, express this problem as a (generally) unsolvable matrix equation, then as a solvable least squares problem.

b. Solve the problem either by hand or using a programming language like R, Python, matlab, or solve it using a (least squares) solver in a spreadsheet program. Note that a solver can also solve non-linear problems, whereas the least squares solution using matrix expressions only applies to linear systems of equations.

c. We could make the matrix equation solvable by just leaving out the measurements of one of the RNA molecules. We would then have three equations with three unknowns. Why is this “solution” not a good idea?