May 6th
Today I learned the definition of covariance from the Wikipedia page. The definition is as follows.
Definition. Given random variables $X$ and $Y,$ we define their "covariance'' as \[\op{cov}(X,Y):=\mathbb E\big[(X-\mathbb E[X])(Y-\mathbb E[Y])\big].\]
Roughly speaking, this measures how "related'' the two variables are. For example, the toy case $X=Y$ gives\[\mathbb E\big[(X-\mathbb E[X])(Y-\mathbb E[Y])\big]=\mathbb E\big[(X-\mathbb E[X])^2\big]=\op{var}(X),\]which just measures the variance; similarly, $Y=aX$ for a constant $a$ gives $\op{cov}(X,Y)=a\op{var}(X).$ Of course, there are some caveats here. For example, if $X$ is a constant, then $\op{cov}(X,Y)=0$ because $X-\mathbb E[X]=0.$ My interpretation of this is that we can trivially predict $X$ from $Y$ (it's constant), giving a kind of vacuous correlation.
We briefly remark that linearity of expectation lets us expand this out, assuming everything exists. Expanding, we see\[\op{cov}(X,Y)=\mathbb E\big[XY-X\mathbb E[Y]-Y\mathbb E[X]+\mathbb E[X]\mathbb E[Y]\big].\]Now, by linearity of expectation (and using the fact that $\mathbb E[X]$ and $\mathbb E[Y]$ are constants), this becomes\[\op{cov}(X,Y)=\mathbb E[XY]-\mathbb E[X]\mathbb E[Y]-\mathbb E[Y]\mathbb E[X]+\mathbb E[X]\mathbb E[Y],\]which is\[\op{cov}(X,Y)=\mathbb E[XY]-\mathbb E[X]\mathbb E[Y].\]This form is maybe conceptually easier because it directly measures how far we are from independence, in that independence implies $\mathbb E[XY]=\mathbb E[X]\mathbb E[Y].$
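To make this concrete, here is a small numerical sketch (the data and the coefficient $2.0$ are made up for illustration) checking that the definition and the expanded form agree on simulated data.

```python
import numpy as np

rng = np.random.default_rng(0)

# Made-up data: Y is a noisy linear function of X, so the two should covary.
x = rng.normal(size=100_000)
y = 2.0 * x + rng.normal(size=100_000)

# Definition: E[(X - E[X])(Y - E[Y])].
cov_def = np.mean((x - x.mean()) * (y - y.mean()))

# Expanded form: E[XY] - E[X] E[Y].
cov_expanded = np.mean(x * y) - x.mean() * y.mean()

print(cov_def, cov_expanded)  # both close to 2.0, which is cov(X, 2X + noise)
```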
Anyways, to align with this intuition, we have the following definition.
Definition. We say that two random variables $X$ and $Y$ are "uncorrelated'' if and only if $\op{cov}(X,Y)=0.$
Note that this really means uncorrelated in a linear sense because that's what $\op{cov}$ can detect. A quick example is to make $X\in\{-2,-1,1,2\}$ chosen with equal probability, and then we set $Y:=X^2.$ These are certainly not independent (namely, $0=\mathbb P(X\le-2,Y\le1) \lt \mathbb P(X\le-2)\mathbb P(Y\le1)$), but\[XY=X^3\in\{-8,-1,1,8\}\]all with equal probability, implying that $\mathbb E[XY]=0.$ Because $\mathbb E[X]=0$ as well, it follows that\[\op{cov}(X,Y)=\mathbb E[XY]-\mathbb E[X]\mathbb E[Y]=0-0\cdot\mathbb E[Y]=0.\]So indeed, $X$ and $Y$ are not independent, yet covariance was unable to detect this. I am under the impression this failure occurs because the relationship between $X$ and $Y$ is not linear.
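As a quick sanity check, here is a short Python enumeration of this example; since the distribution is finite and uniform, the computed expectations are exact (up to floating point).

```python
# X is uniform on {-2, -1, 1, 2} and Y = X^2.
xs = [-2, -1, 1, 2]
p = 1 / len(xs)  # each value of X has probability 1/4

E_X  = sum(p * x for x in xs)
E_Y  = sum(p * x**2 for x in xs)
E_XY = sum(p * x * x**2 for x in xs)  # E[XY] = E[X^3]

cov = E_XY - E_X * E_Y
print(E_X, E_Y, E_XY, cov)  # 0.0, 2.5, 0.0, 0.0 -- uncorrelated, yet Y = X^2
```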
As a final remark on why we should care about covariance, we note the following.
Proposition. The mapping $\op{cov}$ induces a symmetric bilinear pairing.
There are a couple of things we need to check, which we do in sequence. Fix $X,Y,Z$ random variables and $a$ a constant.
- For symmetry, we have to check $\op{cov}(X,Y)=\op{cov}(Y,X),$ which degenerates into the statement \[\mathbb E[XY]-\mathbb E[X]\mathbb E[Y]=\mathbb E[YX]-\mathbb E[Y]\mathbb E[X],\] which is true by, say, commutativity of multiplication.
- To be a bilinear pairing, we also need to check that $\op{cov}(aX,Y)=a\op{cov}(X,Y).$ Well, this expands into \[\mathbb E[aXY]-\mathbb E[aX]\mathbb E[Y]=a\mathbb E[XY]-a\mathbb E[X]\mathbb E[Y],\] but we can move the constant $a$ in and out by linearity of expectation.
- Finally, we also need to check that $\op{cov}(X+Y,Z)=\op{cov}(X,Z)+\op{cov}(Y,Z),$ which expands into needing \[\mathbb E[(X+Y)Z]-\mathbb E[X+Y]\mathbb E[Z]=\mathbb E[XZ]-\mathbb E[X]\mathbb E[Z]+\mathbb E[YZ]-\mathbb E[Y]\mathbb E[Z].\] Distributing $(X+Y)Z=XZ+YZ$ and then using linearity of expectation will finish this.
This is enough to show that we have a symmetric bilinear form, so we are done here. $\blacksquare$
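For what it's worth, these identities also hold exactly for the sample covariance, so a quick numerical check (with made-up samples standing in for $X,$ $Y,$ $Z,$ and the constant $a$) is easy to run.

```python
import numpy as np

def cov(a, b):
    """Sample version of E[AB] - E[A]E[B]."""
    return np.mean(a * b) - a.mean() * b.mean()

rng = np.random.default_rng(1)
# Made-up random variables, represented by large samples.
X, Y, Z = rng.normal(size=(3, 100_000))
a = 3.0

print(np.isclose(cov(X, Y), cov(Y, X)))                  # symmetry
print(np.isclose(cov(a * X, Y), a * cov(X, Y)))          # scaling
print(np.isclose(cov(X + Y, Z), cov(X, Z) + cov(Y, Z)))  # additivity
```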
We remark that the form is at least positive semi-definite because $\op{cov}(X,X)=\mathbb E\left[(X-\mathbb E[X])^2\right]\ge0$ from before. However, it is not positive definite because random variables are potentially funny: for example, taking $X$ to be the floor of a uniformly random real number in $[0,1]$ gives $\mathbb E[X]=0$ and then\[\op{cov}(X,X)=\op{var}(X)=0,\]but of course $X$ isn't identically $0$ because maybe it'll be $1$ with probability $0.$ So we don't have a legitimate inner product, but it's close enough. The real point here is to motivate looking at things like the covariance matrix, defined as\[\begin{bmatrix} \op{cov}(X_1,X_1) & \op{cov}(X_1,X_2) & \cdots & \op{cov}(X_1,X_n) \\ \op{cov}(X_2,X_1) & \op{cov}(X_2,X_2) & \cdots & \op{cov}(X_2,X_n) \\ \vdots & \vdots & \ddots & \vdots \\ \op{cov}(X_n,X_1) & \op{cov}(X_n,X_2) & \cdots & \op{cov}(X_n,X_n) \end{bmatrix}.\]Such a construction is natural when given a bilinear form; see also the discriminant of a number ring. The covariance matrix in particular is important for, say, principal component analysis (or so I am told).
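To see this matrix in action, here is a small Python sketch with made-up data (using numpy's `np.cov`, whose normalization differs slightly from the formulas above but doesn't affect the point): the eigenvalues come out nonnegative, with one essentially zero because $X_3$ is an exact linear combination of $X_1$ and $X_2.$

```python
import numpy as np

rng = np.random.default_rng(2)

# Made-up data: n = 3 random variables, each observed 10,000 times,
# with X3 an exact linear combination of X1 and X2.
X1 = rng.normal(size=10_000)
X2 = rng.normal(size=10_000)
X3 = X1 - X2
data = np.vstack([X1, X2, X3])

Sigma = np.cov(data)                    # 3x3 matrix of cov(X_i, X_j)
eigenvalues = np.linalg.eigvalsh(Sigma)
print(Sigma)
print(eigenvalues)  # all >= 0 (positive semi-definite); one is ~0 from the dependence
```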