Sunday, February 24, 2019

Linear regression: Polynomial basis function

In the previous post, a first-order model fitted the data like this:

To fit the known data better, we need more information and a more complex model. Here is the first-order model used in the plot above:
$$y=ax+b$$
Let's turn it into a second-order model:
$$y=ax^{2}+bx+c$$
Then, rewrite the equation into matrix form:
$$\begin{bmatrix}y_{0}\\ \vdots \\y_{n}\end{bmatrix}=\begin{bmatrix}x_{0}^{2}&x_{0}&1\\ \vdots &\vdots &\vdots \\x_{n}^{2}&x_{n}&1\end{bmatrix}\begin{bmatrix}a\\b\\c\end{bmatrix}$$ where $0,\dots,n$ indexes the data points.
Here we define the basis function as:
$$\phi(x)=\begin{bmatrix}x^{2} & x & 1\end{bmatrix}$$ where the constant $1$ is the bias term.
So we can still solve it with the least-squares solution:
$$y = X\omega\Rightarrow\omega=(X^{T}X)^{-1}X^{T}y$$
Here I've got the result: $\omega=\begin{bmatrix}-0.2954\\5.3854\\-17.4236\end{bmatrix}$
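As a rough sketch of how such a fit could be computed with NumPy (the data below is made up for illustration; the post's actual data set is not shown):

```python
import numpy as np

# Hypothetical data for illustration; the post's actual data set is not shown.
rng = np.random.default_rng(0)
x = np.linspace(0, 20, 150)
y = -0.3 * x**2 + 5.4 * x - 17.4 + rng.normal(scale=2.0, size=x.shape)

# Second-order polynomial basis: phi(x) = [x^2, x, 1]
X = np.column_stack([x**2, x, np.ones_like(x)])

# Least-squares solution w = (X^T X)^{-1} X^T y.
# np.linalg.lstsq is numerically safer than forming the inverse explicitly.
w, *_ = np.linalg.lstsq(X, y, rcond=None)

# Residual sum of squares (RSS) on the training data
rss = np.sum((y - X @ w) ** 2)
print(w, rss)
```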
From the plot, we can see that the RSS (residual sum of squares) has decreased and the curve fits the data better. Increasing the order increases the variance of the model and further reduces the training error, i.e. the error calculated on the training data.
Let's increase the order of the basis function and see what happens:

A more complex model should fit the data better; however, the RSS started to increase enormously once the order went above 6. What happened?
This is because the matrix $X^{T}X$ becomes nearly non-invertible as the order increases: its eigenvalues get close to zero. Adding some regularization to the diagonal elements solves this problem. This is called "ridge regression", since the solution surface of a nearly non-invertible $X^{T}X$ looks like a ridge in 3D space, whereas a desirable objective function should be convex so that we can find an optimal solution.
So here is what ridge regression does:
$$\omega=(X^{T}X+\lambda I)^{-1}X^{T}y$$
We choose $\lambda = 0.1$ or smaller; we don't want this term to dominate the matrix.
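A minimal sketch of the ridge solution, reusing the NumPy setup from the snippet above (the order and $\lambda$ below are illustrative choices, not values taken from the post):

```python
def poly_design(x, order):
    """Design matrix with polynomial basis [x^order, ..., x, 1]."""
    return np.column_stack([x**p for p in range(order, -1, -1)])

def ridge_fit(X, y, lam=0.1):
    """Ridge regression: w = (X^T X + lambda*I)^{-1} X^T y."""
    d = X.shape[1]
    return np.linalg.solve(X.T @ X + lam * np.eye(d), X.T @ y)

# Example: a 6th-order fit, which is poorly conditioned without regularization
X6 = poly_design(x, 6)
w_ridge = ridge_fit(X6, y, lam=0.1)
```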
Here's the result of ridge regression:

Observe the plots: increasing the order fits the data better and better. Isn't that great? Why don't we just fit a model with a high-order basis function? The catch is that the model fits only these training data well. What about new data coming in? Does the model fit them well too? Usually not.
Therefore, when training a model on a data set, the data is usually split into two sets: training data and validation data. The model is fit with the training data, and the error is tested with the validation data. The important part is that the validation data is not taken into account while fitting the model. So let's see what the validation error of the high-order models I just trained looks like. I picked the last 50 points as my validation set, which gives 100 training points and 50 validation points. Usually the split should be picked randomly to simulate a more general situation. Here's the result:
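A sketch of how this split and the training-versus-validation comparison could be computed, reusing the hypothetical data and the poly_design / ridge_fit helpers from the snippets above:

```python
# Last 50 points as validation, first 100 as training (as described above);
# a random split would usually be preferable.
x_train, y_train = x[:100], y[:100]
x_val, y_val = x[100:], y[100:]

for order in range(1, 9):
    Xtr = poly_design(x_train, order)
    Xva = poly_design(x_val, order)
    w = ridge_fit(Xtr, y_train, lam=0.1)
    rss_train = np.sum((y_train - Xtr @ w) ** 2)
    rss_val = np.sum((y_val - Xva @ w) ** 2)
    print(order, rss_train, rss_val)
```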

The plot shows that the training error decreases as the order increases. However, starting at order 3, the validation error begins to rise. This phenomenon is called "overfitting": the model fits the training data so well that it performs poorly on the validation data, losing its ability to generalize. As a result, a higher-order model gives us more capacity to fit the data (high variance, low bias), but that also comes at a cost.

