Welcome to the exercise repository for day 3 of our course. To help you, we have prepared unit tests. Run them with `nox -s test`. While coding, use `nox -s lint` and `nox -s typing` to check your code. Autoformatting help is available via `nox -s format`.
Feel free to read more about nox at https://nox.thea.codes/en/stable/ .
Today's goal is to get more familiar with three important concepts: linear regression, fitting a polynomial of higher order, and regularization. You will be given some data in the form of pairs of (a, b)-values, so for each a-value there is one b-value. The main idea is to find a simple function (here, a polynomial) that best explains our data. That means that if we plug in an a-value, the result should be close to the corresponding b-value.
A general polynomial of order $n$ has the form $$p(a) = c_1 + c_2 a + c_3 a^2 + \dots + c_n a^{n-1}.$$ Evaluating it at all $m$ data points $a_1, \dots, a_m$ and collecting the coefficients in a vector $\mathbf{c}$ leads to the linear system $\mathbf{A}_n \mathbf{c} \approx \mathbf{b}$, with the point matrix $\mathbf{A}_n$ defined below. Or in short: we look for the coefficients that bring $\mathbf{A}_n \mathbf{c}$ as close as possible to $\mathbf{b}$. The optimal coefficients are given by $$\mathbf{c} = \mathbf{A}_n^{\dagger} \mathbf{b},$$ where $\mathbf{A}_n^{\dagger}$ denotes the (Moore-Penrose) pseudoinverse of $\mathbf{A}_n$. For linear regression, the polynomial will be of order $n = 2$ and will look like this: $$p(a) = c_1 + c_2 a.$$ This is just a straight line, and there are only two coefficients, $c_1$ and $c_2$.
The line `b = pandas.read_csv('./data/noisy_signal.tab')` is used to load a noisy signal. The line `x_axis = np.linspace(0, 1, num=len(b_noise))` will provide you with the corresponding x-values (these are the a-values from above and from the lecture).
This is some artificial data that serves as a means to try out the concepts you have learned about in the lecture.
The first part will be concerned with modeling this signal using polynomials.
Linear regression is usually a good first step.
- Start by implementing the function `set_up_point_matrix` from the `src/regularization.py` module. The function should produce polynomial-coordinate matrices $\mathbf{A}_n$ of the form: $$ \mathbf{A}_n = \begin{pmatrix} 1 & a_1^1 & a_1^2 & \dots & a_1^{n-1} \\ 1 & a_2^1 & a_2^2 & \dots & a_2^{n-1} \\ 1 & a_3^1 & a_3^2 & \dots & a_3^{n-1} \\ \vdots & \vdots & \vdots & \ddots & \vdots \\ 1 & a_m^1 & a_m^2 & \dots & a_m^{n-1} \\ \end{pmatrix} $$
- Go to the main function and use the function you just implemented to create the point matrix $\mathbf{A}$ for $n = 2$.
- Now, $$\mathbf{A}_2^{\dagger}\mathbf{b} = \mathbf{c} = \begin{pmatrix} c_1 \\ c_2 \\ \end{pmatrix} $$ will produce the coefficients for a straight line.
- Evaluate your first-degree polynomial via $c_1 + c_2 \cdot x$ and plot the result as well as the original data using matplotlib.pyplot's `plot` function (see the sketch after this list).
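A minimal sketch of these steps, assuming numpy and matplotlib. The helper below mirrors what `set_up_point_matrix` is expected to return, but its exact signature in `src/regularization.py` may differ, and the synthetic data and variable names are purely illustrative:

```python
import numpy as np
import matplotlib.pyplot as plt


def set_up_point_matrix(axis: np.ndarray, n: int) -> np.ndarray:
    """Stack the powers a_i^0 ... a_i^(n-1) of every point a_i into an (m, n) matrix."""
    return np.stack([axis**power for power in range(n)], axis=-1)


# Illustrative stand-in; in the exercise, b_noise comes from ./data/noisy_signal.tab.
x_axis = np.linspace(0, 1, num=300)
b_noise = 2.0 * x_axis + 0.1 * np.random.randn(len(x_axis))

A_2 = set_up_point_matrix(x_axis, n=2)    # shape (300, 2)
coeffs = np.linalg.pinv(A_2) @ b_noise    # c = A^† b
line = A_2 @ coeffs                       # c_1 + c_2 * x at every point

plt.plot(x_axis, b_noise, ".", label="data")
plt.plot(x_axis, line, label="straight-line fit")
plt.legend()
plt.show()
```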
The straight line above is insufficient to model the data. So perform the very same steps as above, but change the order of the polynomial to n=300 (this sets up a square matrix, since we have 300 data points):
- Set up the point matrix by setting n=300.
- Estimate the coefficients via $$\mathbf{A}^{\dagger}\mathbf{b} = \mathbf{x}_{\text{fit}}.$$
- Having estimated the coefficients, $\mathbf{A} \mathbf{x}_{\text{fit}}$ computes the function values. Plot the original points and the function values using matplotlib. What do you see?
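Under the same assumptions as the sketch above, the higher-order fit only changes the value of n:

```python
A_300 = set_up_point_matrix(x_axis, n=300)   # square (300, 300) point matrix
x_fit = np.linalg.pinv(A_300) @ b_noise      # A^† b = x_fit

plt.plot(x_axis, b_noise, ".", label="data")
plt.plot(x_axis, A_300 @ x_fit, label="n=300 fit")
plt.legend()
plt.show()
```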
Unfortunately, the fit is not ideal. The polynomial is too complex and tracks the noise. The singular value decomposition (SVD) can help! Recall that the SVD turns a matrix $\mathbf{A}$ into the form $$\mathbf{A} = \mathbf{U} \boldsymbol{\Sigma} \mathbf{V}^T,$$ where $\mathbf{U}$ and $\mathbf{V}$ are orthogonal matrices and $\boldsymbol{\Sigma}$ carries the singular values $\sigma_1, \dots, \sigma_m$ on its diagonal. In the SVD form, computing the pseudoinverse is simple! Swap $\mathbf{U}$ and $\mathbf{V}$ and replace each of the $m$ singular values with its inverse $1/\sigma_i$. This results in the matrix $$\mathbf{A}^{\dagger} = \mathbf{V} \boldsymbol{\Sigma}^{\dagger} \mathbf{U}^T.$$
A solution to the overfitting problem is to filter the singular values.
The idea is that small singular values often correspond to directions in the data where noise dominates. We therefore want to create a filter matrix that gets rid of singular values if they are too small (i.e. if they fall below a threshold $\epsilon$). Roughly speaking, multiplication by this filter matrix keeps the contributions of the large singular values and suppresses those of the small ones.
Apply the regularization by computing: $$\mathbf{x}_{\text{reg}} = \mathbf{V} \mathbf{F} \mathbf{U}^T \mathbf{b},$$ with the filter matrix $\mathbf{F}$ having the diagonal entries $f_i = 1/\sigma_i$ if $\sigma_i > \epsilon$ and $f_i = 0$ otherwise.
Setting n=300 turns $\mathbf{A}$ into a square matrix. In this case, the zero block in the $\boldsymbol{\Sigma}$-matrix disappears and you don't have to worry about transposing the singular value matrix when building $\mathbf{F}$.
To sum it up, your tasks are:
1. Compute the SVD of $\mathbf{A}$.
2. Compute the diagonal for the filter matrix and turn it into a matrix.
3. Estimate the regularized coefficients by applying the formula above.
4. Plot the result.

Perform steps 2. - 4. for $\epsilon$ equal to 0.1, 1e-6, and 1e-12. A minimal sketch of these steps follows below.
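One possible implementation of the filtering for the square n=300 case, continuing with `A_300`, `b_noise`, and `x_axis` from the sketches above; the loop and variable names are illustrative:

```python
# A = U Σ V^T; sigma holds the singular values in descending order.
u, sigma, vt = np.linalg.svd(A_300)

for epsilon in (0.1, 1e-6, 1e-12):
    # Invert only the singular values above the threshold, drop the rest.
    filtered = [1.0 / s if s > epsilon else 0.0 for s in sigma]
    filter_mat = np.diag(filtered)             # square case: no zero block to worry about
    x_reg = vt.T @ filter_mat @ u.T @ b_noise  # V F U^T b
    plt.plot(x_axis, A_300 @ x_reg, label=f"epsilon={epsilon}")

plt.plot(x_axis, b_noise, ".", label="data")
plt.legend()
plt.show()
```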
Another solution to the overfitting problem is to reduce the complexity of the model. To assess the quality of a polynomial fit to the data, compute and plot the Mean Squared Error (the MSE measures how close the regression predictions are to the data points) for every polynomial degree up to 20. So, as before:
- Set up the point matrix for the current degree from 1 to 20.
- Estimate the coefficients.
- Compute the predictions.
- Calculate the MSE.
MSE can be calculated using the following equation, where $N$ is the number of samples, $y_i$ is the original point, and $\hat{y_i}$ is the predicted output: $$MSE=\frac{1}{N} \sum_{i=1}^{N} (y_i-\hat{y_i})^2$$
- Plot the MSE against the degree.
- Are the degree of the polynomial and the MSE linked? From the plot, estimate the optimal polynomial degree and fit a polynomial of this specific degree (see the sketch below).
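A sketch of the degree sweep, again re-using the helper and data from the first sketch; note that a polynomial of degree d needs d + 1 columns in the point matrix under the convention used here:

```python
degrees = range(1, 21)
mse_values = []
for degree in degrees:
    A_deg = set_up_point_matrix(x_axis, n=degree + 1)  # degree d -> d + 1 coefficients
    coeffs = np.linalg.pinv(A_deg) @ b_noise
    prediction = A_deg @ coeffs
    mse_values.append(np.mean((b_noise - prediction) ** 2))

plt.plot(list(degrees), mse_values)
plt.xlabel("polynomial degree")
plt.ylabel("MSE")
plt.show()
```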
Now we are ready to deal with real data! Feel free to use your favorite time series data or work with the Rhine level data we provide.
The file `./data/pegel.tab` contains the Rhine water levels measured in Bonn over the last 100 years. Data source: https://pegel.bonn.de. The `src/pegel_bonn.py` file already contains code to pre-load the data for you.
The Rhine level measurements will be your new vector $\mathbf{b}$.
- Generate a matrix A with n=2 using the timestamps for the data set as your x-values.
- Compute $\mathbf{A}^{\dagger}\mathbf{b}$ to estimate the coefficients of the line.
- Evaluate your polynomial and plot the result.
- Compute the zero crossing of the regression line: when do the regression line and the x-axis intersect? Or in other words: on which day will the Rhine water level be at 0 cm?
Hint: Plug $y = 0$ into the equation of your line with the estimated coefficients and solve for the date $x$ (a small sketch follows below).
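Assuming the fitted line coefficients are stored intercept-first as `coeffs_line = [c_1, c_2]` (an illustrative name), the zero crossing can be sketched like this:

```python
# The line reads: level ≈ c_1 + c_2 * t, so 0 = c_1 + c_2 * t gives t = -c_1 / c_2.
c_1, c_2 = coeffs_line
t_zero = -c_1 / c_2
print(f"The regression line crosses zero at t = {t_zero}")
# t_zero lives on the same time axis used to build the point matrix,
# so it still has to be mapped back to a calendar date.
```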
Re-using the code you wrote for the proof-of-concept task, fit a polynomial of degree 20 to the data. Before plotting, have a closer look at `datetime_stamps` and its values, and scale the axis appropriately (a sketch follows after the list below).
- Scale the x-axis.
- Set up the point matrix for the scaled x-axis with n=20.
- Compute the coefficients.
- Evaluate the polynomial.
- Plot the result.
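One way to sketch the scaling and the fit, assuming `datetime_stamps` (provided by `src/pegel_bonn.py`) is convertible to a numeric array and `b` holds the Rhine levels; the scaling to [0, 1] is one reasonable choice, not the only one:

```python
# Map the raw timestamps into [0, 1] so that the powers a^0 ... a^19 stay numerically sane.
t = np.asarray(datetime_stamps, dtype=float)
t_scaled = (t - t.min()) / (t.max() - t.min())

A_20 = set_up_point_matrix(t_scaled, n=20)
coeffs_20 = np.linalg.pinv(A_20) @ b

plt.plot(t_scaled, b, ".", label="Rhine levels")
plt.plot(t_scaled, A_20 @ coeffs_20, label="n=20 fit")
plt.legend()
plt.show()
```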
Focus on the data from the year 2000 onward and filter the singular values. We will again use a degree of 20.
Matrix $\mathbf{A}$ is not square in this case, because the degree is smaller than the number of data points! Consequently, a zero block appears in your singular value matrix, and when computing the pseudoinverse from the SVD, $\boldsymbol{\Sigma}^{\dagger}$ (and with it the filter matrix) takes the transposed shape of $\boldsymbol{\Sigma}$.
1. Compute the SVD of the point matrix from the previous task.
2. Compute the filter matrix. Hint: Remember the zero block! You need degree-many rows and number-of-datapoints-many columns!
3. Estimate the regularized coefficients by applying the formula from before.
4. Evaluate the regularized polynomial and plot the results.

Perform steps 2. - 4. for $\epsilon$ equal to 0.1, 1e-3, and 1e-9. A sketch of the non-square case follows below.
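In the sketch below, `A_2000`, `b_2000`, and `t_2000` are illustrative names for the point matrix, level vector, and scaled time axis restricted to the data from 2000 onward; everything else follows the formula from before:

```python
u, sigma, vt = np.linalg.svd(A_2000)   # u: (m, m), sigma: (n,), vt: (n, n) with n = 20
m, n = A_2000.shape

for epsilon in (0.1, 1e-3, 1e-9):
    # The zero block: the filter needs n (= degree) rows and m (= datapoints) columns.
    filter_mat = np.zeros((n, m))
    for i, s in enumerate(sigma):
        filter_mat[i, i] = 1.0 / s if s > epsilon else 0.0
    x_reg = vt.T @ filter_mat @ u.T @ b_2000   # V F U^T b
    plt.plot(t_2000, A_2000 @ x_reg, label=f"epsilon={epsilon}")

plt.plot(t_2000, b_2000, ".", label="data")
plt.legend()
plt.show()
```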