Fitting a Line to Data

import numpy as np
import matplotlib.pyplot as plt
import matplotlib as mpl
mpl.rcParams['font.size'] = 16
mpl.rcParams['legend.fontsize'] = 14
mpl.rcParams['figure.dpi'] = 150

Linear Least Squares with Matrices

Let’s fit a line of the form:

y=β0x+β1

Assuming that both y and x are vectors with n elements, we can rewrite this equation in matrix form:

[Y1Y2Yn]=[β0+β1X1β0+β1X2β0+β1Xn]
[Y1Y2Yn]Yn×1=[1X11X21Xn]Xn×2[β0β1]β2×1
Y=Xβ

Here, I’ll simulate some data with n=25:

# generate some random, linear data with random error
n = 25
x = np.random.uniform(size=n)
y = 2.4*x + 0.3 + np.random.normal(0,0.1,size=n)
# plot the data
plt.figure()
plt.plot(x,y,'o',label = 'data',mfc='gray',markeredgecolor='k',alpha=0.5)
plt.plot(x,2.4*x + 0.3,label = 'truth',color='crimson')
plt.xlabel('x')
plt.ylabel('y')
plt.legend()
<matplotlib.legend.Legend at 0x7fe37733f490>
../_images/fitting_a_line_to_data_6_1.png

We can now directly solve for the vector β after reshaping our data to the matrix shapes above:

ˆβ=(XTX)1XTY
# first reshape to make these the proper matrix forms
Y = y.reshape((-1,1)) # 100 x 1
# add a vector of ones as a column to X (see above equation)
vector_of_ones = np.ones(n)
X = np.column_stack((vector_of_ones, x))
print('X shape: ', X.shape)
print('Y shape: ', Y.shape)
X shape:  (25, 2)
Y shape:  (25, 1)
beta_hat = np.linalg.inv(X.T @ X) @ X.T @ Y
print('Beta Shape: ', beta_hat.shape)
Beta Shape:  (2, 1)
beta_hat
array([[0.32544314],
       [2.38931179]])
# generate a best fit line
x_fit = np.linspace(0,1,10)
y_fit = beta_hat[1]*x + beta_hat[0]
# plot the data, best fit line, and ground truth
plt.figure()
plt.plot(x,y,'o',label = 'data',mfc='gray',markeredgecolor='k',alpha=0.33)
plt.plot(x,2.4*x + 0.3,label = 'truth',c='crimson')
plt.plot(x,y_fit,label='least squares',c='darkblue',ls='dotted')
plt.xlabel('x')
plt.ylabel('y')
plt.legend()
<matplotlib.legend.Legend at 0x7fe377a4c810>
../_images/fitting_a_line_to_data_14_1.png

Weighted Least Squares

What if our data have measurement errors which are not constant? How do we weight our points properly?

We can slightly modify our expression for β above in the case of known measurement errors. For more math, see Section 2.2 of: https://www.stat.cmu.edu/~cshalizi/mreg/15/lectures/24/lecture-24–25.pdf.

Another great resource: https://adrian.pw/blog/fitting-a-line/

The result is that we can define a weight matrix W=diag(1/σi) where σi are the measurement errors for the ith measurement. In this case, we have:

ˆβ=(XTWX)1XTWy.
# generate some random, linear data with random error
n = 25
x = np.random.uniform(size=n)

# evaluate the true model at the given x values
y = 1.17*x + 0.1

# Heteroscedastic Gaussian uncertainties only in y direction
y_err = np.random.uniform(0.05, 0.15, size=n) # randomly generate uncertainty for each datum
y = np.random.normal(y, y_err) # re-sample y data with noise
datastyle = dict(mec='k', linestyle='none',ecolor='k', marker='o', mfc='none', label='data',alpha=0.6,elinewidth=1)
plt.figure()
plt.errorbar(x, y, y_err, **datastyle)
plt.xlabel('$x$')
plt.ylabel('$y$')
plt.tight_layout()
../_images/fitting_a_line_to_data_19_0.png
# solve for beta after defining our matrices
# data matrix
Y = y.reshape((-1,1)) # 50 x 1

# design matrix
vector_of_ones = np.ones(n)
X = np.column_stack((vector_of_ones, x))

# weight matrix
W = np.diag(1/y_err**2)

# solve for beta
beta_hat_weighted = np.linalg.inv(X.T @ W @ X) @ X.T @ W @ Y
beta_hat_unweighted = np.linalg.inv(X.T @ X) @ X.T @ Y
print(f'Weighted Least Squares:\t  y = {np.round(float(beta_hat_weighted[1]),4)}x+{np.round(float(beta_hat_weighted[0]),4)}')
print(f'Unweighted Least Squares: y = {np.round(float(beta_hat_unweighted[1]),4)}x+{np.round(float(beta_hat_unweighted[0]),4)}')
Weighted Least Squares:	  y = 1.175x+0.0903
Unweighted Least Squares: y = 1.1302x+0.103
# generate a best fit line
x_fit = np.linspace(0,1,10)
y_fit_weighted = beta_hat_weighted[1]*x_fit + beta_hat_weighted[0]
y_fit_unweighted = beta_hat_unweighted[1]*x_fit + beta_hat_unweighted[0]
cmap = plt.get_cmap('jet')
dictstlye = dict(mec='k', linestyle='none',ecolor='k', marker='o', mfc='none', label='data',alpha=0.6,elinewidth=1)
plt.figure()
plt.errorbar(x, y, y_err, **dictstlye)
plt.plot(x_fit,y_fit_weighted,color=cmap(0.2/1),label='weighted least square fit',lw=2)
plt.plot(x_fit,y_fit_unweighted,color=cmap(0.8/1),label='least square fit',lw=2)
plt.xlabel('$x$')
plt.ylabel('$y$')
plt.legend(loc='upper left')
plt.tight_layout()
../_images/fitting_a_line_to_data_23_0.png