Linear regression: learning by simulation

Damie Pak

Simulate, simulate, simulate

Running a linear regression is relatively easy. Everyone (and I mean everyone) has happily typed “lm()” into R. However, while running a regression on a data set is easy, how do you know you’re right when doing statistical analyses? One can scoff that it’s impossible to know if you’re right. A very valid point! But we can use the mathematical formulation of linear regression to simulate data where we know the truth, and then run the statistical analysis to see whether we recover it. This document is heavily inspired by Using Simulation to Study Linear Regression by LeRoy A. Franklin, which I encourage everyone to read.

Linear regression

Simple yet very powerful, a linear regression is described as: \[ Y_i = \beta_0 + \beta_1 X_i + \epsilon, \]

where \(Y_i\) are the outcomes, \(X_i\) are the observations, \(\beta_0\) is the intercept, and \(\beta_1\) is the slope describing the relationship between \(Y\) and \(X\). The error term, \(\epsilon\), describes how the observed values deviate from the true values. Remember, we are usually trying to estimate what \(\beta_0\), \(\beta_1\), and \(\sigma^2\) are.

Using the above equation, we can now simulate our own data:

\[ y_i = 3 + 2x_i + \epsilon . \]

Specifically, we set \(\beta_0 = 3\) and \(\beta_1 = 2\). For the error term (\(\epsilon\)), an important assumption is that it is normally distributed with mean 0 and variance \(\sigma^2\) (\(\sigma\) is the standard deviation and \(\sigma^2\) is the variance; remember, the standard deviation is the square root of the variance!). In other words, \(\epsilon \sim {N(0,\sigma^2)}\). But what does that mean? Let’s write some simple code to generate normal distributions with mean 0 and different standard deviations.

x_interest = seq(-6,6, length = 1000)
mean_interest = 0 #the mean is 0
sigma_normal = c(0.5,1,2) #the standard deviations that we're interested in
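
The plotting code isn’t shown here; below is a minimal sketch of how the density curves could be computed with dnorm and drawn with base R graphics (the object densities and the plotting choices are just for illustration).

densities <- sapply(sigma_normal, function(s) dnorm(x_interest, mean = mean_interest, sd = s))
matplot(x_interest, densities, type = "l", lty = 1,
        xlab = "x", ylab = "density") #one curve per standard deviation
legend("topright", legend = paste("sd =", sigma_normal), lty = 1, col = 1:3)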

Normal distributions with mean 0 and different standard deviations. As the variance increases, the distribution becomes wider.

The normal distributions are centered on the same mean, but the variances are quite different! If we sample from a narrower distribution, the numbers will not deviate far from 0. However, if we have a higher variance (i.e., a wider distribution), then the numbers can deviate further from 0.

With this in mind, we can add the error term to our equation and set its variance to 1:

\[ y_i = 3 + 2x_i + \epsilon, \qquad \epsilon \sim N(0, 1). \]

One way to think about the above equation is that we have the deterministic part, \(3 + 2 x_i\), and we’re adding noise that is described by a normal distribution. So, for example, let’s say the inputs I’m interested in run from 1 to 10. For each observation, we sample from the normal distribution and add that value to the y-output. One assumption of linear regression is that the variance of the residuals is the same for every observation. In our simulation, the noise we are adding all comes from the same normal distribution. If the variance were different for each observation (for example, what if for an input of 3 the error term came from a normal distribution with mean 0 and a standard deviation of 2?), then our model would be unable to accurately predict what we observed.

I can sample from the normal distribution by using the R function rnorm. With rnorm I am telling R: “Hey, with a normal distribution of mean 0 and a standard deviation of 1… can you give me ten random values please?” (Be nice to machines to survive the AI uprising! I heard a lot of people say please and thank you to ChatGPT.)

set.seed(24601)
rnorm_generated <- data.frame(num = rnorm(10, mean = 0, sd = 1),
                              frame = seq(1,10)) #frame is just an index to identify each draw
print(rnorm_generated)
##           num frame
## 1  -0.2561510     1
## 2   0.5834819     2
## 3   0.4735194     3
## 4  -1.8780212     4
## 5  -0.2743209     5
## 6  -0.6362024     6
## 7   0.2569914     7
## 8  -0.2572872     8
## 9  -1.2291214     9
## 10 -3.2222501    10

Let’s visually see what is happening!

What does this say? For each observation, I’m going to add the value that I sampled from the normal distribution to the deterministic part. Let’s see what this looks like!

The noise is sampled from the same normal distribution!

So the grey line and points represent the deterministic part (\(3 + 2x\)), and the colored points represent the output after the error term is added! We have now simulated data with the underlying process \(3 + 2x + \epsilon\).
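
(The construction of the data frame used below isn’t shown explicitly in the text; here is a minimal sketch consistent with the seed and the draws printed above, using the column names x_input and y_output_error that the later code expects.)

set.seed(24601) #same seed as before, so the same ten draws
x_input <- seq(1, 10) #the inputs
y_deterministic <- 3 + 2 * x_input #the deterministic part
y_output_error <- y_deterministic + rnorm(10, mean = 0, sd = 1) #add the noise
regression_df <- data.frame(x_input, y_output_error)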

Mere man

Without our omniscient powers, we would only see the colored points! Because we simulated the underlying dynamics, we know the truth, but in the world of man we now have to figure out what \(\hat{\beta}_0\) and \(\hat{\beta}_1\) are.

Least squares

The typical method for solving linear regressions is to fit a line that minimizes the vertical distances between the line and the observations. We do this by minimizing the sum of squared errors (least squares). It’s simple data fitting, and all we have to do is some calculation.

You can skip it if you just want the answers, but a good coding exercise would be to code this yourself.
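
Concretely, the function below implements the standard closed-form least-squares estimates,

\[ \hat{\beta}_1 = \frac{n\sum_i x_i y_i - \sum_i x_i \sum_i y_i}{n\sum_i x_i^2 - \left(\sum_i x_i\right)^2}, \qquad \hat{\beta}_0 = \frac{\sum_i y_i - \hat{\beta}_1 \sum_i x_i}{n}, \]

where \(n\) is the number of observations.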

calculate_least_squares <- function(df){
        # Closed-form least-squares estimates of the intercept and slope
        N = length(df$x_input)
        sum_xy = sum(df$x_input * df$y_output_error) #sum of x*y
        sum_x_sum_y = sum(df$x_input) * sum(df$y_output_error) #(sum of x)*(sum of y)
        sum_x2 = sum(df$x_input^2) #sum of x^2
        sum_x_sq = sum(df$x_input)^2 #(sum of x)^2

        sum_y = sum(df$y_output_error)

        Slope = round(((N * sum_xy) - sum_x_sum_y)/
                      ((N * sum_x2) - sum_x_sq), 2)

        Intercept = round((sum_y - Slope * sum(df$x_input))/N, 2)

        return(c(paste("B0", Intercept, sep = "="),
                 paste("B1", Slope, sep = "=")))
}

Let’s try it out:

calculate_least_squares(regression_df)
## [1] "B0=3.57" "B1=1.78"

I got an intercept of 3.57 and a slope of 1.78! Close to the \(\beta_0 = 3\) and \(\beta_1 = 2\) we simulated with. Let’s also calculate the standard error of the residuals:

\[ \sqrt{\frac{\sum_i (y_i - \hat{y}_i)^2}{n - 2}} \]

pop_se <- sqrt(sum(((3.57 + 1.78 * seq(1, 10)) - regression_df[, "y_output_error"])^2)/(10 - 2))

print(pop_se)
## [1] 1.022929

We see that the estimated standard deviation of the error term is 1.023, close to the \(\sigma = 1\) we simulated with.

Verifying with the lm function

Moment of truth: do our findings match the R function?

summary(lm(y_output_error ~ x_input, data= regression_df))
## 
## Call:
## lm(formula = y_output_error ~ x_input, data = regression_df)
## 
## Residuals:
##     Min      1Q  Median      3Q     Max 
## -1.5704 -0.4359  0.2285  0.5284  1.2372 
## 
## Coefficients:
##             Estimate Std. Error t value Pr(>|t|)    
## (Intercept)   3.5892     0.6987   5.137 0.000889 ***
## x_input       1.7758     0.1126  15.769 2.61e-07 ***
## ---
## Signif. codes:  0 '***' 0.001 '**' 0.01 '*' 0.05 '.' 0.1 ' ' 1
## 
## Residual standard error: 1.023 on 8 degrees of freedom
## Multiple R-squared:  0.9688, Adjusted R-squared:  0.9649 
## F-statistic: 248.7 on 1 and 8 DF,  p-value: 2.613e-07

The results match what we calculated by hand for \(\hat{\beta}_0\) (the estimate of the intercept), \(\hat{\beta}_1\) (the estimate for x_input), and \(\sigma\) (the residual standard error)! Now, as an exercise for the reader: how does increasing the sample size affect our estimates?
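
One way to start exploring that question (a sketch only; the helper simulate_and_fit and its defaults are just for illustration):

simulate_and_fit <- function(n, beta0 = 3, beta1 = 2, sigma = 1){
        x <- seq(1, 10, length.out = n) #same input range, more observations
        y <- beta0 + beta1 * x + rnorm(n, mean = 0, sd = sigma)
        coef(lm(y ~ x)) #estimated intercept and slope
}

set.seed(24601)
simulate_and_fit(10)
simulate_and_fit(1000)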

Acknowledgement

Thank you to my partner (Caitlin Lienkaemper) who read the first draft.