Simple Linear Regression
Simple linear regression is a statistical method used to examine the relationship between two continuous variables: one independent variable ($X$) and one dependent variable ($Y$). It determines how $X$ can predict or explain $Y$.
Key Concepts
-
Independent Variable ($X$):
The predictor or explanatory variable. -
Dependent Variable ($Y$):
The response or outcome variable. -
Regression Line:
A line that best fits the data, showing the relationship between $X$ and $Y$. -
Equation of the Regression Line:
\[Y = \beta_0 + \beta_1X + \epsilon\]Where:
- $Y$: Predicted value of the dependent variable
- $\beta_0$: Intercept (value of $Y$ when $X = 0$)
- $\beta_1$: Slope (change in $Y$ for a one-unit increase in $X$)
- $\epsilon$: Error term
Assumptions of Simple Linear Regression
- Linearity: The relationship between $X$ and $Y$ is linear.
- Independence: Observations are independent of each other.
- Homoscedasticity: The variance of residuals (errors) is constant across all levels of $X$.
- Normality: Residuals follow a normal distribution.
Steps to Perform Simple Linear Regression
Step 1: Formulate the Model
Define the dependent ($Y$) and independent ($X$) variables.
Step 2: Collect and Explore Data
Examine the dataset for missing values, outliers, and patterns.
Step 3: Fit the Model
Estimate the coefficients ($\beta_0, \beta_1$) using the least squares method, which minimizes the sum of squared residuals:
\[\text{Residual} = Y_{\text{observed}} - Y_{\text{predicted}}\]Step 4: Evaluate the Model
Assess the fit of the regression line using metrics like $R^2$, p-value, and standard error.
Step 5: Interpret Results
Understand the relationship between $X$ and $Y$ based on the slope and intercept.
Example
Problem:
A researcher wants to study the relationship between study hours ($X$) and test scores ($Y$) for a group of students. The data is:
Study Hours ($X$) | Test Scores ($Y$) |
---|---|
2 | 50 |
4 | 55 |
6 | 60 |
8 | 70 |
10 | 85 |
Solution:
-
Equation of Regression Line:
\[Y = \beta_0 + \beta_1X\] -
Calculate Coefficients:
Using the least squares method:
-
Slope ($\beta_1$):
\[\beta_1 = \frac{\text{Cov}(X, Y)}{\text{Var}(X)}\] -
Intercept ($\beta_0$):
\[\beta_0 = \bar{Y} - \beta_1\bar{X}\]
After calculations:
\[\beta_1 = 3.5, \quad \beta_0 = 46\] -
-
Regression Equation:
\[Y = 46 + 3.5X\] -
Predicted Values:
For $X = 6$:
\[Y = 46 + 3.5(6) = 67\]
Model Evaluation Metrics
- R-Squared ($R^2$): Measures the proportion of variance in $Y$ explained by $X$.
- Value ranges from 0 to 1.
- Higher values indicate a better fit.
-
Standard Error:
Represents the average distance of the observed values from the regression line. - P-Value:
Tests the null hypothesis that there is no relationship between $X$ and $Y$.- If $p \leq \alpha$ (e.g., 0.05), reject the null hypothesis.
Applications of Simple Linear Regression
-
Business:
Predicting sales based on advertising expenses. -
Healthcare:
Estimating blood pressure changes based on age. -
Education:
Analyzing the impact of study time on academic performance.
Visualization of Regression
Scatter Plot with Regression Line:
- A scatter plot shows individual data points.
- The regression line depicts the relationship between $X$ and $Y$.
Conclusion
Simple linear regression is a foundational statistical tool for analyzing relationships between two variables. By understanding its assumptions, calculations, and interpretations, you can derive meaningful insights and make data-driven decisions.
Next Steps: Multiple Regression