1. Assumptions of Linear Regression
Linear regression relies on key statistical assumptions. Violating these can lead to biased or misleading results.
1.1. Linearity
- The relationship between the independent variables (X) and the dependent variable (Y) should be linear.
- Check: Use scatter plots to visualize relationships. If the relationship appears curved, consider polynomial regression or transformations (e.g., log, square root).
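A minimal sketch of this check with matplotlib; the data here is made up and deliberately curved so the pattern is easy to see:

```python
import numpy as np
import matplotlib.pyplot as plt

# Hypothetical data with a curved (non-linear) relationship.
rng = np.random.default_rng(42)
x = rng.uniform(0, 10, 200)
y = 5 + 0.5 * x**2 + rng.normal(0, 2, size=200)

# A scatter plot makes curvature easy to spot.
plt.scatter(x, y, alpha=0.5)
plt.xlabel("x")
plt.ylabel("y")
plt.title("Linearity check: curvature suggests a transform or polynomial term")
plt.show()

# One possible fix: add a squared term before fitting.
X_poly = np.column_stack([x, x**2])
```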
1.2. Independence of Errors
- Residuals (errors) should not be correlated.
- Check: Use the Durbin-Watson test to detect autocorrelation; values near 2 indicate little first-order autocorrelation.
- Fix: If autocorrelation exists, consider time-series models or adding lag variables.
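A quick sketch of the Durbin-Watson check with statsmodels; the data is synthetic and only for illustration:

```python
import numpy as np
import statsmodels.api as sm
from statsmodels.stats.stattools import durbin_watson

# Fit OLS on hypothetical data, then test the residuals.
rng = np.random.default_rng(0)
X = sm.add_constant(rng.normal(size=(100, 2)))
y = X @ np.array([1.0, 2.0, -0.5]) + rng.normal(size=100)

model = sm.OLS(y, X).fit()
dw = durbin_watson(model.resid)
print(f"Durbin-Watson: {dw:.2f}")  # near 2: little first-order autocorrelation
```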
1.3. Homoscedasticity (Constant Variance of Errors)
- The variance of residuals should remain constant across all levels of the independent variables.
- Check: Plot residuals vs. predicted values (should look random).
- Fix: Use a log transformation, Weighted Least Squares (WLS), or heteroscedasticity-robust standard errors.
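Besides eyeballing the residual plot, a formal test such as Breusch-Pagan (not mentioned above, but a standard companion check) can back up the visual inspection. A sketch with statsmodels on made-up heteroscedastic data, ending with the robust-standard-error fix:

```python
import numpy as np
import statsmodels.api as sm
from statsmodels.stats.diagnostic import het_breuschpagan

# Hypothetical data where the noise grows with x (heteroscedastic).
rng = np.random.default_rng(1)
x = rng.uniform(1, 10, 200)
X = sm.add_constant(x)
y = 3 + 2 * x + rng.normal(0, x)

model = sm.OLS(y, X).fit()
lm_stat, lm_pval, _, _ = het_breuschpagan(model.resid, model.model.exog)
print(f"Breusch-Pagan p-value: {lm_pval:.4f}")  # small p: heteroscedasticity

# One fix: refit with heteroscedasticity-robust (HC3) standard errors.
robust = sm.OLS(y, X).fit(cov_type="HC3")
print(robust.bse)  # robust standard errors
```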
1.4. Normality of Residuals
- Residuals should be normally distributed (especially for small datasets).
- Check: Use a Q-Q plot or Shapiro-Wilk test.
- Fix: Apply transformations (log, square root, etc.) if residuals are skewed.
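A short sketch of both normality checks, using synthetic residuals from a statsmodels fit:

```python
import numpy as np
import matplotlib.pyplot as plt
import statsmodels.api as sm
from scipy import stats

# Fit OLS on hypothetical data and grab the residuals.
rng = np.random.default_rng(2)
X = sm.add_constant(rng.normal(size=(100, 1)))
y = X @ np.array([1.0, 2.0]) + rng.normal(size=100)
resid = sm.OLS(y, X).fit().resid

# Shapiro-Wilk: a small p-value suggests departure from normality.
stat, p = stats.shapiro(resid)
print(f"Shapiro-Wilk p-value: {p:.4f}")

# Q-Q plot: points should hug the 45-degree line.
sm.qqplot(resid, line="45", fit=True)
plt.show()
```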
1.5. No Perfect Multicollinearity
- Independent variables should not be highly correlated with each other.
- Check: Use the Variance Inflation Factor (VIF); values above 10 are commonly taken to signal problematic multicollinearity.
- Fix: Remove or combine correlated variables, use Principal Component Analysis (PCA), or Ridge Regression.
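A minimal VIF sketch with statsmodels; the frame is hypothetical, with x3 built as a near-copy of x1 so the inflation is obvious:

```python
import numpy as np
import pandas as pd
from statsmodels.stats.outliers_influence import variance_inflation_factor
from statsmodels.tools.tools import add_constant

rng = np.random.default_rng(3)
df = pd.DataFrame({"x1": rng.normal(size=100), "x2": rng.normal(size=100)})
df["x3"] = df["x1"] + rng.normal(0, 0.01, size=100)  # nearly collinear with x1

X = add_constant(df)
# Skip index 0: the constant's VIF is not meaningful.
for i in range(1, X.shape[1]):
    print(f"{X.columns[i]}: VIF = {variance_inflation_factor(X.values, i):.1f}")
```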
2. Best Practices for Building Linear Regression Models
2.1. Feature Selection
- Avoid including many irrelevant predictors; they cause overfitting.
- Use stepwise selection, Lasso regression, or domain knowledge to select the best predictors.
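A sketch of Lasso-based selection with scikit-learn; in this made-up data only the first three of ten features carry signal:

```python
import numpy as np
from sklearn.linear_model import LassoCV
from sklearn.preprocessing import StandardScaler

rng = np.random.default_rng(4)
X = rng.normal(size=(200, 10))
y = 3 * X[:, 0] - 2 * X[:, 1] + X[:, 2] + rng.normal(size=200)

# LassoCV picks the penalty strength by cross-validation; features whose
# coefficients shrink to exactly zero are effectively dropped.
lasso = LassoCV(cv=5).fit(StandardScaler().fit_transform(X), y)
print("Kept features:", np.flatnonzero(lasso.coef_))
```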
2.2. Scaling Features
- Some models benefit from standardization (mean = 0, variance = 1).
- Standardize features especially when using regularization (Lasso, Ridge), since penalty terms are sensitive to feature scale.
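One way to keep scaling honest is a scikit-learn pipeline, which fits the scaler on the training split only. A sketch on hypothetical data with wildly different feature scales:

```python
import numpy as np
from sklearn.pipeline import make_pipeline
from sklearn.preprocessing import StandardScaler
from sklearn.linear_model import Ridge
from sklearn.model_selection import train_test_split

rng = np.random.default_rng(5)
X = rng.normal(size=(200, 5)) * [1, 10, 100, 0.1, 1]  # mixed scales
y = X @ np.array([1.0, 0.1, 0.01, 5.0, 1.0]) + rng.normal(size=200)

X_tr, X_te, y_tr, y_te = train_test_split(X, y, test_size=0.2, random_state=0)

# The scaler learns its mean/variance from the training data only,
# so no information leaks from the test set.
model = make_pipeline(StandardScaler(), Ridge(alpha=1.0)).fit(X_tr, y_tr)
print(f"Test R²: {model.score(X_te, y_te):.3f}")
```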
2.3. Handling Outliers
- Outliers can distort the regression coefficients.
- Check: Use box plots or leverage diagnostics like Cook’s Distance.
- Fix: Transform data, remove outliers if justified, or use robust regression.
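A sketch of the Cook's Distance check with statsmodels, with one outlier injected into otherwise clean synthetic data:

```python
import numpy as np
import statsmodels.api as sm

rng = np.random.default_rng(6)
x = rng.normal(size=100)
y = 2 + 3 * x + rng.normal(size=100)
y[0] += 20  # inject a hypothetical outlier

model = sm.OLS(y, sm.add_constant(x)).fit()
cooks_d = model.get_influence().cooks_distance[0]

# A common rule of thumb flags points with Cook's D above 4/n.
flagged = np.flatnonzero(cooks_d > 4 / len(y))
print("High-influence points:", flagged)
```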
2.4. Splitting Data for Validation
- Always split data into training and test sets (e.g., 80-20 split).
- Use cross-validation (e.g., k-fold) to assess model performance.
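A sketch of both validation approaches with scikit-learn, again on made-up data:

```python
import numpy as np
from sklearn.linear_model import LinearRegression
from sklearn.model_selection import train_test_split, cross_val_score

rng = np.random.default_rng(7)
X = rng.normal(size=(200, 3))
y = X @ np.array([1.0, -2.0, 0.5]) + rng.normal(size=200)

# 80-20 hold-out split.
X_tr, X_te, y_tr, y_te = train_test_split(X, y, test_size=0.2, random_state=0)
model = LinearRegression().fit(X_tr, y_tr)
print(f"Hold-out R²: {model.score(X_te, y_te):.3f}")

# 5-fold cross-validation gives a more stable performance estimate.
scores = cross_val_score(LinearRegression(), X, y, cv=5, scoring="r2")
print(f"CV R²: {scores.mean():.3f} ± {scores.std():.3f}")
```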
3. Evaluating Model Performance
3.1. Metrics for Linear Regression
- R² (Coefficient of Determination): Measures how much of the variation in the dependent variable the predictors explain. A high R² (close to 1) indicates a good fit but does not imply causation.
- Adjusted R²: Penalizes adding predictors that do not improve the fit, which makes it better than R² for comparing models with different numbers of predictors.
- Mean Squared Error (MSE) / Root Mean Squared Error (RMSE): Measure average prediction error; RMSE is in the same units as the target.
- Mean Absolute Error (MAE): The average absolute difference between predictions and actual values; less sensitive to outliers than MSE.
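All four metrics are one-liners in scikit-learn; a tiny sketch with invented predictions:

```python
import numpy as np
from sklearn.metrics import mean_squared_error, mean_absolute_error, r2_score

# Hypothetical actual vs. predicted values.
y_true = np.array([3.0, 5.0, 7.5, 10.0])
y_pred = np.array([2.8, 5.3, 7.0, 10.4])

mse = mean_squared_error(y_true, y_pred)
print(f"R²:   {r2_score(y_true, y_pred):.3f}")
print(f"MSE:  {mse:.3f}")
print(f"RMSE: {np.sqrt(mse):.3f}")  # same units as the target
print(f"MAE:  {mean_absolute_error(y_true, y_pred):.3f}")
```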
3.2. Avoiding Overfitting
- Too many predictors → high R² on training data but poor generalization.
- Use regularization techniques (Ridge, Lasso) to penalize unnecessary complexity.
- Compare train vs. test performance to check for overfitting.
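A sketch of the train-vs-test comparison: 50 noisy predictors on only 100 rows of made-up data, where plain OLS should overfit and Ridge should narrow the gap:

```python
import numpy as np
from sklearn.linear_model import LinearRegression, Ridge
from sklearn.model_selection import train_test_split

rng = np.random.default_rng(8)
X = rng.normal(size=(100, 50))          # many predictors, few samples
y = 2 * X[:, 0] + rng.normal(size=100)  # only one feature matters

X_tr, X_te, y_tr, y_te = train_test_split(X, y, test_size=0.2, random_state=0)

for name, model in [("OLS", LinearRegression()), ("Ridge", Ridge(alpha=10.0))]:
    model.fit(X_tr, y_tr)
    print(f"{name}: train R² = {model.score(X_tr, y_tr):.2f}, "
          f"test R² = {model.score(X_te, y_te):.2f}")
```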
4. Interpreting Results Properly
- P-values: A p-value below the conventional 0.05 threshold suggests a predictor is statistically significant.
- Coefficient Signs & Magnitudes: Do they align with domain knowledge?
- Confidence Intervals: Provide range estimates for coefficients.
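statsmodels exposes all three of these directly after a fit; a last sketch on synthetic data where x2 deliberately has no true effect:

```python
import numpy as np
import statsmodels.api as sm

rng = np.random.default_rng(9)
X = sm.add_constant(rng.normal(size=(200, 2)))
y = X @ np.array([1.0, 2.0, 0.0]) + rng.normal(size=200)  # x2: zero true effect

model = sm.OLS(y, X).fit()
print(model.pvalues)     # significance of each coefficient
print(model.conf_int())  # 95% confidence intervals
print(model.summary())   # full table with signs, magnitudes, p-values, CIs
```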