Logistic Regression

Logistic regression is a statistical method used to model and predict the probability of a binary outcome (e.g., success/failure, yes/no) based on one or more independent variables. Unlike linear regression, logistic regression predicts probabilities, which are then used to classify outcomes.


Key Concepts

  1. Binary Dependent Variable:
    The outcome variable ($Y$) has two possible values, typically coded as 0 and 1.
    • Example: 1 = Approved, 0 = Not Approved.
  2. Independent Variables:
    These can be continuous, categorical, or a mix of both.
    • Example: Income, age, education level.
  3. Logistic Function (Sigmoid Function):
    Converts the linear combination of inputs into probabilities between 0 and 1.

    \[P(Y=1 \mid X) = \frac{1}{1 + e^{-(\beta_0 + \beta_1X_1 + \beta_2X_2 + \dots + \beta_kX_k)}}\]
  4. Log-Odds Transformation: The logistic model estimates the log-odds of the probability:

    \[\text{Log-Odds} = \ln\left(\frac{P}{1-P}\right) = \beta_0 + \beta_1X_1 + \beta_2X_2 + \dots + \beta_kX_k\]

Assumptions of Logistic Regression

  1. Binary Outcome: The dependent variable is binary.
  2. Independence of Observations: Each observation is independent.
  3. Linearity in the Log-Odds: The relationship between the independent variables and the log-odds is linear.
  4. No Multicollinearity: Independent variables are not highly correlated.

Types of Logistic Regression

  1. Binary Logistic Regression:
    Predicts a binary outcome (e.g., pass/fail).

  2. Multinomial Logistic Regression:
    Predicts outcomes with more than two categories without an order (e.g., types of transportation).

  3. Ordinal Logistic Regression:
    Predicts outcomes with ordered categories (e.g., customer satisfaction levels: low, medium, high).


Steps in Logistic Regression

Step 1: Define the Model

Specify the dependent and independent variables.

Step 2: Fit the Model

Estimate the coefficients ($\beta_0, \beta_1, \dots$) using maximum likelihood estimation (MLE).

Step 3: Interpret Coefficients

  • Positive Coefficient: Increases the log-odds of $Y = 1$.
  • Negative Coefficient: Decreases the log-odds of $Y = 1$.

Step 4: Predict Probabilities

Convert the log-odds to probabilities using the logistic function.

Step 5: Evaluate the Model

Assess model performance using accuracy, precision, recall, and $AUC-ROC$ (Area Under the Curve - Receiver Operating Characteristic).


Example

Problem:

A bank wants to predict whether a loan applicant will default ($Y = 1$) or not ($Y = 0$) based on income ($X_1$) and credit score ($X_2$).

Dataset:

Income ($X_1$) Credit Score ($X_2$) Default ($Y$)
40,000 600 1
50,000 650 0
60,000 700 0
70,000 750 0
80,000 800 1

Solution:

  1. Logistic Model:

    \[P(Y=1 \mid X_1, X_2) = \frac{1}{1 + e^{-(\beta_0 + \beta_1X_1 + \beta_2X_2)}}\]
  2. Estimated Coefficients:

    \[\beta_0 = -10\] \[\beta_1 = 0.0001\] \[\beta_2 = 0.02\]
  3. Prediction for New Applicant: Applicant with income $X_1 = 65,000$ and credit score $X_2 = 720$:

    \[\text{Log-Odds} = -10 + 0.0001(65,000) + 0.02(720) = -10 + 6.5 + 14.4 = 10.9\] \[P(Y=1) = \frac{1}{1 + e^{-10.9}} \approx 0.999\]

    The applicant is very likely to default.


Model Evaluation Metrics

  1. Confusion Matrix:
    • True Positive (TP): Correctly predicted defaults.
    • False Positive (FP): Incorrectly predicted defaults.
    • True Negative (TN): Correctly predicted non-defaults.
    • False Negative (FN): Incorrectly predicted non-defaults.
  2. Accuracy:

    \[\text{Accuracy} = \frac{\text{TP} + \text{TN}}{\text{Total Observations}}\]
  3. Precision:

    \[\text{Precision} = \frac{\text{TP}}{\text{TP} + \text{FP}}\]
  4. Recall:

    \[\text{Recall} = \frac{\text{TP}}{\text{TP} + \text{FN}}\]
  5. AUC-ROC: Measures the model’s ability to distinguish between classes. A higher value indicates better performance.

Applications of Logistic Regression

  1. Healthcare: Predicting the likelihood of a disease based on patient attributes.
  2. Finance: Determining loan defaults or creditworthiness.
  3. Marketing: Classifying customers as likely or unlikely to purchase.

Conclusion

Logistic regression is a powerful and widely used statistical tool for binary classification problems. By understanding its principles, assumptions, and evaluation metrics, you can use logistic regression to make accurate predictions and informed decisions.


Next Steps: ANOVA and MANOVA


Copyright © 2025. Distributed under MIT license.