# Correlation analysis
Correlation analysis explores the relationship between two variables, providing insights into their direction and strength.
## 1. Correlation analysis
Correlation is utilized to assess the relationship between two variables, which can be either continuous or ordinal. It provides two key pieces of information: the direction of the relationship (positive or negative) and the strength of the relationship (weak, medium, or strong). Correlation is frequently explored visually using a scatterplot, where one variable is plotted on the X-axis and the other on the Y-axis. Each point on the scatterplot represents an observation with its corresponding values for the two variables [2](#page=2) [3](#page=3).
### 1.1 The correlation coefficient
The correlation coefficient (often denoted as $r$ for Pearson's and $\rho$ for Spearman's) quantifies the strength and direction of the linear relationship between two variables [3](#page=3).
#### 1.1.1 Range of the correlation coefficient
The correlation coefficient ranges from -1 to 1 [3](#page=3).
#### 1.1.2 Direction of the relationship
* **Positive correlation:** If the correlation coefficient is positive, it indicates a positive association where an increase in one variable is associated with an increase in the other [4](#page=4).
* **Negative correlation:** If the correlation coefficient is negative, it indicates a negative association where an increase in one variable is associated with a decrease in the other [4](#page=4).
* **Zero correlation:** A value of zero suggests no linear association between the variables [4](#page=4).
#### 1.1.3 Strength of the relationship
The strength of the relationship is determined by how close the correlation coefficient is to 1 or -1. While different textbooks may present slightly varied cut-off values, a general guideline for interpreting the strength is as follows [5](#page=5):
* **Perfect correlation:** A value of exactly $\pm 1$.
* **Strong correlation:** A value between $\pm 0.50$ and $\pm 1$.
* **Moderate correlation:** A value between $\pm 0.30$ and $\pm 0.49$.
* **Small correlation:** A value below $\pm 0.30$ (but not zero).
* **No correlation:** A value of zero.
Some scientists suggest the following interpretation guide, irrespective of the sign:
* 0 to 0.19: Very weak
* 0.2 to 0.39: Weak
* 0.40 to 0.59: Moderate
* 0.60 to 0.79: Strong
* 0.80 to 1: Very strong
The significance of a correlation is typically determined by its p-value; a correlation is considered significant if the p-value is less than 0.05 [5](#page=5).
#### 1.1.4 Coefficient of determination ($R^2$)
The coefficient of determination ($R^2$) represents the proportion of variance in one variable that can be explained by the other variable. It is calculated by squaring the correlation coefficient ($r^2$). For example, if $r = -0.67$, then $R^2 = (-0.67)^2 = 0.44$. This indicates that 44% of the variation in one variable (e.g., BMI) can be accounted for by knowing the value of the other variable (e.g., physical activity) [6](#page=6).
> **Tip:** Always plot your data using a scatterplot before conducting a correlation analysis to visually confirm if the relationship appears linear [6](#page=6).
### 1.2 Types of correlation
The choice of correlation coefficient depends on the nature of the data.
#### 1.2.1 Pearson's correlation coefficient ($r$)
Pearson's correlation is appropriate for parametric data, meaning it is used for numerical variables that are normally distributed [6](#page=6).
#### 1.2.2 Spearman's rank correlation coefficient ($\rho$)
Spearman's rho coefficient is used for ordinal data (ranked data) or when the assumptions of normality for the data are not met. It is considered the non-parametric equivalent of Pearson's correlation [6](#page=6).
> **Tip:** The decision tree for choosing a correlation test involves considering whether the analysis is bivariate or multivariable, whether it assesses a difference or a correlation, whether the samples are independent or paired, the type of outcome variable (continuous or ordinal, and its normality), and the number of groups [2](#page=2).
| Feature | Pearson's correlation ($r$) (Parametric) | Spearman's correlation ($\rho$) (Non-parametric) |
| :---------------------- | :----------------------------------------- | :----------------------------------------------- |
| **Variables** | Two numerical variables | Ordinal or numerical variables |
| **Relationship Type** | Linear relationship | Monotonic (linear or curvilinear) relationship |
| **Data Distribution** | Normal distribution (for at least one of the two variables) | No specific distribution assumption |
#### 1.2.3 When to use each
* Pearson's correlation can be used for two numerical variables with a linear relationship and normally distributed data [6](#page=6) [7](#page=7).
* Spearman's correlation can be used for ordinal variables or when normality assumptions are violated, and it assesses monotonic relationships [6](#page=6) [7](#page=7).
* Neither can be used for non-monotonic relationships [7](#page=7).
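As a rough illustration (a tooling assumption: the source works in SPSS, but SciPy provides the same coefficients in Python), the sketch below computes both coefficients on a small set of invented physical-activity and BMI values:

```python
# Minimal sketch: Pearson's r (parametric) and Spearman's rho (rank-based)
# with SciPy. The activity and BMI values below are invented for illustration.
from scipy import stats

physical_activity = [2, 4, 5, 7, 8, 10, 12, 14]   # e.g., hours per week
bmi               = [31, 30, 29, 27, 26, 25, 24, 22]

r, p_pearson = stats.pearsonr(physical_activity, bmi)       # linear relationship
rho, p_spearman = stats.spearmanr(physical_activity, bmi)   # monotonic relationship

print(f"Pearson  r   = {r:.2f} (p = {p_pearson:.3f}), R^2 = {r**2:.2f}")
print(f"Spearman rho = {rho:.2f} (p = {p_spearman:.3f})")
```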
### 1.3 Important considerations
* **Correlation does not imply causation:** The presence of a correlation between two variables does not mean that one variable causes the other; there might be confounding factors or the relationship could be coincidental [7](#page=7).
> **Example:** Ice cream sales and crime rates often show a positive correlation. However, ice cream does not cause crime, nor does crime cause people to buy ice cream. The confounding variable is likely temperature – both ice cream sales and crime rates tend to increase during warmer weather.
### 1.4 Finding correlation analysis in SPSS
In SPSS, correlation analysis can typically be found under: Analyze $\rightarrow$ Correlate $\rightarrow$ Bivariate [7](#page=7).
### 1.5 Reporting significant results
When reporting a significant correlation, it is important to include the type of correlation used, whether the relationship was statistically significant, its direction, strength, the correlation coefficient ($r$ or $\rho$), the p-value, and the coefficient of determination ($R^2$) if applicable [7](#page=7).
> **Example Reporting:** "A Pearson's correlation was run to assess the relationship between BMI and physical activity among a sample of university students. There was a statistically significant, strong negative correlation between BMI and physical activity, $r = -0.67$, $p = 0.035$, with physical activity explaining about 44% of the variation in BMI." [7](#page=7).
### 1.6 Correlation matrix
A correlation matrix is a table that summarizes the correlation coefficients between several pairs of continuous variables. This format allows for easy identification of the strongest and weakest correlations among multiple variables. The diagonal of a correlation matrix always contains ones, as the correlation of a variable with itself is perfect [8](#page=8).
> **Example Correlation Matrix:**
>
> | | English | Math | Writing | Reading |
> | :------ | :------ | :---- | :------ | :------ |
> | English | 1 | | | |
> | Math | 0.271 | 1 | | |
> | Writing | 0.366 | 0.149 | 1 | |
> | Reading | 0.386 | 0.520 | 0.152 | 1 |
>
> In this matrix, the strongest correlation is between Math and Reading scores ($r=0.520$), while the weakest is between Math and Writing scores ($r=0.149$) [8](#page=8).
### 1.7 Heatmap
A heatmap provides a graphical representation of a correlation matrix. Colors are used to distinguish between positive and negative correlations, with one color (e.g., blue) typically representing positive relationships and another (e.g., red) representing negative ones. The intensity of the color indicates the strength of the correlation. Heatmaps can be generated using statistical software or spreadsheet programs like Excel [9](#page=9).
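For readers working outside SPSS or Excel, a correlation matrix and heatmap can also be produced with a few lines of Python; the sketch below is a minimal example assuming pandas, seaborn, and matplotlib are available, and the exam-score data are simulated rather than taken from the matrix above.

```python
# Minimal sketch: correlation matrix with pandas, heatmap with seaborn.
# The scores are simulated around a shared "ability" factor for illustration.
import numpy as np
import pandas as pd
import seaborn as sns
import matplotlib.pyplot as plt

rng = np.random.default_rng(0)
ability = rng.normal(0, 1, 100)
scores = pd.DataFrame({
    "English": 70 + 8 * ability + rng.normal(0, 6, 100),
    "Math":    65 + 6 * ability + rng.normal(0, 8, 100),
    "Writing": 72 + 5 * ability + rng.normal(0, 5, 100),
    "Reading": 68 + 7 * ability + rng.normal(0, 7, 100),
})

corr = scores.corr(method="pearson")      # use method="spearman" for ranked data
print(corr.round(3))

# Diverging palette: one hue for positive, another for negative correlations.
sns.heatmap(corr, annot=True, cmap="coolwarm", vmin=-1, vmax=1)
plt.title("Correlation heatmap")
plt.show()
```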
### 1.8 Online calculators
Several online calculators are available for computing correlation coefficients:
* Pearson's correlation: [https://www.socscistatistics.com/tests/pearson/default2.aspx](https://www.socscistatistics.com/tests/pearson/default2.aspx) [9](#page=9).
* Spearman's correlation: [https://www.socscistatistics.com/tests/spearman/default2.aspx](https://www.socscistatistics.com/tests/spearman/default2.aspx) [9](#page=9).
---
# Simple linear regression
Simple linear regression quantifies the linear relationship between two variables for prediction and understanding impact [13](#page=13).
## 2. Simple linear regression
### 2.1 Purpose and applications of simple linear regression
Simple linear regression is utilized for two primary purposes:
* **Studying associations:** Similar to correlation, it examines the relationship between two variables [13](#page=13).
* **Quantifying relationships:** It goes beyond correlation by generating a regression equation that describes the relationship, enabling prediction of the outcome variable [13](#page=13).
**Uses include:**
* Evaluating the impact of an independent variable on an outcome [13](#page=13).
* Predicting the outcome variable using the independent variable [13](#page=13).
> **Tip:** While it studies associations, remember that correlation and regression do not imply causation [18](#page=18).
### 2.2 Variables in simple linear regression
* **Dependent variable (outcome, y):** This is the variable being predicted and must be numerical [13](#page=13).
* **Independent variable (predictor, x):** This variable is used to predict the outcome. It can be numerical, ordinal, or categorical [13](#page=13).
### 2.3 The regression equation and its components
The fundamental equation for simple linear regression is represented as:
$$y = b_0 + b_1x$$ [14](#page=14).
**Components of the equation:**
* **y:** The outcome or dependent variable [14](#page=14).
* **x:** The predictor or independent variable [14](#page=14).
* **$b_0$ (Intercept or constant):** This represents the value of the dependent variable ($y$) when the independent variable ($x$) is zero [14](#page=14).
> **Example:** In a model predicting salary based on years of experience, an intercept of 1500 dollars signifies that a fresh graduate with zero years of experience is expected to earn 1500 dollars [15](#page=15).
> **Caveat:** The interpretation of the intercept is only meaningful if $x=0$ is a plausible or possible value within the context of the data. For instance, a waist circumference of zero is not physically possible, making the intercept in such a model uninterpretable [15](#page=15).
* **$b_1$ (Slope):** This indicates the amount of change (positive or negative) in the dependent variable ($y$) for each one-unit increase in the independent variable ($x$) [14](#page=14).
> **Example:** If the slope ($b_1$) is 0.45 in a model predicting HbA1c from blood glucose, it means that for every one mmol/L increase in blood glucose, HbA1c is expected to increase by 0.45% [13](#page=13).
> **Example:** In the salary prediction model, a slope of 250 dollars means that for each additional year of experience, an employee's salary is expected to increase by 250 dollars [15](#page=15).
### 2.4 Fitting the regression line: The least squares method
Linear regression aims to fit the best straight line through the data points. The most common method for achieving this is the **least squares method**. This method finds the line that minimizes the sum of the squared vertical distances (residuals) between each data point and the line [16](#page=16).
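The sketch below shows a least-squares fit in Python using statsmodels (an assumption; the source uses SPSS). The years-of-experience and salary values are invented to echo the salary example used earlier in this section.

```python
# Minimal sketch: fitting y = b0 + b1*x by ordinary least squares.
import numpy as np
import statsmodels.api as sm

years  = np.array([0, 1, 2, 3, 4, 5, 6, 7, 8])                  # experience
salary = np.array([1500, 1800, 2000, 2300, 2500, 2700, 3050, 3200, 3500])

X = sm.add_constant(years)        # adds the intercept column (b0)
model = sm.OLS(salary, X).fit()   # minimizes the sum of squared residuals

b0, b1 = model.params
print(f"Intercept b0 = {b0:.1f}, slope b1 = {b1:.1f}")
print(f"R^2 = {model.rsquared:.3f}")
```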
### 2.5 Residuals
Residuals represent the difference between the observed (actual) values and the predicted values from the regression model. They are the errors in prediction [17](#page=17).
The population model includes an error term ($\varepsilon$):
$$Y = \beta_0 + \beta_1X + \varepsilon$$ [17](#page=17).
The residual ($e$) for a specific observation is its observed value minus the value predicted by the fitted line:
$$e = Y - (b_0 + b_1X)$$ [17](#page=17).
This is equivalent to:
$$e = \text{observed value} - \text{predicted value}$$ [17](#page=17).
* Residuals represent the vertical distance of each data point from the regression line [17](#page=17).
* A good regression model is characterized by small residuals [17](#page=17).
> **Example:** If an employee with 2 years of experience is predicted to earn 2000 dollars, their actual salary might be 2000, 2100, or 1800 dollars. The residuals would be 0, 100, and -200 dollars, respectively [17](#page=17).
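A small worked sketch of that residual calculation, reusing the hypothetical salary model from the examples above (intercept 1500, slope 250 per year of experience):

```python
# Residual = observed value - predicted value.
import numpy as np

b0, b1 = 1500.0, 250.0                        # hypothetical fitted coefficients
years    = np.array([2, 2, 2])                # three employees, 2 years each
observed = np.array([2000.0, 2100.0, 1800.0])

predicted = b0 + b1 * years                   # 2000 for every employee
residuals = observed - predicted              # 0, 100, -200

print(predicted)    # [2000. 2000. 2000.]
print(residuals)    # [   0.  100. -200.]
```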
### 2.6 Coefficient of determination ($R^2$)
The coefficient of determination, denoted as $R^2$, quantifies the proportion of variability in the dependent variable that can be explained by the independent variable through their linear relationship [18](#page=18).
* $R^2$ can be obtained by squaring the Pearson's correlation coefficient ($r$) between the two variables ($R^2 = r^2$) [18](#page=18).
* It is typically expressed as a percentage, indicating the proportion of variance in the outcome variable explained by the predictor [18](#page=18).
* The value of $R^2$ ranges from 0 to 1 (or 0% to 100%) [18](#page=18).
* $R^2 = 1$ indicates perfect predictability [18](#page=18).
* $R^2 = 0$ indicates no predictive capability of the model [18](#page=18).
* It's important to note that "explained" does not imply causality [18](#page=18).
> **Example:** If $R^2 = 0.83$ in a model predicting BMI from waist circumference, it means that 83% of the variability in BMI can be explained by waist circumference using this linear model [18](#page=18).
**Adjusted $R^2$ ($R^2_{adj}$):** This is a modified version of $R^2$ used primarily in multiple linear regression. It accounts for the sample size and the number of predictors in the model, giving a more conservative measure of the model's fit [18](#page=18).
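The sketch below (assuming statsmodels and SciPy, with simulated waist-circumference and BMI data) checks numerically that $R^2 = r^2$ in the simple linear case and shows where the adjusted $R^2$ is reported:

```python
# Minimal sketch: R^2 equals the squared Pearson correlation in simple
# linear regression; statsmodels also reports an adjusted R^2.
import numpy as np
import statsmodels.api as sm
from scipy import stats

rng = np.random.default_rng(1)
waist = rng.normal(90, 10, 50)                        # simulated predictor
bmi = 0.39 * waist - 7.5 + rng.normal(0, 1.5, 50)     # simulated outcome

r, _ = stats.pearsonr(waist, bmi)
model = sm.OLS(bmi, sm.add_constant(waist)).fit()

print(f"r^2          = {r**2:.3f}")
print(f"model R^2    = {model.rsquared:.3f}")       # same value as r^2
print(f"adjusted R^2 = {model.rsquared_adj:.3f}")   # penalizes extra predictors
```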
### 2.7 Simple linear regression versus correlation
Both simple linear regression and correlation are used to assess the association between two numerical variables [19](#page=19).
* The correlation coefficient between variable X and variable Y is the same as the correlation coefficient between Y and X [19](#page=19).
* However, the regression of Y on X ($Y = b_0 + b_1X$) yields a different equation and results compared to the regression of X on Y ($X = b_0 + b_1Y$) [19](#page=19).
* The sign (positive or negative) of the slope coefficient in a regression line is consistent with the sign of the correlation coefficient [19](#page=19).
### 2.8 Checking the model fit
To determine if a simple linear regression model is a "good" model for prediction, several checks are performed [19](#page=19):
1. **Check $R^2$ (or Adjusted $R^2$):** A higher $R^2$ value generally indicates better predictive power [19](#page=19).
2. **Check the significance of the ANOVA model:** A statistically significant p-value from the ANOVA output suggests the model has a good fit [19](#page=19).
3. **Check model assumptions:** Ensuring the underlying assumptions of linear regression are met is crucial for model validity [19](#page=19).
### 2.9 Assumptions of simple linear regression
For a simple linear regression model to have a good fit and reliable predictions, the following assumptions should ideally be satisfied [20](#page=20):
1. **Linearity:** There must be a linear relationship between the predictor variable ($x$) and the outcome variable ($y$). This is best checked visually with a scatterplot of the data before modeling [20](#page=20).
2. **No significant outliers:** Outliers are data points far from the general trend of the data and can negatively impact the model's predictive ability. Techniques like case-wise diagnostics and Cook's distance (should be less than 4/n) are used to identify significant outliers [20](#page=20).
3. **Independence of observations (residuals):** The observations (or residuals) should be independent of each other. Knowing the value of one case should not provide information about the value of another. The Durbin-Watson statistic is used to check for autocorrelation; values between 1.5 and 2.5 typically indicate independence [20](#page=20).
4. **Normality of residuals:** The residuals (errors) should be approximately normally distributed. This can be assessed using histograms or normal probability plots of the residuals (or standardized residuals) [20](#page=20).
5. **Homoscedasticity:** This assumption states that the variance of the outcome variable ($y$) is constant across all levels of the predictor variable ($x$). Visually, this means the spread of residuals is consistent along the regression line. Heteroscedasticity occurs when the spread of residuals changes systematically with the predictor variable. This is checked by plotting standardized residuals against standardized predicted values; a "funnel" shape indicates heteroscedasticity, while a consistent scatter suggests homoscedasticity [20](#page=20).
> **Example of Homoscedasticity:** A plot where the residuals are evenly distributed around the regression line with a constant variance [21](#page=21).
> **Example of Heteroscedasticity:** A plot where the spread of residuals increases or decreases as the predictor variable changes, forming a 'fan' or 'cone' shape [21](#page=21).
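As a rough illustration of two of these checks, the Python sketch below (assuming statsmodels and matplotlib, with simulated data) computes the Durbin-Watson statistic and draws a residuals-versus-fitted plot that can be inspected for a funnel shape:

```python
# Minimal sketch: Durbin-Watson for independence, residual plot for
# homoscedasticity. Waist/BMI-like data are simulated for illustration.
import numpy as np
import statsmodels.api as sm
from statsmodels.stats.stattools import durbin_watson
import matplotlib.pyplot as plt

rng = np.random.default_rng(2)
x = rng.uniform(60, 120, 100)                  # e.g., waist circumference
y = 0.4 * x - 8 + rng.normal(0, 2, 100)        # e.g., BMI with random error

model = sm.OLS(y, sm.add_constant(x)).fit()
residuals = model.resid

# Values between roughly 1.5 and 2.5 suggest independent residuals.
print(f"Durbin-Watson = {durbin_watson(residuals):.2f}")

# A 'funnel' or 'fan' shape here would suggest heteroscedasticity.
plt.scatter(model.fittedvalues, residuals)
plt.axhline(0, linestyle="--")
plt.xlabel("Fitted values")
plt.ylabel("Residuals")
plt.show()
```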
### 2.10 Avoiding extrapolation
A critical consideration when using regression models is to avoid extrapolation. This means the model should not be used to predict outcomes for predictor variable values that fall outside the range of values used to create the model [21](#page=21).
> **Example:** If a salary prediction model was built using data from employees with 0-10 years of experience, it should not be used to predict the salary of an employee with 15 years of experience, as this falls outside the original data range [21](#page=21).
> **Example:** Similarly, using extremely small waist circumference values to predict BMI in a model derived from a broader range of waist circumferences could lead to unreliable predictions [21](#page=21).
---
# Multiple linear regression
Multiple linear regression extends simple linear regression by incorporating multiple predictor variables to estimate an outcome variable [22](#page=22).
## 3. Multiple linear regression
### 3.1 Model structure and variables
Multiple linear regression models aim to explain a single dependent variable (outcome, $y$) using one or more independent variables (predictors, $x$) [22](#page=22).
* **Dependent variable:** Must be numerical [22](#page=22).
* **Independent variables:** Can be numerical, ordinal, or categorical [22](#page=22).
The general form of the multiple linear regression equation is:
$$y = b_0 + b_1x_1 + b_2x_2 + \dots$$
where:
* $y$ is the dependent variable [22](#page=22).
* $x_1, x_2, \dots$ are the independent variables [22](#page=22).
* $b_0$ is the intercept coefficient [22](#page=22).
* $b_1, b_2, \dots$ are the slope coefficients, representing the change in the dependent variable for a one-unit increase in the corresponding independent variable, while controlling for all other predictors in the model [22](#page=22).
#### 3.1.1 Handling different variable types
##### 3.1.1.1 Numerical predictor variables
Numerical predictors are directly included in the regression equation. For example, adding 'age' to a model predicting BMI based on 'waist circumference' might result in:
$$\text{BMI} = -7.53 + 0.39(\text{waist circumference}) - 0.05(\text{age})$$
Interpretation involves assessing the change in the outcome for a unit increase in a predictor, while holding other predictors constant. The intercept interpretation may not be meaningful if its value (e.g., zero waist circumference) is not plausible within the data's context [22](#page=22).
##### 3.1.1.2 Binary predictor variables
Binary categorical variables (e.g., gender coded as 1 for males and 2 for females) are included in the model. The coefficient for a binary predictor represents the difference in the outcome between the two groups, with one group serving as the reference category.
For instance, in a model predicting BMI:
$$\text{BMI} = -10.99 + 0.40(\text{waist circumference}) - 0.05(\text{age}) + 2.13(\text{gender})$$
If males are the reference category (coded as 1), the coefficient of 2.13 for gender indicates that the mean BMI for females (coded as 2) is 2.13 units higher than for males, controlling for waist circumference and age [23](#page=23).
##### 3.1.1.3 Categorical predictor variables with multiple categories
For categorical predictors with more than two levels (e.g., smoking status: non-smoker, ex-smoker, current smoker), dummy variables are created. One category is designated as the reference category (e.g., non-smoker), and dummy variables are generated for each of the remaining categories. Each dummy variable receives a code of 1 if an individual belongs to that category and 0 otherwise.
* **Example of dummy variable creation (non-smoker as reference):**
* Non-smoker: ex-smoker dummy = 0, current smoker dummy = 0
* Ex-smoker: ex-smoker dummy = 1, current smoker dummy = 0
* Current smoker: ex-smoker dummy = 0, current smoker dummy = 1
The regression model then includes these dummy variables, and their coefficients are interpreted as the difference in the outcome compared to the reference category. Some statistical software can handle categorical variables directly without manual dummy variable creation [24](#page=24).
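As an illustration of manual dummy coding, the sketch below uses `pandas.get_dummies` (a tooling assumption; SPSS and other packages can handle this automatically, as noted above) with non-smoker as the reference category:

```python
# Minimal sketch: dummy variables for smoking status, non-smoker as reference.
import pandas as pd

df = pd.DataFrame({
    "smoking": ["non-smoker", "ex-smoker", "current smoker",
                "non-smoker", "current smoker"],
})

# One column per category; dropping the reference column by name makes
# 'non-smoker' the reference category in a subsequent regression.
dummies = pd.get_dummies(df["smoking"], prefix="smoking")
dummies = dummies.drop(columns=["smoking_non-smoker"])
print(dummies.astype(int))
```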
##### 3.1.1.4 Ordinal predictor variables
Ordinal variables can be handled in two ways:
1. **Treated as continuous:** Each level increase in the ordinal variable is assigned a numerical value, and the coefficient reflects the change in the outcome for each unit increase. For example, if pain is coded 0, 1, 2, 3, 4, the coefficient represents the change in outcome per pain level increment [25](#page=25).
2. **Treated as categorical:** Similar to other categorical variables, one level is set as the reference category, and coefficients represent differences between other levels and the reference [25](#page=25).
### 3.2 Checking model fit
Assessing the quality and predictive power of a multiple linear regression model involves several key indicators:
* **Adjusted $R^2$:** This value indicates the proportion of variance in the dependent variable explained by the model, adjusted for the number of predictors and sample size. A higher adjusted $R^2$ suggests a better fit [25](#page=25).
* **ANOVA model significance:** A statistically significant p-value from the ANOVA table associated with the regression model indicates that the model as a whole is a good fit for the data [25](#page=25).
* **Model assumptions:** Verifying that the assumptions of multiple linear regression are met is crucial for the validity of the results [25](#page=25).
### 3.3 Assumptions of multiple linear regression
The assumptions for multiple linear regression are similar to simple linear regression, with the critical addition of assessing multicollinearity [26](#page=26).
* **No multicollinearity:** This occurs when two or more independent variables are highly correlated, meaning they measure similar concepts.
* **Detection methods:**
* **Correlation coefficients:** A correlation matrix of all predictor variables can reveal strong pairwise correlations (e.g., magnitude $\ge 0.80$) [26](#page=26).
* **Variance Inflation Factor (VIF):** VIF values should ideally be below 5.0, and generally below 10.0; higher values indicate multicollinearity (a minimal sketch of this check follows this list) [26](#page=26).
* **Linearity:** The relationship between each independent variable and the dependent variable is linear. (Implicit from simple linear regression assumptions) [26](#page=26).
* **Independence of errors:** Residuals are independent of each other. (Implicit from simple linear regression assumptions) [26](#page=26).
* **Homoscedasticity:** The variance of the errors is constant across all levels of the independent variables. (Implicit from simple linear regression assumptions) [26](#page=26).
* **Normality of errors:** The residuals are normally distributed. (Implicit from simple linear regression assumptions) [26](#page=26).
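A minimal sketch of the VIF check referenced in the list above, using statsmodels with simulated predictors (waist and weight are generated to be deliberately correlated, so their VIFs come out elevated):

```python
# Minimal sketch: variance inflation factors for each predictor.
import numpy as np
import pandas as pd
import statsmodels.api as sm
from statsmodels.stats.outliers_influence import variance_inflation_factor

rng = np.random.default_rng(3)
waist  = rng.normal(90, 10, 200)
weight = 0.9 * waist + rng.normal(0, 3, 200)   # deliberately correlated with waist
age    = rng.uniform(20, 70, 200)

X = sm.add_constant(pd.DataFrame({"waist": waist, "weight": weight, "age": age}))
vif = pd.Series(
    [variance_inflation_factor(X.values, i) for i in range(X.shape[1])],
    index=X.columns,
)
# Predictors with VIF above roughly 5-10 are flagged for multicollinearity.
print(vif.drop("const").round(2))
```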
### 3.4 Model building strategies
Choosing which variables to include in the final regression model is a significant challenge. The strategy depends on the primary goal of the model (prediction vs. adjustment) [27](#page=27).
#### 3.4.1 Model building for predictive purposes
The aim is to achieve the best predictive model, balancing predictive capability with model parsimony (fewer predictors) [27](#page=27).
* **Approaches:**
1. **Automatic variable selection methods:** The software automatically selects predictors based on statistical criteria. These are generally recommended when there is little prior knowledge about which variables are relevant [27](#page=27).
* **Forward selection:** Starts with an empty model and adds variables one by one based on statistical significance (lowest p-value) until no remaining variable meets the entry criterion; unlike stepwise selection, variables already entered are not removed [27](#page=27).
* **Backward elimination:** Starts with all variables in the model and removes the least significant variable at each step (highest p-value) until no more significant variables can be removed. Variables with a p-value < 0.05 are typically retained [28](#page=28).
* **Stepwise selection:** A combination of forward and backward methods. Variables are added or removed, and then variables already in the model are re-evaluated for significance. This ensures all variables in the final model have a p-value < 0.05 [28](#page=28).
* **Criterion for selection/removal:** Change in adjusted $R^2$ upon entry or removal of a variable can also guide the process [28](#page=28).
2. **Manual variable selection method:** The researcher decides which variables to include, often based on prior knowledge, literature review, or hypotheses. This is suitable when the goal is to control for confounders or adjust for other factors. This method is often referred to as the "enter" method in software like SPSS [28](#page=28).
#### 3.4.2 Model building for non-predictive purposes (adjustment)
When the primary goal is to adjust for confounding factors, several strategies can be employed:
* **Automatic selection methods:** Similar to predictive purposes, but may risk omitting clinically important variables if they don't reach statistical significance [29](#page=29).
* **Include variables significant in univariate regression:** Select variables with a p-value below a certain threshold (e.g., 0.2) in initial simple regression analyses [29](#page=29).
* **Include all studied or clinically important variables:** Based on literature review and researcher expertise, all relevant variables or those deemed clinically important are included [29](#page=29).
* **Combination of methods:** A mixed approach incorporating elements from the above strategies [29](#page=29).
### 3.5 Reporting regression output
When reporting multiple linear regression results, several components should be included:
1. **Constant/Intercept:** Important for predictive models [30](#page=30).
2. **Coefficients (slopes):** Essential for interpretation, indicating the magnitude and direction of the relationship between predictors and the outcome [30](#page=30).
3. **P-values:** Indicate the statistical significance of each predictor [30](#page=30).
4. **95% Confidence Intervals (CI) of coefficients:** Provide a range of plausible values for the true coefficient. A CI containing zero suggests non-significance [30](#page=30).
5. **Model fit statistics:** For predictive models, report adjusted $R^2$ and the model's overall p-value [30](#page=30).
6. **Model diagnostics:** Report findings related to assumption checks [30](#page=30).
It is also common practice to present results from simple linear regressions alongside multiple regression results in the same table for comparison [30](#page=30).
**Example reporting table:**
| Variables | Coefficients | P-value | 95% CI of the coefficients |
| :--------------------- | :----------- | :------ | :------------------------- |
| Waist circumference | 0.40 | <0.001 | 0.39, 0.40 |
| Age | -0.05 | <0.001 | -0.06, -0.05 |
| Gender (Female vs Male)| 2.13 | <0.001 | 2.00, 2.26 |
| Constant | -8.86 | <0.001 | -9.27, -8.45 |
| Adjusted $R^2$ | 0.874 | | |
> **Tip:** While online calculators exist for simpler multiple regression models, complex analyses typically require statistical software such as SPSS, R, or Stata [30](#page=30).
---
# Logistic regression
Logistic regression is a statistical method used to model the relationship between a binary outcome variable and one or more predictor variables [31](#page=31).
## 4. Logistic regression
Logistic regression is a statistical technique used to analyze situations where the dependent variable is binary (dichotomous), meaning it can take on only two possible outcomes. This is in contrast to linear regression, where the outcome variable is numerical and continuous. Logistic regression is widely used in medical and epidemiological studies to investigate potential risk factors for diseases or complications, where the outcome could be "disease/no disease," "complication/no complication," or "recurrence/no recurrence" [31](#page=31) [35](#page=35).
### 4.1 Simple logistic regression
Simple logistic regression examines the association between a single predictor variable and a binary outcome [31](#page=31).
#### 4.1.1 Types of variables
* **Dependent variable (outcome, y):** Must be a binary variable [31](#page=31).
* **Independent variable (predictor, x):** Can be numerical, ordinal, or categorical [31](#page=31).
#### 4.1.2 The logistic regression equation
The core of logistic regression involves a logarithmic transformation. The probability of the outcome ($p$) is modeled using the following equation:
$$ \log\left(\frac{p}{1-p}\right) = b_0 + b_1 x $$ [31](#page=31).
Where:
* $p$ represents the probability of the outcome occurring (e.g., having a disease) [31](#page=31).
* $1-p$ represents the probability of the outcome not occurring (e.g., not having a disease) [31](#page=31).
* $b_0$ is the intercept [31](#page=31).
* $b_1$ is the regression coefficient for the predictor variable $x$ [31](#page=31).
> **Tip:** The regression equation itself is rarely used directly in medical practice. Instead, the focus is on interpreting the coefficients after they have been transformed [31](#page=31).
#### 4.1.3 Odds ratios (OR)
The regression coefficient ($b_1$) is back-transformed from the log scale to the natural scale to yield the odds ratio (OR). The OR is a crucial measure of association in logistic regression. It indicates the change in odds of the outcome occurring for a one-unit increase in the predictor variable [31](#page=31).
* If there is no association between the predictor and the outcome, the coefficient ($b$) will be 0, and the OR (exp(b)) will be 1 [31](#page=31).
> **Tip:** Be careful when interpreting logistic regression output; ensure you are looking at the coefficient ($b$) or the exponential of the coefficient (exp(b)), which is the odds ratio [31](#page=31).
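A minimal sketch of these ideas in Python, assuming statsmodels is available; the waist-circumference values and binary diabetes outcome are simulated, and the fitted coefficient is exponentiated to obtain the OR:

```python
# Minimal sketch: simple logistic regression and OR = exp(b).
import numpy as np
import statsmodels.api as sm

rng = np.random.default_rng(4)
waist = rng.normal(95, 12, 500)
# Simulate a binary outcome whose log-odds rise with waist circumference.
log_odds = -6 + 0.05 * waist
diabetes = rng.binomial(1, 1 / (1 + np.exp(-log_odds)))

model = sm.Logit(diabetes, sm.add_constant(waist)).fit(disp=0)

b = model.params[1]                      # coefficient on the log-odds scale
print(f"coefficient b     = {b:.3f}")
print(f"odds ratio exp(b) = {np.exp(b):.3f}")       # OR per 1-unit increase
print(np.exp(model.conf_int()[1]))                  # 95% CI on the OR scale
```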
#### 4.1.4 Interpretation of odds ratios
The interpretation of the OR depends on the type of predictor variable:
* **Continuous predictor variable:**
* An OR > 1 indicates that as the predictor increases, the odds of the outcome occurring also increase [34](#page=34).
* An OR < 1 indicates that as the predictor increases, the odds of the outcome occurring decrease [34](#page=34).
* An OR = 1 indicates no association or change [34](#page=34).
> **Example:** If the OR for waist circumference and diabetes is 1.04, it means that for each one-unit increase in waist circumference (e.g., 1 cm), the odds of being diabetic are multiplied by 1.04. An increase of 3 units multiplies the odds by $1.04^3$, not by $1.04 \times 3$ [32](#page=32).
* **Binary predictor variable:**
* The OR compares the odds of the outcome in the group coded as '1' to the odds in the group coded as '0' [34](#page=34).
* An OR > 1 means the odds of the outcome are higher in the group coded as '1' [34](#page=34).
* An OR < 1 means the odds of the outcome are lower in the group coded as '1' [34](#page=34).
* An OR = 1 means no association [34](#page=34).
> **Example:** If hypertension is coded as 0=no and 1=yes, and the OR for hypertension and diabetes is 2.30, it means that for patients with hypertension, the odds of having diabetes are 2.3 times the odds of having diabetes among patients without hypertension [32](#page=32).
* **Categorical predictor variable:**
* The OR compares the odds of the outcome in each category to a designated **reference category**. The number of ORs reported is typically one less than the number of categories [34](#page=34).
* An OR > 1 for a category means the odds of the outcome are higher in that category compared to the reference category [34](#page=34).
* An OR < 1 means the odds of the outcome are lower in that category compared to the reference category [34](#page=34).
* An OR = 1 means no difference from the reference category [34](#page=34).
> **Example:** In a study of smoking status and bladder cancer, if "never smokers" is the reference category:
> * Occasional smokers with an OR of 1.5 have 1.5 times the odds of bladder cancer compared to never smokers [33](#page=33).
> * Former smokers with an OR of 2.3 have 2.3 times the odds of bladder cancer compared to never smokers [33](#page=33).
> * Current smokers with an OR of 5.2 have 5.2 times the odds of bladder cancer compared to never smokers [33](#page=33).
#### 4.1.5 Differences from linear regression
The key differences are summarized below [35](#page=35):

| Feature | Linear Regression | Logistic Regression |
| :-------------------------------- | :----------------------------- | :------------------------------------------------------------------ |
| **Outcome variable** | Numerical (continuous) | Binary |
| **Measure interpreted** | Coefficient ($b$) | Exponential of the coefficient (exp(b)), i.e., the odds ratio (OR) |
| **CI crosses (non-significance)** | 0 | 1 |
> **Example:**
> * Linear: $b=1.08$ (95% CI 0.77 to 1.40) is statistically significant.
> * Linear: $b=1.08$ (95% CI -0.98 to 1.36) is statistically non-significant.
> * Logistic: an OR with 95% CI 0.80 to 0.98 is statistically significant (the interval does not cross 1).
> * Logistic: an OR with 95% CI 0.89 to 5.55 is statistically non-significant (the interval crosses 1) [35](#page=35).
#### 4.1.6 Checking model fit
Assessing the quality of a logistic regression model is crucial.
* **Pseudo R² values:** Similar to R² in linear regression, Cox and Snell R² and Nagelkerke R² are reported to indicate the model's explanatory power [35](#page=35).
* **Hosmer–Lemeshow test:** A significant p-value ($p < 0.05$) from this test indicates that the model is not a good fit for the data [35](#page=35).
> **Tip:** For categorical predictor variables, if the 95% confidence interval of the OR contains 1, the association is considered not statistically significant [33](#page=33).
### 4.2 Multiple logistic regression
Multiple logistic regression extends simple logistic regression by including more than one predictor variable simultaneously. This allows for the assessment of the association between each predictor and the outcome while controlling for the effects of the other variables in the model [31](#page=31) [36](#page=36).
#### 4.2.1 Types of variables
* **Dependent variable (outcome, y):** One binary variable [36](#page=36).
* **Independent variables (predictors):** Multiple variables, which can be numerical, ordinal, or categorical [36](#page=36).
#### 4.2.2 The regression equation
The equation for multiple logistic regression is an extension of the simple model:
$$ \log\left(\frac{p}{1-p}\right) = b_0 + b_1 x_1 + b_2 x_2 + \dots $$ [36](#page=36).
Where $b_1, b_2, \dots$ are the regression coefficients for predictor variables $x_1, x_2, \dots$ respectively. These coefficients are back-transformed to yield odds ratios, which are interpreted similarly to those in simple logistic regression [36](#page=36).
#### 4.2.3 Crude versus adjusted odds ratios
* **Crude (unadjusted) odds ratios:** These result from simple logistic regression and measure the association between two variables without accounting for any other factors [36](#page=36).
* **Adjusted odds ratios:** These result from multiple logistic regression and measure the association between a predictor and the outcome while controlling for other variables included in the model [36](#page=36).
> **Tip:** Reporting both unadjusted and adjusted ORs can be very informative. A significant change in an OR after adjustment might suggest confounding or effect modification [36](#page=36).
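The sketch below illustrates the contrast with simulated data (assuming statsmodels): age and BMI are generated to drive both hypertension and diabetes, so the crude OR for hypertension is inflated relative to the adjusted OR.

```python
# Minimal sketch: crude OR (simple model) vs adjusted OR (multiple model).
import numpy as np
import pandas as pd
import statsmodels.api as sm

rng = np.random.default_rng(5)
n = 1000
age = rng.uniform(30, 70, n)
bmi = rng.normal(27, 4, n)
hypertension = rng.binomial(1, 1 / (1 + np.exp(-(-8 + 0.08 * age + 0.10 * bmi))))
diabetes     = rng.binomial(1, 1 / (1 + np.exp(-(-9 + 0.05 * age + 0.15 * bmi))))

df = pd.DataFrame({"diabetes": diabetes, "hypertension": hypertension,
                   "age": age, "bmi": bmi})

crude = sm.Logit(df["diabetes"],
                 sm.add_constant(df[["hypertension"]])).fit(disp=0)
adjusted = sm.Logit(df["diabetes"],
                    sm.add_constant(df[["hypertension", "age", "bmi"]])).fit(disp=0)

print(f"crude OR    = {np.exp(crude.params['hypertension']):.2f}")
print(f"adjusted OR = {np.exp(adjusted.params['hypertension']):.2f}")
```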
#### 4.2.4 What to report from regression output
When reporting logistic regression results, it is essential to include:
1. The ORs (unadjusted) from simple regression [37](#page=37).
2. The ORs (adjusted) from multiple regression [37](#page=37).
3. The 95% CI for the adjusted ORs [37](#page=37).
4. The p-value for the significance of the association [37](#page=37).
Sometimes, unadjusted ORs are also reported with their corresponding 95% CI and p-value [37](#page=37).
> **Example:** In a study on diabetes, hypertension showed a significant association (OR=2.30) in simple regression. However, in multiple regression, after adjusting for Age and BMI, the OR for hypertension dropped to 1.08 and was no longer statistically significant (95% CI 0.87, 1.33; p=0.500). This suggests that the initial observed effect of hypertension might have been partly due to its association with BMI and age [37](#page=37).
---
## Common mistakes to avoid
- Not reviewing all topics thoroughly before exams
- Overlooking formulas and key definitions
- Skipping the worked examples provided in each section
- Memorizing without understanding the underlying concepts
## Glossary
| Term | Definition |
|------|------------|
| Correlation coefficient | A statistical measure that quantifies the strength and direction of the linear relationship between two variables. It ranges from -1 to 1. |
| Pearson's correlation (r) | A statistical measure that quantifies the strength and direction of a linear relationship between two continuous variables that are normally distributed. |
| Spearman's correlation (ρ) | A non-parametric statistical measure used to assess the strength and direction of a monotonic relationship between two ranked variables or two continuous variables that do not meet the normality assumption. |
| Scatterplot | A graphical representation used to display the relationship between two continuous variables, where each point represents a pair of values for the variables. |
| Coefficient of determination (R²) | A statistical measure that represents the proportion of the variance in the dependent variable that is predictable from the independent variable(s). It is the square of the correlation coefficient in simple linear regression. |
| Regression | A statistical method used to study and quantify the relationship between a dependent variable and one or more independent variables. |
| Dependent variable (Y) | The outcome variable that is being predicted or explained by the independent variable(s). Also known as the response variable. |
| Independent variable (X) | The variable that is used to predict or explain the dependent variable. Also known as the predictor or explanatory variable. |
| Linear regression | A statistical method used when the outcome variable is continuous, aiming to model the relationship between variables using a linear equation. |
| Logistic regression | A statistical method used when the outcome variable is binary (dichotomous), modeling the probability of the outcome occurring based on one or more predictor variables. |
| Intercept (b₀) | In a regression equation, the value of the dependent variable when the independent variable(s) are all zero. It represents the baseline value. |
| Slope (b₁) | In a regression equation, the rate of change in the dependent variable for a one-unit increase in the independent variable. |
| Residual | The difference between an observed value of the dependent variable and its predicted value from the regression model. It represents the error or unexplained variation. |
| Multicollinearity | A phenomenon in multiple regression where two or more independent variables are highly correlated with each other, potentially affecting the stability and interpretation of the regression coefficients. |
| Odds Ratio (OR) | A measure of association used in logistic regression that quantifies how much the odds of an outcome occurring change for a unit change in a predictor variable. |
| Adjusted Odds Ratio | An odds ratio calculated from a multiple logistic regression model, which represents the association between an exposure and an outcome after controlling for the effects of other variables in the model. |
| Crude Odds Ratio | An odds ratio calculated from a simple logistic regression model, representing the association between an exposure and an outcome without adjusting for any other variables. |