# Comparison of quantitative methods
This section compares various quantitative analysis techniques, focusing on how variable types dictate the choice between linear regression, logistic regression, and ANOVA [2](#page=2).
### 1.1 Key considerations for choosing analysis techniques
The selection of appropriate quantitative analysis techniques is critically dependent on the nature of the dependent and independent variables involved. A fundamental distinction is made between metric variables, which represent quantities, and categorical variables, which represent groups [2](#page=2).
### 1.2 Regression analysis: linear vs. logistic
#### 1.2.1 Linear regression
Linear regression is suitable when the dependent variable is metric. The independent variables in a linear regression can be either metric or categorical. However, if categorical independent variables are used, they must be appropriately coded, typically as dummy variables [2](#page=2).
#### 1.2.2 Logistic regression
Logistic regression is the preferred method when the dependent variable is categorical [2](#page=2).
* **Binary logistic regression:** This is used when the dependent variable is nominal and represents two distinct groups (e.g., a dummy variable coded as 0 or 1) [2](#page=2).
* **Multinomial logistic regression:** This is applied when the dependent variable has more than two categories. The course will focus on binary logistic regression, with multinomial logistic regression being beyond its scope [2](#page=2) [3](#page=3).
Similar to linear regression, the independent variables in logistic regression can be metric or categorical, provided the categorical variables are encoded as dummy variables [3](#page=3).
> **Tip:** Categorical variables must always be converted to dummy (binary: 1/0) variables for use in regression analyses [2](#page=2).
#### 1.2.3 Variable types in regression
| Analysis Technique | Dependent variable (Y) | Independent variable (X) |
| :----------------- | :--------------------- | :----------------------- |
| Linear regression | Metric | Metric or categorical* |
| Logistic regression (Binary) | Nominal (2 groups) | Metric or categorical* |
| Logistic regression (Multinomial) | Nominal (>2 groups) | Metric or categorical* |
*Note: Categorical variables must be converted to dummy (binary: 1/0) variables [2](#page=2).
### 1.3 Analysis of Variance (ANOVA)
Analysis of Variance (ANOVA) is another technique discussed, which also requires a metric dependent variable. The key distinction for ANOVA lies in its typical independent variable, which is usually categorical. ANOVA is frequently employed in experimental research, where the different experimental groups serve as the independent variable [4](#page=4).
> **Tip:** Understanding the measurement level of your variables is paramount for selecting the correct analysis technique, especially in your own research such as a master's thesis [4](#page=4).
#### 1.3.1 Variable types in ANOVA
| Analysis Technique | Dependent variable (Y) | Independent variable (X) |
| :----------------- | :--------------------- | :----------------------- |
| ANOVA | Metric | Categorical |
### 1.4 General principle of variable measurement
The type of dependent and independent variables is a primary determinant of which analysis techniques can be applied. It is often advantageous to measure variables in ways that result in metric data whenever possible, as this typically offers a wider range of analytical possibilities. Exercises are available to help clarify the relationship between variable measurement and analysis technique selection [4](#page=4).
---
# Research example: Pizza restaurant customer satisfaction
This section details a research study investigating factors influencing customer satisfaction in a pizza restaurant, defining its dependent, independent, and control variables, and outlining the measurement methodology using a questionnaire [6](#page=6) [9](#page=9).
### 2.1 Study objectives and factors
The primary objective of this research is to determine the extent to which various factors influence overall customer satisfaction at a pizza restaurant. Based on discussions with employees, five specific factors were identified as potentially playing a significant role [6](#page=6):
* Reception
* Service
* Waiting time
* Food quality
* Price
It is hypothesized that increased satisfaction with each of these factors will contribute to higher overall satisfaction with the restaurant [6](#page=6).
### 2.2 Measurement of variables
Customer satisfaction was measured using a questionnaire [7](#page=7).
#### 2.2.1 Overall satisfaction
Overall satisfaction with the pizza restaurant was measured using a 7-point Likert scale, where a higher score indicates greater satisfaction [7](#page=7).
#### 2.2.2 Factor-specific satisfaction
Satisfaction with each of the five identified factors (reception, service, waiting time, food quality, and price) was also assessed using the same 7-point Likert scale. The scale ranges from 1 (very dissatisfied) to 7 (very satisfied), with 4 representing a neutral position (neither dissatisfied nor satisfied) [7](#page=7).
#### 2.2.3 Demographic variables
In addition to satisfaction measures, demographic variables were collected. These included [8](#page=8):
* **Gender:** Collected as a categorical variable (Man or Woman) [8](#page=8).
* **Age:** Collected as a categorical variable, with respondents indicating their age range rather than their exact age. The categories were < 20, 20-35, 36-50, and > 50 [8](#page=8).
A total of 107 customers completed the questionnaire, resulting in 107 observations for the study. The data is available in a STATA datafile named "Pizza\_incl\_dummies.dta" [8](#page=8).
### 2.3 Variable identification
#### 2.3.1 Dependent variable
The dependent variable is the primary outcome that the research aims to explain. In this study, the dependent variable is **overall customer satisfaction**, referred to as "satisfaction" in the dataset. Although Likert scale variables are strictly ordinal (categorical), they are commonly treated as metric variables because their response options are formulated as a series of roughly equally spaced steps [9](#page=9).
#### 2.3.2 Independent variables
Independent variables are those hypothesized to influence the dependent variable. In this research, the independent variables are the five factors being investigated for their effect on overall satisfaction [9](#page=9):
* Reception
* Service
* Waiting time
* Food quality
* Price
These variables were measured using 7-point Likert scales and are thus considered metric variables for analysis [9](#page=9).
#### 2.3.3 Control variables
Control variables are included in the model because they are known or suspected to influence the dependent variable, but the primary interest of the study is not in their specific effects. Including control variables helps to avoid omitted variable bias, which could distort the estimates of the independent variables' effects [9](#page=9).
In this study, the control variables are:
* Age
* Gender
Both age and gender are categorical variables [9](#page=9).
> **Tip:** The distinction between independent and control variables is primarily conceptual. While theoretical hypotheses are typically formulated for independent variables, this is not usually the case for control variables. However, both types are entered as independent variables in the statistical analysis [10](#page=10).
### 2.4 Analytical approach
Since the dependent variable (overall satisfaction) is metric, linear regression analysis is deemed appropriate for this study. As age and gender are categorical variables, they must first be encoded into dummy variables before being included as independent variables in the linear regression model [10](#page=10).
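As a minimal sketch, the corresponding STATA command could look as follows. The variable names (`satisfaction`, `reception`, `service`, `waiting`, `food`, `price`, and the dummies `male`, `age1`–`age3`) are assumptions for illustration; the names in `Pizza_incl_dummies.dta` may differ.

```stata
* Load the pizza dataset (file name from the study; variable names assumed)
use "Pizza_incl_dummies.dta", clear

* Linear regression of overall satisfaction on the five factors,
* with dummy-coded gender and age as control variables
regress satisfaction reception service waiting food price male age1 age2 age3
```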
---
# Dummy variable creation and usage in STATA
This section details the process of transforming categorical variables into dummy (or indicator) variables within STATA for use in statistical analysis, covering manual creation, automatic generation, and the `i.` prefix.
### 3.1 Introduction to dummy variables
Dummy variables, also known as indicator variables or 0/1 variables, are essential for representing categorical independent and control variables in regression analyses. These variables take a value of 1 if a specific category is present and 0 otherwise [11](#page=11).
### 3.2 The dummy variable trap
A critical principle when creating dummy variables is to include one fewer dummy variable than the number of categories in the original variable. Including as many dummy variables as there are categories, in a model with an intercept, produces perfect multicollinearity, known as the "dummy variable trap": one dummy variable can then be predicted exactly from the intercept and the others, making the model impossible to estimate [11](#page=11) [12](#page=12).
* **Rule:** Number of dummy variables = Number of categories - 1 [11](#page=11).
### 3.3 Creating dummy variables manually and with `tabulate, generate`
STATA offers straightforward methods for creating dummy variables.
#### 3.3.1 Manual recoding
Categorical variables that are not already in a 0/1 format, such as a 'gender' variable coded as 1 for male and 2 for female, must be recoded. For a binary variable like gender, two dummy variables could conceptually be created: one for 'male' (1 if male, 0 otherwise) and one for 'female' (1 if female, 0 otherwise). However, as per the dummy variable trap rule, only one of these should be included in a regression model [11](#page=11).
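A minimal sketch of such manual recoding, assuming `gender` is coded 1 = male and 2 = female:

```stata
* Male dummy: 1 if gender == 1, 0 if gender == 2, missing if gender is missing
generate male = (gender == 1) if !missing(gender)

* Female dummy (only one of the two belongs in the regression model)
generate female = (gender == 2) if !missing(gender)
```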
#### 3.3.2 Using the `tabulate, generate` command
The `tabulate` command with the `generate` option provides an automated way to create dummy variables [13](#page=13).
* **Functionality:** The `tabulate` command displays the frequencies of categories within a variable. When combined with the `generate` option, it automatically creates a new dummy variable for each category of the specified variable [13](#page=13).
* **Example:** For a 'gender' variable with categories 'male' and 'female', `tabulate gender, generate(gender)` creates two new variables, `gender1` and `gender2`. `gender1` will be 1 if the respondent is male and 0 otherwise, while `gender2` will be 1 if the respondent is female and 0 otherwise [14](#page=14).
* **Example with multiple categories:** If the 'age' variable has four categories, `tabulate age, generate(age)` creates four dummy variables: `age1`, `age2`, `age3`, and `age4` [15](#page=15).
> **Tip:** While `tabulate, generate` is convenient, remember that you will typically need to exclude one of the generated dummy variables from your regression model to avoid the dummy variable trap.
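A short sketch of this workflow with the assumed pizza variable names, dropping one dummy per categorical variable in the regression:

```stata
* Create one dummy per category (the stub names are chosen by the user)
tabulate gender, generate(gender)   // creates gender1 and gender2
tabulate age, generate(age)         // creates age1 through age4

* Keep one category per variable out of the model as the reference group
regress satisfaction reception service waiting food price gender1 age1 age2 age3
```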
### 3.4 Using the `i.` prefix in STATA
STATA provides a more integrated approach to handle categorical variables without explicit dummy variable creation in the dataset using the `i.` prefix [16](#page=16).
* **Functionality:** By preceding a variable name with `i.` (e.g., `i.age` or `i.gender`) in most STATA estimation commands (such as `regress`), you instruct STATA to treat that variable as categorical. STATA will then automatically generate and include the correct number of dummy variables in the analysis, inherently handling the dummy variable trap by omitting one category as a reference group [16](#page=16).
* **Advantages:** This method simplifies the process as it removes the need to manually create and manage dummy variables in the data editor, streamlining the analysis workflow.
> **Tip:** Using the `i.` prefix is generally the preferred method for incorporating categorical variables in STATA regressions, as it is less prone to user error regarding the dummy variable trap.
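For example, the pizza regression with factor-variable notation (variable names assumed as before):

```stata
* STATA builds the dummies internally and omits one reference category per variable
regress satisfaction reception service waiting food price i.gender i.age
```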
---
# Linear regression assumptions and testing
Performing a linear regression analysis requires several key assumptions to be met for the results to be valid and interpretable. This section outlines these nine assumptions, their importance, and methods for testing them [20](#page=20).
### 4.1 The nine assumptions of linear regression
The following assumptions are crucial for a robust linear regression analysis [20](#page=20):
1. **Causality:** The independent variables (and control variables) must influence the dependent variable, not the other way around [21](#page=21).
2. **Inclusion of all relevant variables:** All important variables that could affect the dependent variable must be included in the model to avoid omitted variable bias [22](#page=22).
3. **Metric dependent variable:** The dependent variable must be measured on a metric scale [24](#page=24).
4. **Linear relationship:** There should be a linear relationship between each independent variable and the dependent variable [25](#page=25).
5. **Additive relationship:** The relationship between the dependent and independent variables should be additive, meaning the predicted dependent variable is the sum of the weighted independent variables [26](#page=26).
6. **Residual properties:** Residuals must be independent, normally distributed, homoscedastic (constant variance), and without autocorrelation [29](#page=29).
7. **Sufficient observations:** An adequate number of observations must be available relative to the number of estimated parameters [44](#page=44).
8. **No multicollinearity:** There should be no perfect or high linear relationship between independent variables [45](#page=45).
9. **No extreme values:** The dataset should not contain extreme observations or outliers that unduly influence parameter estimates [49](#page=49).
### 4.2 Detailed examination and testing of assumptions
#### 4.2.1 Causality
The causality assumption posits that independent variables directly influence the dependent variable. This is challenging to prove, especially when data is collected simultaneously for all variables, as reverse causality or consistency bias (respondents aligning answers to appear consistent) may occur. In practice, this assumption is often made based on theoretical grounds or the experimental design of data collection [21](#page=21).
#### 4.2.2 Inclusion of all relevant variables
Omitted variable bias occurs when crucial independent variables are excluded from the model, potentially leading to biased parameter estimates. While difficult to test directly, systematic patterns in residual plots can indicate the presence of omitted variables. The `rvfplot` command in STATA can be used to visualize these patterns [22](#page=22) [23](#page=23).
> **Tip:** Rely on intuition, existing theory, and academic literature to identify potentially relevant variables to include in the model [22](#page=22).
#### 4.2.3 Metric dependent variable
This assumption requires the dependent variable to be measured on a continuous or interval scale. Likert scales, often treated as metric in practice, satisfy this assumption if the context allows for such an interpretation [24](#page=24).
#### 4.2.4 Linear relationship
Linear regression models assume a linear association between each independent variable and the dependent variable. Scatter plots can be used to visually inspect these relationships [25](#page=25).
#### 4.2.5 Additive relationship
The model assumes that the effects of independent variables on the dependent variable are additive. This means the overall predicted outcome is the sum of the weighted contributions of each independent variable, rather than multiplicative interactions, unless interaction terms are explicitly included [26](#page=26).
#### 4.2.6 Residual properties
This is a composite assumption encompassing four key aspects of the residual terms [29](#page=29):
##### 4.2.6.1 Independence of residuals
Residuals should be independent of each other. This is typically met when observations are collected independently. Autocorrelation can be an issue, particularly in time-series data or with clustered cross-sectional data [29](#page=29) [40](#page=40).
##### 4.2.6.2 Normality of residuals
Residuals are expected to follow a normal distribution. This can be tested graphically using histograms and kernel density plots, and statistically using tests like the Shapiro-Wilk test and skewness/kurtosis tests [30](#page=30) [31](#page=31).
* **Testing in STATA:**
1. Store the residuals after running the regression: `predict res, residuals`
2. Visualize distribution: `histogram res` or `kdensity res, normal`
3. Statistical tests: `sktest res` (skewness/kurtosis test) or `swilk res` (Shapiro-Wilk test) [30](#page=30) [31](#page=31).
* **Remedies for non-normality:**
* Investigate and fix underlying model problems (e.g., omitted variables, wrong functional form) [32](#page=32).
* Transform the dependent variable (e.g., logarithm, square root) to achieve a more normal distribution, keeping in mind this alters interpretation [32](#page=32).
* For large sample sizes, violations of normality are often less critical as parameters and standard errors are less affected [33](#page=33).
##### 4.2.6.3 Homoscedasticity (constant variance of residuals)
The variance of the residuals should be constant across all levels of the independent variables. Heteroscedasticity occurs when this variance changes, often appearing as a "fanned-out" pattern in residual plots [34](#page=34) [35](#page=35).
* **Testing for heteroscedasticity:**
* Graphical inspection using `rvfplot` for patterns in residual spread [36](#page=36).
* Statistical tests such as the Breusch-Pagan and White tests. The null hypothesis for these tests is homoscedasticity [37](#page=37).
* **Remedies for heteroscedasticity:**
* Identify and correct underlying model issues (e.g., omitted variables, wrong functional form) [38](#page=38).
* Transform the dependent variable, but be mindful of interpretation changes [38](#page=38).
* Use robust standard errors, which can account for heteroscedasticity even in small samples [38](#page=38).
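A minimal STATA sketch of these checks, run after fitting the (non-robust) regression of interest; the last line shows the robust-standard-error remedy with the assumed pizza model:

```stata
* Graphical check: residuals versus fitted values
rvfplot

* Breusch-Pagan test (H0: constant residual variance)
estat hettest

* White test, obtained via the information matrix test
estat imtest, white

* Remedy: re-estimate with heteroscedasticity-robust standard errors
regress satisfaction reception service waiting food price i.gender i.age, vce(robust)
```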
##### 4.2.6.4 Absence of autocorrelation
Autocorrelation refers to the correlation between residual terms, typically observed in time-series data where observations from one period are correlated with those from previous periods. It can also occur in cross-sectional data with clustered observations [39](#page=39) [40](#page=40).
#### 4.2.7 Sufficient observations
A general guideline suggests having at least 20 observations per estimated parameter, though a minimum of 5 might suffice in some cases. A larger sample size, such as hundreds or thousands of observations, generally strengthens the reliability of the regression results [33](#page=33) [44](#page=44) [53](#page=53).
#### 4.2.8 No multicollinearity
Multicollinearity arises when independent variables are highly correlated with each other [44](#page=44).
* **Perfect multicollinearity:** Occurs when one independent variable can be perfectly predicted by a linear combination of others. STATA automatically handles this by removing one of the perfectly correlated variables [45](#page=45).
* **Imperfect multicollinearity:** Occurs when there is a high, but not perfect, correlation between an independent variable and a combination of others. This doesn't prevent estimation but inflates standard errors, reducing parameter estimate precision [45](#page=45).
* **Testing for multicollinearity:**
* **Pairwise correlation coefficients:** Examine correlations between all pairs of independent variables; values above 0.8 (or sometimes 0.6-0.7) may indicate a problem [46](#page=46).
* **Variance Inflation Factors (VIFs):** For each independent variable, VIF measures how well other independent variables explain it. VIFs above 10 are considered severe, while values between 5 and 10 indicate substantial multicollinearity. Tolerance (1/VIF) below 0.10 (or 0.5) also signals multicollinearity [47](#page=47).
* **Remedies for multicollinearity:**
* Remove one of the highly correlated independent variables [48](#page=48).
* Merge correlated variables (e.g., by calculating an average score if conceptually meaningful) [48](#page=48).
* Utilize factor scores from a factor analysis if appropriate [48](#page=48).
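The detection checks described above can be run in STATA roughly as follows (variable names assumed):

```stata
* Pairwise correlations among the independent variables
pwcorr reception service waiting food price, sig

* Variance inflation factors (tolerance = 1/VIF) after the fitted model
regress satisfaction reception service waiting food price gender1 age1 age2 age3
estat vif
```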
#### 4.2.9 No extreme values (outliers)
Extreme observations, or outliers, are data points that disproportionately influence parameter estimates [49](#page=49).
* **Detection methods:**
* **Graphical inspection:** The `rvfplot` can reveal observations that deviate significantly from the main cluster of data [50](#page=50).
* **Dfbeta:** This statistic measures how much a parameter estimate changes when a specific observation is excluded. Values with an absolute value greater than 1 are often considered influential [51](#page=51).
* **Studentized residuals:** These indicate how much a residual term deviates from others. Observations with studentized residuals less than -3 or greater than 3 are flagged as potential outliers [52](#page=52).
* **Handling extreme values:**
* Investigate the cause of the outlier; if it's a data entry error, it should be corrected or removed. If it's a genuine but extreme measurement, careful consideration and transparent reporting are crucial [53](#page=53).
* The impact of outliers is more problematic in small samples. In large samples, their influence is typically diminished [53](#page=53).
* Ideally, the regression results should not change drastically if outliers are removed, indicating robustness [53](#page=53).
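A sketch of these diagnostics in STATA, run after the regression; the cutoffs follow the rules of thumb above:

```stata
* Residual-versus-fitted plot for a visual check
rvfplot

* DFBETAs: one _dfbeta_* variable per regressor; inspect values with |dfbeta| > 1
dfbeta
summarize _dfbeta_*

* Studentized residuals: flag observations outside [-3, 3]
predict rstu, rstudent
list if abs(rstu) > 3 & !missing(rstu)
```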
---
# Interpreting linear regression results
This section details how to interpret the output of a linear regression analysis, focusing on model fit and the significance and magnitude of independent variable coefficients.
### 5.1 Overview of the interpretation process
The interpretation of linear regression results involves a three-step process: first, checking assumptions; second, assessing the overall meaningfulness or "model fit"; and third, interpreting the coefficients of individual independent variables. This guide focuses on steps two and three, assuming assumptions have been met or corrected. The analysis for this section is based on a dataset with 105 observations after excluding two outliers [60](#page=60) [61](#page=61) [65](#page=65).
### 5.2 Checking model meaningfulness (model fit)
To determine if the estimated model as a whole is meaningful and explains a substantial portion of the variance in the dependent variable, two key metrics are examined: the F-statistic and R-squared.
#### 5.2.1 F-statistic
The F-statistic tests the null hypothesis that all regression coefficients (parameters of all independent variables) are not significantly different from zero ($H_0: b_1 = b_2 = \dots = 0$) [62](#page=62).
* **Interpretation:** If the p-value associated with the F-statistic is less than 0.05, the null hypothesis is rejected. This indicates that at least one independent variable has a significant influence on the dependent variable, meaning the model is statistically significant [62](#page=62).
#### 5.2.2 R-squared ($R^2$)
R-squared represents the proportion of the variance in the dependent variable (Y) that is explained by the model. It ranges from 0 to 1, where 0 indicates no variance explained and 1 indicates 100% of the variance explained [63](#page=63).
* **Interpretation:** A higher R-squared value indicates a better fit of the model to the data. For instance, an $R^2$ of 0.7556 means that 75.56% of the variance in overall satisfaction is explained by the model [63](#page=63).
* **Disadvantage:** R-squared tends to increase (or stay the same) when additional independent variables are added, regardless of their actual contribution. This can lead to overly complex models that violate the KISS (Keep It Simple, Stupid) principle [63](#page=63).
#### 5.2.3 Adjusted R-squared (Adj. $R^2$)
Adjusted R-squared is a modified version of R-squared that penalizes the addition of unnecessary independent variables, making it useful for comparing models with different numbers of predictors [63](#page=63) [64](#page=64).
* **Interpretation:** It is used to compare models: if including a variable increases adjusted R-squared, it suggests the variable adds substantial explanatory power beyond its complexity. If adjusted R-squared decreases, the variable's contribution does not compensate for the added complexity. The adjusted R-squared of a single model cannot be interpreted in isolation [64](#page=64).
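For reference, the standard formula (not shown in the source) with $n$ observations and $k$ independent variables is $R^2_{\text{adj}} = 1 - (1 - R^2)\frac{n - 1}{n - k - 1}$: adding a variable raises adjusted R-squared only if its explanatory contribution outweighs the loss of a degree of freedom.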
> **Tip:** The F-statistic and R-squared are crucial for assessing the overall meaningfulness of the model. A significant F-statistic and a sufficiently high R-squared indicate that the model is a good fit and can proceed to coefficient interpretation [65](#page=65).
### 5.3 Interpreting coefficients of independent variables
After establishing the model's meaningfulness, the next step is to interpret the coefficients of each independent variable. This involves assessing their statistical significance and their practical meaning.
#### 5.3.1 T-statistics and p-values
The t-statistic and its associated p-value are used to test the null hypothesis that an individual coefficient is not significantly different from zero ($H_0: b_i = 0$) [66](#page=66).
* **Interpretation:**
* If the p-value is less than 0.05, the null hypothesis is rejected. This means the independent variable has a statistically significant effect on the dependent variable, and its estimated coefficient is significantly different from zero [66](#page=66).
* If the p-value is higher than 0.05, the null hypothesis cannot be rejected. This indicates that the estimated coefficient does not differ significantly from zero, and the independent variable has no statistically significant effect on the dependent variable [66](#page=66).
> **Example:** If a p-value for "waiting time" is greater than 0.05, its coefficient should not be interpreted as having a real effect, as it's not significantly different from zero [67](#page=67).
#### 5.3.2 Ceteris Paribus interpretation
For independent variables with statistically significant coefficients, their estimated coefficients can be interpreted under the "ceteris paribus" assumption [68](#page=68).
* **Interpretation:** Ceteris paribus means "all other things being equal." If the independent variable (X) increases by one unit, the dependent variable (Y) changes by the value of its coefficient ($b_i$), assuming all other independent variables in the model remain constant [68](#page=68) [69](#page=69).
> **Example:** If the coefficient for "reception" is 0.176 and is significant, it means that a one-unit increase in satisfaction with reception leads to a 0.176 increase in overall satisfaction, assuming other factors do not change [69](#page=69).
#### 5.3.3 Interpretation of dummy variables
Dummy variables are used to represent categorical independent variables. Their interpretation is always made in relation to a reference or base category, for which no dummy variable is included in the model [70](#page=70).
* **Interpretation:** The coefficient of a dummy variable indicates the difference in the dependent variable between that category and the reference category, holding other variables constant [71](#page=71).
> **Example:** If "male" is a dummy variable with a coefficient of -0.393 and "women" is the reference category, it means that overall satisfaction for men is 0.393 lower than for women, ceteris paribus [71](#page=71).
> **Example:** When interpreting age categories, if "older than 50" is the reference category, a coefficient of 1.198 for the "less than 20" dummy variable means that respondents younger than 20 have 1.198 higher overall satisfaction compared to those over 50, ceteris paribus [72](#page=72).
> **Tip:** To compare the satisfaction levels between different subgroups (e.g., comparing different age groups directly), you may need to change the reference category in your model [73](#page=73).
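With factor-variable notation this only requires changing the base category. As a sketch (variable names assumed), `ib(last).age` makes the highest age category, "> 50", the reference group:

```stata
* Re-run the model with the oldest age group as the reference category
regress satisfaction reception service waiting food price i.gender ib(last).age
```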
#### 5.3.4 Interpretation with transformed variables
If dependent or independent variables have been transformed (e.g., using logarithms), the interpretation of coefficients must reflect this transformation. For example, if $\log(Y)$ was used, coefficients are interpreted in relation to $\log(Y)$, not $Y$ [70](#page=70).
#### 5.3.5 Beta coefficients (standardized coefficients)
Beta coefficients (standardized coefficients) are the coefficients obtained when both the dependent and independent variables are standardized (each variable's mean is subtracted and the result is divided by its standard deviation) before the regression is run [75](#page=75).
* **Purpose:** Beta values are used to compare the relative strength or importance of different independent variables in their impact on the dependent variable, as they are unit-independent [75](#page=75).
* **Interpretation:** The independent variable with the largest absolute beta coefficient has the strongest impact on the dependent variable, while the one with the smallest absolute beta coefficient has the weakest impact [75](#page=75).
> **Example:** If the beta value for "age" is the highest among all independent variables, it indicates that age has the strongest effect on overall satisfaction. Similarly, if "service" has a higher beta than "reception," service has a greater impact [75](#page=75).
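In STATA, standardized coefficients can be requested with the `beta` option of `regress`; a sketch with the assumed pizza variables:

```stata
* Report standardized (beta) coefficients in place of the confidence intervals
regress satisfaction reception service waiting food price gender1 age1 age2 age3, beta
```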
---
# Model comparison for non-linear relationships
This section details how to test for non-linear relationships, specifically quadratic ones, by comparing a linear model to one that includes a squared term, using statistical tests to assess significant model improvement [78](#page=78).
### 6.1 The rationale for testing non-linear relationships
Theoretical considerations can suggest that the relationship between dependent and independent variables is not strictly linear. For instance, the relationship between purchase intention and price might be quadratic. This means that at very low prices, purchase intention could be low due to a lack of perceived quality or trust. As the price increases, purchase intention may rise up to a certain point (a peak). Beyond this peak, if the price continues to rise, purchase intention might decrease as the product becomes perceived as too expensive [78](#page=78).
### 6.2 Model comparison approach
To test for such non-linear, specifically quadratic, relationships, a model comparison method is employed. This approach systematically compares two models: a "restricted model" and a "full model" [79](#page=79).
#### 6.2.1 Restricted versus full model
* **Restricted model:** This is a simpler model where one or more parameters are constrained to be zero. In the context of testing for a quadratic relationship, the restricted model would be a linear model that does not include the squared term of the independent variable [79](#page=79).
* **Full model:** This is a more complex model that includes all the parameters of interest, including those potentially constrained in the restricted model. For a quadratic relationship test, the full model would include both the linear and the squared term of the independent variable [79](#page=79).
#### 6.2.2 Steps in model comparison
The model comparison process involves several key steps:
**Step 1: Create new variables**
First, create a new variable that represents the non-linear term. In the example of a quadratic relationship with price, this involves calculating "price squared" [79](#page=79) [80](#page=80).
**Step 2: Run linear regression models**
Perform two separate linear regression analyses:
* One regression for the restricted model (without the quadratic term) [80](#page=80).
* Another regression for the full model (with the quadratic term) [80](#page=80).
> **Tip:** It is crucial to verify that all assumptions of linear regression are met for both models [81](#page=81).
#### 6.2.3 Addressing multicollinearity
A common issue that arises when adding squared terms is multicollinearity, where the independent variables are highly correlated. The "price" and "price squared" variables, for instance, will inherently have a strong linear relationship. High variance inflation factors (VIFs) are indicators of this problem [81](#page=81).
**Solution: Standardization**
A common technique to mitigate multicollinearity is to standardize the variables. This involves transforming the original variables into z-scores [82](#page=82).
* Calculate the standardized price variable ($z_{\text{price}}$) [82](#page=82).
* Then, calculate the square of this standardized price ($z_{\text{price}}^2$) [82](#page=82).
* Finally, run the linear regression models using these standardized variables ($z_{\text{price}}$ and $z_{\text{price}}^2$). This standardization often reduces VIFs to acceptable levels (e.g., below 10) [82](#page=82) [83](#page=83).
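A minimal sketch of these steps, assuming hypothetical variable names `intention` (purchase intention) and `price`:

```stata
* Standardize price to a z-score, then create its square
egen zprice = std(price)
generate zprice2 = zprice^2

* Restricted (linear) and full (quadratic) models
regress intention zprice
regress intention zprice zprice2
estat vif   // VIFs should now be at acceptable levels
```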
> **Tip:** While standardization helps with multicollinearity, it's still essential to check other regression assumptions for both the restricted and full models [83](#page=83).
**Step 3: Assess significant improvement in model fit**
The final step is to determine if adding the non-linear term (the squared variable) significantly improves the overall fit of the model [84](#page=84).
* **Using F-statistics and p-values:** This is typically done by examining changes in the F-statistic and its associated p-value. A significant improvement in model fit indicates that the non-linear term contributes meaningfully to explaining the dependent variable [84](#page=84).
* **Software implementation:** Commands like `nestreg` in statistical software (e.g., Stata) can perform incremental F-tests, showing the change in the F-statistic when a new variable is added to the model. The significance of this change is assessed by the p-value [84](#page=84).
> **Procedure to determine whether there is a nonlinear effect:**
> If the p-value for the change in the F-statistic is greater than 0.05, there is no significant improvement, and the relationship can be considered linear. If the p-value is less than 0.05, the improvement is significant, suggesting a non-linear relationship [84](#page=84).
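In STATA, this incremental F-test can be obtained with `nestreg`, using the same hypothetical variable names as above:

```stata
* Block 1: linear term only; block 2 adds the squared term
nestreg: regress intention (zprice) (zprice2)
```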
**Interpreting the results:**
* The focus is on the last row of the `nestreg` output, which indicates the change in the F-statistic upon adding the squared term ($z_{\text{price}}^2$) [84](#page=84).
* If the change in the F-statistic is small and its p-value is greater than 0.05, adding the squared term is not a significant improvement. This leads to the conclusion that the relationship between the independent and dependent variables is linear [84](#page=84).
* The change in $R^2$ also provides insight; a very low change when adding the squared term further supports the conclusion of no significant non-linear effect [84](#page=84).
* Additionally, the coefficient table for the full model can be examined. If the p-value for the t-test of the coefficient of the squared term is greater than 0.05, it indicates that the effect of the squared term is not statistically significant [85](#page=85).
> **Example:** If, after adding $z_{\text{price}}^2$ to a model, the change in the F-statistic is 0.07 and its p-value is greater than 0.05, then the addition of $z_{\text{price}}^2$ does not significantly improve the model fit, and the relationship is concluded to be linear. Similarly, if the p-value for the $z_{\text{price}}^2$ coefficient in the full model is, for instance, 0.25 (which is > 0.05), then there is no significant indication of a quadratic relationship [84](#page=84) [85](#page=85).
---
# Glossary
| Term | Definition |
|------|------------|
| Metric variables | Variables that represent quantities or numbers that can be used in calculations, such as height, weight, or temperature. |
| Categorical variables | Variables that represent groups or categories, such as gender (male/female) or educational attainment (high school, bachelor's, master's). |
| Linear regression | A statistical method used to model the relationship between a dependent variable and one or more independent variables, assuming a linear relationship. |
| Logistic regression | A statistical method used to model the probability of a binary outcome (e.g., yes/no, success/failure) based on one or more predictor variables. |
| Dependent variable | The variable that is being predicted or explained in a statistical model; its value is assumed to depend on the independent variables. |
| Independent variable | A variable that is manipulated or measured to observe its effect on the dependent variable; it is used to explain or predict the dependent variable. |
| Dummy variable | A binary variable (coded as 0 or 1) used to represent a categorical variable in a regression analysis. Each category of the original variable, except for a reference category, is represented by a separate dummy variable. |
| Binary logistic regression | A type of logistic regression used when the dependent variable has only two possible outcomes. |
| Multinomial logistic regression | A type of logistic regression used when the dependent variable has more than two possible outcomes that are not ordered. |
| Analysis of Variance (ANOVA) | A statistical technique used to compare the means of two or more groups by analyzing the variance within and between groups. |
| Likert scale | A psychometric scale commonly used in questionnaires to measure attitudes or opinions. Respondents indicate their level of agreement or disagreement on a scale, typically with a neutral midpoint. |
| Control variables | Variables included in a statistical model to account for their potential influence on the dependent variable, even if they are not the primary focus of the research. They help to isolate the effect of the independent variables and avoid omitted variable bias. |
| Omitted variable bias | A bias in the estimated coefficients of a regression model that occurs when a relevant variable that influences both an included independent variable and the dependent variable is not included in the model. |
| Dummy variable trap | A situation in regression analysis that occurs when a complete set of dummy variables representing a categorical variable is included in the model along with the intercept. This leads to perfect multicollinearity because one dummy variable can be perfectly predicted from the others and the intercept. |
| Multicollinearity | A phenomenon in regression analysis where two or more independent variables are highly correlated with each other, making it difficult to determine the independent effect of each variable on the dependent variable. |
| Variance Inflation Factor (VIF) | A measure used to detect the degree of multicollinearity in a regression analysis. A high VIF indicates that the variable is highly correlated with other independent variables in the model. |
| Residuals | The differences between the observed values of the dependent variable and the values predicted by the regression model. They represent the unexplained variation in the model. |
| Homoscedasticity | A statistical assumption in regression analysis where the variance of the residuals is constant across all levels of the independent variables. |
| Heteroscedasticity | The opposite of homoscedasticity, where the variance of the residuals is not constant across all levels of the independent variables, often appearing as a fanning-out pattern in residual plots. |
| Autocorrelation | A correlation between observations in a time series or clustered data, where an observation at one point in time or in one cluster is correlated with observations at other points in time or in the same cluster. |
| Outliers | Observations that are significantly different from other observations in the dataset and can disproportionately influence the results of a statistical analysis. |
| Studentized residuals | A type of residual that is standardized by dividing by an estimate of its standard deviation, making it easier to identify potential outliers. |
| Dfbeta | A diagnostic statistic in regression analysis that measures the influence of a single observation on the estimated coefficients of the model. |
| Model fit | An overall assessment of how well a statistical model represents the data, typically evaluated using metrics like R-squared and the F-statistic. |
| R-squared ($R^2$) | A statistical measure that represents the proportion of the variance in the dependent variable that is predictable from the independent variable(s). |
| Adjusted R-squared | A modified version of R-squared that adjusts for the number of predictors in the model, providing a more accurate measure of model fit when comparing models with different numbers of independent variables. |
| Ceteris paribus | A Latin phrase meaning "other things being equal." In regression analysis, it refers to interpreting the effect of one variable while holding all other variables constant. |
| Standardization | The process of transforming variables so that they have a mean of 0 and a standard deviation of 1. This is often done to compare coefficients of variables measured on different scales. |
| Quadratic term | A term in a regression model that represents the square of an independent variable, used to model a non-linear (curved) relationship. |
| Restricted model | In model comparison, the simpler model that has fewer parameters or constraints, often used as a baseline against which a more complex model is compared. |
| Full model | In model comparison, the more complex model that includes additional parameters or terms compared to the restricted model. |
| Nestreg command | A STATA command used to perform incremental F-tests for nested models, assessing whether adding a set of variables significantly improves the model's explanatory power. |