# Basic R operations and data manipulation
This section covers fundamental R operators, data storage mechanisms, and initial steps for loading data and packages in R.
### 1.1 Basic operators
R utilizes several operators for performing arithmetic operations and storing data.
#### 1.1.1 Arithmetic operators
* **Addition and Subtraction:** The `+` or `-` operators are used for adding or subtracting values [1](#page=1).
* **Multiplication:** The `*` operator is used for multiplication [1](#page=1).
* **Division:** The `/` operator is used for division [1](#page=1).
* **Exponentiation:** The `^` operator is used to raise a number to a power [1](#page=1).
#### 1.1.2 Data storage operator
* **Assignment:** The `<-` operator is used to store data or results into objects or variables [1](#page=1).
> **Example:** To store the text "apples" into an object named `c`, you would use `c <- "apples"` [1](#page=1).
> **Example:** To store a sequence of numbers (1, 2, 3) into an object named `c`, you would use `c <- c(1,2,3)` [1](#page=1).
#### 1.1.3 Comments
* The `#` symbol is used to ignore lines of code, effectively making them comments [1](#page=1).
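A minimal sketch combining these operators (the object names and values are illustrative):
```r
# Arithmetic
2 + 3           # addition       -> 5
10 / 4          # division       -> 2.5
2^3             # exponentiation -> 8

# Assignment stores a result in an object
result <- 2 * (4 + 1)   # result now holds 10

# This line is a comment and is ignored by R
```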
### 1.2 Vectors
Vectors are fundamental data structures in R that can hold multiple values, which can be either text (letters) or numbers [1](#page=1).
* **Creating vectors:** The `c()` function (which stands for combine) is used to create vectors by listing the values within the parentheses [1](#page=1).
* For text, values should be enclosed in double quotes (`""`) [1](#page=1).
* For numbers, no quotes are needed [1](#page=1).
> **Example:** To create a vector named `d` with the numeric values 3, 7, and 1, you would use `d <- c(3,7,1)` [1](#page=1).
* **Accessing vector elements:**
* To recall a specific value from a vector, you use square brackets `[]` with the index (position) of the element. For example, `d[2]` would recall the second data value from vector `d`, which is `7` [1](#page=1) [2](#page=2).
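A short sketch of vector creation and element access, reusing the example objects above:
```r
d <- c(3, 7, 1)                   # numeric vector; no quotes needed
fruits <- c("apples", "pears")    # character vector; values in double quotes
d[2]                              # recalls the second element -> 7
```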
### 1.3 Data frames and indexing
Data frames are tabular data structures in R. Specific elements within a data frame can be accessed using a specific indexing syntax.
* **Accessing elements:** The syntax `dataframe[row, column]` is used to recall a specific value from a table [1](#page=1) [2](#page=2).
> **Example:** `d[2,3]` would recall the element at the second row and third column of a data frame named `d` [1](#page=1) [2](#page=2).
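A brief sketch of `dataframe[row, column]` indexing, assuming `d` is a data frame with at least two rows and three columns:
```r
d[2, 3]    # value in the second row, third column
d[2, ]     # the entire second row
d[, 3]     # the entire third column
```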
### 1.4 Loading data and packages
R provides functions to load external data files and to utilize pre-written functionalities from packages.
#### 1.4.1 Loading data files
* **RData files:** The `load("filename.rdata")` command is used to load data stored in an RData file [1](#page=1).
* **CSV files:** The `read.csv("filename.csv", row.names = 1)` command is used to import data from a CSV file. The `row.names = 1` argument sets the first column of the file as row names [1](#page=1).
#### 1.4.2 Working directory management
* **Checking the current directory:** The `getwd()` function returns the path of the current working directory [2](#page=2).
* **Setting the working directory:** The `setwd("folder address")` command is used to change the current working directory to a specified folder address [2](#page=2).
#### 1.4.3 Package management
* **Installing packages:** The `install.packages("package name")` command is used to download and install a package from a repository [2](#page=2).
* **Loading packages:** The `library("package name")` command makes the functions and data within an installed package available for use in the current R session [2](#page=2).
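A combined sketch of loading data and packages; the folder address and file names are hypothetical, and `psych` is used here only because it appears later for `describeBy()`:
```r
getwd()                                              # show the current working directory
setwd("C:/path/to/project")                          # hypothetical folder address
load("experiment.rdata")                             # hypothetical RData file
data1 <- read.csv("experiment.csv", row.names = 1)   # first column becomes row names
install.packages("psych")                            # install once per machine
library("psych")                                     # load in every session that uses it
```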
### 1.5 Investigating and manipulating data
R offers various tools for examining and modifying the structure and content of data.
#### 1.5.1 Data examination
* **Subsetting data:** The `$` operator is used to extract a specific column from a data frame. For instance, `cars$mpg` retrieves the `mpg` column from the `cars` data frame [2](#page=2).
* **Recalling column names:** The `names(dataframe)` function returns a vector containing the names of all columns in the specified data frame [2](#page=2).
* **Viewing data subsets:**
* `head(dataframe)` displays the column names and the first 6 rows of a data frame [2](#page=2).
* `head(dataframe, n=x)` displays the column names and the first `x` rows of a data frame. This can be used to view all rows if `x` is set to the total number of rows using `nrow()` [2](#page=2).
* **Counting dimensions:**
* `nrow(dataframe)` returns the number of rows in a dataset [2](#page=2).
* `ncol(dataframe)` returns the number of columns in a dataset [2](#page=2).
* **Converting data types:** The `dataframe$column <- factor(dataframe$column)` command converts the data in a specified column into the factor class; the analogous functions `as.integer()` and `as.numeric()` convert data to the integer or numeric classes [2](#page=2).
* **Checking data class:** The `class(dataframe$column)` function is used to determine the data type (class) of a variable or column [2](#page=2).
* **Filtering data:** The `dataframe$column[which(dataframe$column == "category")]` syntax selects the values of a column that meet a certain condition or match a specific category [2](#page=2).
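A sketch of these examination tools, assuming a data frame `cars` with the `mpg` column mentioned above and a hypothetical categorical `cyl` column:
```r
names(cars)                       # column names
head(cars)                        # first 6 rows
head(cars, n = nrow(cars))        # all rows
nrow(cars); ncol(cars)            # dimensions
cars$mpg                          # extract one column
cars$cyl <- factor(cars$cyl)      # convert a column to a factor
class(cars$cyl)                   # check its class
cars$mpg[which(cars$cyl == "4")]  # mpg values for one category
```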
#### 1.5.2 Graphical presentation
* **Boxplots:** The `boxplot(dataframe$column)` command generates a boxplot for the data in a specified column.
> **Example:** `boxplot(mice$weight)` creates a boxplot visualizing the distribution of weights from the `mice` dataset [2](#page=2).
---
# Data exploration and graphical presentation
This section outlines essential methods for exploring datasets, including data subsetting, inspecting column names and dimensions, converting data types, and generating various graphical presentations such as boxplots and histograms.
### 2.1 Data examination
Exploring a dataset is a crucial first step to understand its structure and content. Several functions and operators are used for this purpose.
#### 2.1.1 Accessing and inspecting data
* The `$` operator is used to access specific columns (variables) within a data frame. For example, `cars$mpg` retrieves the `mpg` column from the `cars` dataset [2](#page=2).
* To recall a specific value within a table (data frame), you can use square bracket notation: `dataframe[row number, column number]` [2](#page=2).
* `names(dataframe)` displays all column names of a data frame [2](#page=2).
* `head(dataframe)` shows the column names and the top 6 rows of the data frame. You can specify the number of rows to display using `head(dataframe, n=x)` [2](#page=2).
* `nrow(dataframe)` returns the number of rows in a dataset [2](#page=2).
* `ncol(dataframe)` returns the number of columns in a dataset [2](#page=2).
#### 2.1.2 Data type conversion
* Data can be converted into the factor class using the command `dataframe$column <- factor(dataframe$column)`; the analogous functions `as.integer()` and `as.numeric()` convert data to the integer or numeric classes [2](#page=2).
* To check the data type of a variable, use `class(dataframe$column)` [2](#page=2).
#### 2.1.3 Subsetting data based on conditions
* You can select data from a specific column that meets certain criteria using `which(dataframe$column == "category")`. This function returns the indices of the elements that satisfy the condition [2](#page=2).
### 2.2 Graphical presentation
Visualizing data is vital for understanding distributions, relationships, and patterns.
#### 2.2.1 Boxplots
Boxplots are useful for visualizing the distribution of a numerical variable and identifying potential outliers.
* A basic boxplot for a single variable is created with `boxplot(dataframe$column)`. For instance, `boxplot(mice$weight)` would generate a boxplot of mice weights [2](#page=2).
* To create boxplots grouped by another variable, use the syntax `boxplot(dataframe$column1 ~ dataframe$column2)` or, equivalently, `boxplot(column1 ~ column2, data = dataframe)`. Here, `column1` is the response variable (dependent), and `column2` is the explanatory variable (used for grouping). For example, `boxplot(mice$weight ~ mice$genotype)` would group the weights by genotype [3](#page=3).
> **Tip:** When creating grouped boxplots, ensure the grouping variable is a factor or a categorical type for proper visualization.
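A small sketch with a hypothetical `mice` data frame containing a numeric `weight` column and a categorical `genotype` column:
```r
mice$genotype <- factor(mice$genotype)     # ensure the grouping variable is a factor
boxplot(mice$weight)                       # single-variable boxplot
boxplot(weight ~ genotype, data = mice)    # weights grouped by genotype
```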
#### 2.2.2 Histograms
Histograms display the distribution of a numerical variable by dividing the data into bins.
* A basic histogram is generated using `hist(dataframe$column)` [3](#page=3).
* To create a histogram for a specific subset of data, such as a particular treatment group, you can combine `hist()` with the subsetting condition: `hist(dataframe$column[which(dataframe$treatment == "treatment group type")])`. For example, `hist(data1$weight[which(data1$treatment=="control")])` would show the distribution of weights only for the control group [3](#page=3).
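A sketch using the hypothetical `data1` data frame with `weight` and `treatment` columns:
```r
hist(data1$weight)                                        # distribution of all weights
hist(data1$weight[which(data1$treatment == "control")])   # control group only
```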
#### 2.2.3 Quantile-Quantile (Q-Q) plots
Q-Q plots are used to assess whether a dataset follows a specified theoretical distribution, typically the normal distribution.
* `qqnorm(dataframe$column)` generates a Q-Q plot for the specified column [3](#page=3).
* `qqline(dataframe$column)` adds a reference line of best fit to the Q-Q plot [3](#page=3).
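A minimal sketch, again assuming a numeric `weight` column in a hypothetical `data1` data frame:
```r
qqnorm(data1$weight)   # data quantiles vs. theoretical normal quantiles
qqline(data1$weight)   # reference line; points close to it suggest normality
```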
#### 2.2.4 Scatter plots
Scatter plots are used to visualize the relationship between two numerical variables.
* The general syntax for a scatter plot is `plot(a ~ b, data = dataframe)`, where `b` is the independent/explanatory variable (plotted on the x-axis) and `a` is the dependent/response variable (plotted on the y-axis). The tilde symbol `~` signifies "is explained by" [3](#page=3).
* Plot customization options include:
* `xlab = "x-axis title"` to set the x-axis label [3](#page=3).
* `ylab = "y-axis title"` to set the y-axis label [3](#page=3).
* `main = "graph title"` to set the main title of the graph [3](#page=3).
* `xlim = c(x, y)` to set the limits of the x-axis [3](#page=3).
* `ylim = c(x, y)` to set the limits of the y-axis [3](#page=3).
* Text can be added to any location on the plot using `text(x_coordinate, y_coordinate, "text")` [3](#page=3).
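A sketch of a customized scatter plot; the column names, axis labels, and coordinates are illustrative:
```r
plot(weight ~ age, data = data1,                 # weight explained by age
     xlab = "Age (weeks)", ylab = "Weight (g)",
     main = "Weight vs. age",
     xlim = c(0, 20), ylim = c(0, 50))
text(5, 45, "Example annotation")                # text placed at x = 5, y = 45
```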
#### 2.2.5 Pairwise scatter plots
* `pairs(dataframe)` generates a matrix of scatter plots, showing pairwise relationships between all numerical columns in a data frame. This is a quick way to visualize multiple relationships simultaneously [3](#page=3).
### 2.3 Summarising data
Summarizing data provides key statistical information about the variables in a dataset.
* `summary(dataframe)` provides a comprehensive summary of all variables in the dataset, including minimum, maximum, mean, first quartile (Q1), median, and third quartile (Q3) [3](#page=3).
* `summary(dataframe$column)` provides a summary for a specific column [3](#page=3).
---
# Statistical summaries and correlation/regression
This section details how to generate descriptive statistics for datasets and individual variables, including group-wise summaries, and introduces the concepts and functions for analyzing correlations and linear regressions.
### 3.1 Summarizing data
#### 3.1.1 Overall dataset summary
The `summary()` function in R provides a comprehensive overview of all variables within a dataset. For each column (variable), it typically outputs [3](#page=3) [4](#page=4):
* **Minimum value:** The smallest observation in the dataset [3](#page=3).
* **1st Quartile (Q1):** The value below which 25% of the data falls [3](#page=3).
* **Median:** The middle value of the dataset when ordered [3](#page=3).
* **Mean:** The average of all observations [3](#page=3).
* **3rd Quartile (Q3):** The value below which 75% of the data falls [3](#page=3).
* **Maximum value:** The largest observation in the dataset [3](#page=3).
#### 3.1.2 Summary for a specific column
To obtain a summary for a single column (variable) within a dataset, you can specify the column name within the `summary()` function [3](#page=3):
`summary(dataframe$column)`
#### 3.1.3 Grouped summaries
To calculate summary statistics for a variable, broken down by a grouping variable, the `describeBy()` function from the `psych` package is useful. This function provides detailed statistics for each group, including [4](#page=4):
* Variance [4](#page=4).
* Number of observations (`n`) [4](#page=4).
* Mean [4](#page=4).
* Standard deviation [4](#page=4).
* Median [4](#page=4).
* Trimmed mean [4](#page=4).
* Median absolute deviation [4](#page=4).
* Minimum and Maximum values [4](#page=4).
* Range [4](#page=4).
* Skewness and Kurtosis [4](#page=4).
* Standard error of the mean [4](#page=4).
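A minimal sketch, assuming the `psych` package is installed and a hypothetical `data1` data frame with `weight` and `treatment` columns:
```r
library(psych)
describeBy(data1$weight, group = data1$treatment)   # summary statistics per treatment group
```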
The `aggregate()` function can also be used to calculate specific statistics for groups.
* **Calculating Mean:** `aggregate(a ~ b, data = dataframe, FUN = mean)` [4](#page=4).
* **Calculating Standard Deviation (SD):** `aggregate(a ~ b, data = dataframe, FUN = sd)` [4](#page=4).
* **Calculating Standard Error of the Mean (SEM):** This requires a custom function:
`aggregate(a ~ b, data = dataframe, FUN = function(x) sd(x) / sqrt(length(x)))`, where `length(x)` is the number of observations in each group [4](#page=4).
* **Calculating Confidence Interval (CI):** This also requires a custom function, using a multiplier (e.g., 2.571 from the t-distribution for small groups, or about 2 for a large sample size) together with the SD and the group size:
`aggregate(a ~ b, data = dataframe, FUN = function(x) 2.571 * sd(x) / sqrt(length(x)))` [4](#page=4).
The `do.call()` function can be used to apply multiple aggregate functions at once and store the results in a data frame [24](#page=24) [4](#page=4).
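A sketch of these group summaries with hypothetical columns `weight` (the measured variable `a`) and `treatment` (the grouping variable `b`); the 2.571 multiplier assumes the small-group t-distribution example above:
```r
aggregate(weight ~ treatment, data = data1, FUN = mean)       # group means
aggregate(weight ~ treatment, data = data1, FUN = sd)         # group SDs
aggregate(weight ~ treatment, data = data1,
          FUN = function(x) sd(x) / sqrt(length(x)))          # SEM per group
aggregate(weight ~ treatment, data = data1,
          FUN = function(x) 2.571 * sd(x) / sqrt(length(x)))  # CI half-width
```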
### 3.2 Correlation and Regression
Correlation and regression analysis are used to understand the relationship between two or more variables.
#### 3.2.1 Correlation
Correlation measures the strength and direction of the linear relationship between two variables [17](#page=17).
* **Calculating correlation between two variables:** The `cor(a, b)` function calculates the Pearson correlation coefficient between variables `a` and `b` [4](#page=4).
* **Calculating correlation for all pairs of variables in a dataframe:** The `cor(dataframe)` function computes the correlation matrix for all possible pairs of columns within the specified dataframe [17](#page=17) [4](#page=4).
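A short sketch with hypothetical numeric columns:
```r
cor(data1$weight, data1$age)                  # Pearson correlation of two variables
cor(data1[, c("weight", "age", "height")])    # correlation matrix of numeric columns
```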
#### 3.2.2 Linear Regression
Linear regression models the relationship between a dependent variable and one or more independent variables by fitting a linear equation to the observed data.
* **Fitting a linear model:** The `lm()` function is used to build a linear regression model. The syntax `lm(column1 ~ column2, data = dataframe)` specifies that `column1` (the response variable) is explained by `column2` (the explanatory variable) within the `dataframe`. The model equation is stored in an object (e.g., `fit`) [17](#page=17) [18](#page=18) [19](#page=19) [20](#page=20) [21](#page=21) [22](#page=22) [23](#page=23) [24](#page=24) [4](#page=4).
* **Interpreting the regression model:** The `summary(fit)` function provides a detailed output of the regression analysis, including coefficients, standard errors, t-values, p-values, and the R-squared value [17](#page=17) [18](#page=18) [19](#page=19) [20](#page=20) [21](#page=21) [22](#page=22) [23](#page=23) [24](#page=24).
* The "coefficients" section details the intercept and slope of the regression line [17](#page=17).
* The **adjusted R-squared value** indicates how well the data fit the line, ranging from 0 to 1, where 1 represents a perfect fit. However, limitations exist: large scatter around the line can produce a low R-squared even when a linear trend is present, and a high R-squared does not guarantee the line is a good fit if the residuals show consistent, systematic deviations from it [17](#page=17).
* **Visualizing the regression line:** A scatter plot of the two variables can be created using `plot(column1 ~ column2, data = dataframe)`, and the fitted regression line can be added using `abline(fit)` [17](#page=17) [18](#page=18) [19](#page=19) [20](#page=20) [21](#page=21) [22](#page=22) [23](#page=23) [24](#page=24) [5](#page=5).
* **Predicting new values:** The `predict()` function uses the fitted model to estimate the dependent variable for new, unseen independent variable values [17](#page=17) [18](#page=18) [19](#page=19) [20](#page=20) [21](#page=21) [22](#page=22) [23](#page=23) [24](#page=24).
* A new data frame must be created for the input values: `newdata <- data.frame(column=c(value))` [17](#page=17) [5](#page=5).
* The prediction can include confidence intervals: `predict(object, newdata, interval = "confidence")` [17](#page=17) [5](#page=5).
* **Analyzing residuals:** Residuals (the difference between observed and predicted values) can be plotted against the fitted values to check model assumptions: `plot(fit$residuals ~ fit$fitted.values)`. A horizontal line at `y=0` should be added using `abline(h = 0)` [17](#page=17) [18](#page=18) [19](#page=19) [20](#page=20) [21](#page=21) [22](#page=22) [23](#page=23) [24](#page=24) [5](#page=5).
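A sketch of the full regression workflow described above, using the hypothetical columns `weight` and `age`:
```r
fit <- lm(weight ~ age, data = data1)     # weight explained by age
summary(fit)                              # coefficients, p-values, R-squared

plot(weight ~ age, data = data1)          # scatter plot of the raw data
abline(fit)                               # add the fitted regression line

newdata <- data.frame(age = c(10))        # new value of the explanatory variable
predict(fit, newdata, interval = "confidence")

plot(fit$residuals ~ fit$fitted.values)   # residuals vs. fitted values
abline(h = 0)                             # horizontal reference line at zero
```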
#### 3.2.3 Plotting relationships
* **Scatter plots:** Visualize the relationship between two variables. The format `plot(dependent_variable ~ independent_variable, data = dataframe)` is used [17](#page=17) [3](#page=3).
* **Pairwise scatter plots:** Generate scatter plots for all combinations of variables in a dataframe using `pairs(dataframe)` [17](#page=17) [3](#page=3).
* **Limiting plot axes:** Use `xlim = c(min, max)` and `ylim = c(min, max)` to set the display range for the x and y axes, respectively [3](#page=3).
* **Adding text to plots:** The `text(x_coordinate, y_coordinate, "Your text")` function allows adding custom text labels to specific locations on a plot [3](#page=3).
---
# Probability distributions and hypothesis testing
This section covers the fundamentals of probability distributions, specifically binomial and normal distributions, and then delves into the principles of hypothesis testing, including its components, potential errors, and related metrics.
### 4.1 Probability distributions
Probability distributions are graphical representations of the probabilities of different outcomes within a dataset or experiment. They are theoretical and describe the likelihood of each possible result [10](#page=10).
#### 4.1.1 Binomial distribution
The binomial distribution is used for discrete data and models scenarios with a fixed number of independent trials, each having only two possible outcomes: "success" and "failure" [10](#page=10).
**Conditions for a binomial distribution:**
* There are exactly two outcomes for each trial: success and failure [10](#page=10).
* The number of trials is fixed [10](#page=10).
* Each trial is independent of the others [10](#page=10).
* The probability of success ($p$) is the same for every trial [10](#page=10).
**Key functions for binomial distribution:**
* `dbinom(x, size, prob)`: Calculates the probability of an *exact* number of successes.
* `x`: The exact number of successes.
* `size`: The total number of trials.
* `prob`: The probability of success in a single trial.
* `pbinom(q, size, prob)`: Calculates the *cumulative* probability, meaning the probability of getting up to and including a certain number of successes.
* `q`: The maximum number of successes to include.
* To find the probability of *more than* a certain number of successes (e.g., $P(X > x)$), you can use `1 - pbinom(x, size, prob)` [5](#page=5).
* To find the number of successes required for a certain cumulative probability, you can use `qbinom(cumulative probability, size, prob)` [5](#page=5).
**Visualizing binomial distributions:**
* `barplot(dbinom(x_values, size, prob))`: Creates a bar plot showing the probability of each outcome in the distribution [6](#page=6).
* `barplot(pbinom(q_values, size, prob))`: Plots cumulative probabilities [6](#page=6).
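A sketch with illustrative numbers (10 trials, probability of success 0.5):
```r
dbinom(3, size = 10, prob = 0.5)        # P(X = 3)
pbinom(3, size = 10, prob = 0.5)        # P(X <= 3)
1 - pbinom(3, size = 10, prob = 0.5)    # P(X > 3)
qbinom(0.95, size = 10, prob = 0.5)     # smallest x with P(X <= x) >= 0.95
barplot(dbinom(0:10, size = 10, prob = 0.5), names.arg = 0:10)  # whole distribution
```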
> **Tip:** When defining success and failure for a binomial distribution, ensure they are mutually exclusive and cover all possibilities within a trial.
#### 4.1.2 Normal distribution
The normal distribution, also known as the Gaussian distribution or bell curve, is used for continuous data that is often normally distributed. It is characterized by its mean and standard deviation [10](#page=10) [6](#page=6).
**Key function for normal distribution:**
* `pnorm(q, mean, sd, lower.tail = TRUE)`: Calculates the cumulative probability for a normal distribution.
* `q`: The value up to which to calculate the probability.
* `mean`: The mean of the distribution.
* `sd`: The standard deviation of the distribution.
* `lower.tail = TRUE` (default): Calculates $P(X \le q)$.
* `lower.tail = FALSE`: Calculates $P(X > q)$ (upper tail probability).
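A sketch with illustrative parameters (mean 20, standard deviation 4):
```r
pnorm(25, mean = 20, sd = 4)                       # P(X <= 25)
pnorm(25, mean = 20, sd = 4, lower.tail = FALSE)   # P(X > 25)
```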
> **Tip:** For continuous data, the probability of any single exact value is theoretically zero. We are interested in ranges of values.
### 4.2 Hypothesis testing
Hypothesis testing is a statistical method used to make conclusions or predictions about a population based on sample data. It involves formulating hypotheses and using sample evidence to determine if the hypotheses can be rejected [10](#page=10).
#### 4.2.1 Null and alternative hypotheses
* **Null hypothesis ($H_0$)**: This is the default assumption, typically stating that there is no effect, no difference, or no relationship. It is assumed to be true until evidence suggests otherwise [10](#page=10) [15](#page=15).
* **Alternative hypothesis ($H_A$ or $H_1$)**: This is a statement that contradicts the null hypothesis, proposing that there is an effect, a difference, or a relationship. The null and alternative hypotheses are mutually exclusive [10](#page=10) [15](#page=15).
**Steps in hypothesis testing:**
1. **Formulate hypotheses**: Define the null ($H_0$) and alternative ($H_A$) hypotheses [15](#page=15).
2. **Set significance level ($\alpha$)**: This is the probability of rejecting the null hypothesis when it is actually true (Type I error). Commonly set at 0.05 [10](#page=10) [15](#page=15).
3. **Collect data**: Gather sample data relevant to the hypotheses.
4. **Calculate test statistic**: Compute a statistic based on the sample data.
5. **Determine p-value**: The probability of observing a test statistic as extreme as, or more extreme than, the one calculated from the sample, assuming the null hypothesis is true [10](#page=10) [15](#page=15).
6. **Make a decision**:
* If $p$-value $\le \alpha$, reject $H_0$ in favor of $H_A$ [10](#page=10) [15](#page=15).
* If $p$-value $> \alpha$, fail to reject $H_0$ [10](#page=10) [15](#page=15).
> **Tip:** The p-value does not measure the effect size or confirm the truth of the hypothesis; it only indicates how compatible the data is with the null hypothesis.
#### 4.2.2 Type I and Type II errors
Errors can occur in hypothesis testing, leading to incorrect conclusions.
* **Type I Error (False Positive)**: Rejecting the null hypothesis ($H_0$) when it is actually true. The probability of a Type I error is denoted by $\alpha$ (the significance level). The sample size does not affect the probability of a Type I error [10](#page=10) [15](#page=15).
* **Type II Error (False Negative)**: Failing to reject the null hypothesis ($H_0$) when it is actually false. The probability of a Type II error is denoted by $\beta$. Increasing the sample size can reduce the probability of a Type II error [10](#page=10) [15](#page=15).
#### 4.2.3 Power
**Statistical power** is the probability of correctly rejecting a false null hypothesis; it is the complement of the Type II error rate ($1 - \beta$) [10](#page=10) [15](#page=15).
* Higher power means a greater likelihood of detecting a real effect if one exists [15](#page=15).
* A powerful study is more reliable and increases certainty [15](#page=15).
**Factors that increase power:**
* Larger sample size [15](#page=15).
* Larger effect size [15](#page=15).
* Higher significance level ($\alpha$) (though this also increases Type I error risk) [15](#page=15).
* Using a one-tailed test (when justified) [15](#page=15).
* Lower variance within the data [15](#page=15).
#### 4.2.4 Effect size
**Effect size** quantifies the magnitude or practical importance of a statistical difference or relationship, indicating how meaningful the finding is in the real world [10](#page=10) [15](#page=15).
* A study can achieve a statistically significant p-value with a small effect size, especially with large sample sizes, meaning the result is statistically detectable but not practically important [15](#page=15).
* Effect size can be measured using metrics like Cohen's $d$, the correlation coefficient ($r$), or $R^2$ [15](#page=15).
**Factors that increase effect size:**
* A larger true difference between groups [15](#page=15).
* Lower variability in the data [15](#page=15).
* Less measurement error [15](#page=15).
* Improved study design, such as controlled experiments and reduced confounding variables [15](#page=15).
> **Tip:** While the p-value tells you if an effect is statistically significant, the effect size tells you if it is practically important.
### 4.3 Key concepts and functions in R
* **Binomial Distribution Functions:**
* `dbinom(x, size, prob)`: Probability of an exact number of successes.
* `pbinom(q, size, prob)`: Cumulative probability up to $q$ successes.
* `qbinom(p, size, prob)`: The number of successes for a given cumulative probability $p$.
* **Normal Distribution Function:**
* `pnorm(q, mean, sd)`: Cumulative probability for a normal distribution.
* **Hypothesis Testing Function (Binomial):**
* `binom.test(x, n, p, alternative)`: Performs a binomial test.
* `x`: Number of successes.
* `n`: Number of trials.
* `p`: Hypothesized probability of success under $H_0$.
* `alternative`: Specifies "less", "greater", or "two.sided" for one- or two-tailed tests.
> **Tip:** When performing a one-tailed binomial test, ensure the `alternative` argument correctly reflects the direction of the alternative hypothesis.
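A sketch of a one-tailed binomial test with illustrative numbers (14 successes in 20 trials, H0: p = 0.5):
```r
binom.test(x = 14, n = 20, p = 0.5, alternative = "greater")  # HA: p > 0.5
```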
---
# Specific statistical tests and ANOVA
This section details the assumptions, applications, and interpretation of t-tests and Analysis of Variance (ANOVA), including methods for post-hoc analysis and multiple comparisons correction.
### 5.1 T-tests
T-tests are statistical methods used to evaluate whether there is a statistically significant difference between the mean of a sample and a known value, or between the means of two samples. They use the mean and standard deviation of a sample to estimate how well it represents the population [10](#page=10) [15](#page=15).
#### 5.1.1 Types of T-tests
* **One-sample t-test:** Used to compare a single sample's mean to a known population mean [10](#page=10) [15](#page=15).
* **R Code:** `t.test(group, mu = x)` where `group` is the data for the sample and `mu` is the population mean being compared against [7](#page=7).
* **Two-sample t-test:** Used to compare the means of two independent samples [10](#page=10) [15](#page=15).
* **R Code:** `t.test(dataframe$sample1, dataframe$sample2)` [7](#page=7).
* When performing a two-sample t-test in R, the assumption of equal variances is not automatically made. If variances are known to be equal, `var.equal = TRUE` can be added to the function, which may increase the test's power, though often with little advantage [10](#page=10) [15](#page=15).
* **Paired t-test:** Applied when samples are closely related, analyzing data in pairs, such as measurements before and after an intervention on the same individuals [10](#page=10) [15](#page=15).
* **R Code:** `t.test(dataframe$sample1, dataframe$sample2, paired = TRUE)` [7](#page=7).
* For a paired t-test, swapping the order of the two samples (e.g., `pre_supplementation` and `post_supplementation`) changes only the sign of the estimated mean difference, not the two-sided p-value [7](#page=7).
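A sketch of the three variants, using hypothetical columns of a `data1` data frame:
```r
t.test(data1$weight, mu = 25)                              # one-sample, against a population mean of 25
t.test(data1$weight_A, data1$weight_B)                     # two-sample (unequal variances by default)
t.test(data1$weight_A, data1$weight_B, var.equal = TRUE)   # assuming equal variances
t.test(data1$pre_supplementation, data1$post_supplementation, paired = TRUE)  # paired
```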
#### 5.1.2 Assumptions of T-tests
T-tests rely on several assumptions about the data:
1. **Continuous Dependent Variable and Binary Independent Variable:** The dependent variable must be continuous, and the independent variable must be categorical with only two levels (binary). For example, comparing the effect of a "normal" diet versus a "western" diet (binary independent variable) on the time spent running on an exercise wheel (continuous dependent variable) [10](#page=10) [15](#page=15).
2. **Normal Distribution:** The population from which the samples are drawn is assumed to have a normal distribution. A normal quantile-quantile plot (Q-Q plot) can help assess if the data fits a normal distribution model by comparing dataset quantiles to theoretical normal distribution quantiles; a straight line indicates a good fit [10](#page=10) [15](#page=15).
3. **Equal Variances (Homoscedasticity):** The two populations from which the samples are drawn are assumed to have equal variances (spread of data). This can be checked by examining the variance or standard deviation of summary statistics. In R, `describeBy()` can be used to check summary statistics by group. A common heuristic is to consider variances equal if the ratio of the larger variance to the smaller variance is less than 4, though this is an estimation and may not be accurate for small sample sizes [10](#page=10) [15](#page=15).
#### 5.1.3 Interpreting T-test Output
* **Confidence Interval:** The 95% confidence interval for the difference between means is provided. If this interval does not include 0, it suggests a statistically significant difference, indicating that the population mean difference is unlikely to be zero [10](#page=10) [15](#page=15).
* **P-value:** While the confidence interval is informative, the p-value is considered the most important indicator of significance [10](#page=10) [15](#page=15).
### 5.2 Analysis of Variance (ANOVA)
ANOVA is a statistical test used to compare the means of three or more groups by comparing the variance within the groups to the variance between the groups. It determines if the observed differences between sample means are likely due to random variation or a genuine effect [10](#page=10) [15](#page=15).
#### 5.2.1 Performing ANOVA in R
**Assumptions for ANOVA:**
* Data should be normally distributed [10](#page=10) [15](#page=15).
* Observations within each group and between groups should be independent [10](#page=10) [15](#page=15).
* Groups must have equal variances (homoscedasticity) [10](#page=10) [15](#page=15).
**R Code:**
* To perform an ANOVA test: `aov(dependent_variable ~ independent_variable, data = dataframe)` [8](#page=8).
* To view the results: `summary(aov_test_output)` [10](#page=10) [15](#page=15).
**Understanding ANOVA Output:**
* The output typically includes Sum of Squares, Degrees of Freedom (DF), Mean Squares, an F-statistic, and a p-value [10](#page=10) [15](#page=15).
* **F-statistic:** Calculated as the ratio of the variance between groups to the variance within groups. A high F-statistic suggests that variation between groups is larger than within groups, potentially indicating a statistically significant effect [10](#page=10) [15](#page=15).
* **P-value:** Indicates the probability of observing the data (or more extreme data) if the null hypothesis (all group means are equal) were true. A p-value below the significance level (alpha) leads to rejecting the null hypothesis [10](#page=10) [15](#page=15).
* **Reporting ANOVA:** Results are typically reported as $F(\text{df}_{\text{between}}, \text{df}_{\text{within}}) = \text{F-value}, p = \text{p-value}$. For example: $F(2, 42) = 39.45, p = 2.21 \times 10^{-10}$ [10](#page=10) [15](#page=15).
#### 5.2.2 Post-Hoc Tests
If an ANOVA test yields a significant result (i.e., a significant p-value), it indicates that at least one group mean differs from the others, but it does not specify *which* groups differ. Post-hoc tests are used to identify these specific differences [10](#page=10) [15](#page=15).
* **Tukey's Honestly Significant Difference (HSD) Test:** A common post-hoc test used after ANOVA.
* **R Code:** `TukeyHSD(aov_test_output)` [8](#page=8).
* The output provides the differences between conditions and adjusted p-values. Adjusted p-values below the significance level indicate significant differences between specific group pairs [10](#page=10) [15](#page=15).
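A sketch of the ANOVA workflow with hypothetical columns `weight` (response) and `diet` (grouping factor):
```r
aov_test <- aov(weight ~ diet, data = data1)   # fit the ANOVA model
summary(aov_test)                              # F-statistic, degrees of freedom, p-value
TukeyHSD(aov_test)                             # pairwise differences with adjusted p-values
```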
### 5.3 Correcting for Multiple Comparisons
When performing multiple statistical tests, the probability of obtaining a false positive (Type I error) increases. This is known as the **family-wise error rate (FWER)**. Methods to correct for multiple testing are employed to maintain a desired overall significance level [10](#page=10) [15](#page=15).
* **Bonferroni Correction:** This method controls the family-wise error rate by dividing the original alpha level by the number of tests performed. It is a strict method, reducing the chance of false positives but increasing the risk of false negatives (Type II errors), thus decreasing statistical power. It is best used when the number of tests is small and tolerance for false positives is very low [10](#page=10) [15](#page=15).
* **Benjamini-Hochberg (BH) Correction:** This method controls the **false discovery rate (FDR)**, which is the expected proportion of false positives among the rejected null hypotheses. It is generally more powerful than Bonferroni, especially with a large number of tests, as it allows for some false positives while limiting their overall proportion. If an adjusted p-value (using BH) is below 0.05, it is expected that about 5% of these significant results might be false positives [10](#page=10) [15](#page=15).
* **R Code:** `p.adjust(p_values, method = "bonferroni")` or `p.adjust(p_values, method = "BH")`. The `p_values` would be a vector or column of p-values obtained from multiple tests [9](#page=9).
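A sketch with illustrative p-values from four separate tests:
```r
p_values <- c(0.001, 0.02, 0.04, 0.30)
p.adjust(p_values, method = "bonferroni")   # controls the family-wise error rate
p.adjust(p_values, method = "BH")           # controls the false discovery rate
```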
### 5.4 Power and Sample Size Calculation
The `pwr` package in R can be used for power and sample size calculations. The `pwr.t.test()` function can be used for t-tests, requiring three of the four values: sample size (`n`), effect size (`d` for Cohen's d), significance level (`sig.level`), and power (`power` = 1 - $\beta$) [7](#page=7).
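A sketch that solves for the sample size per group, given an illustrative effect size, significance level, and power:
```r
library(pwr)
pwr.t.test(d = 0.8, sig.level = 0.05, power = 0.8)   # n is left out, so it is calculated
```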
---
## Common mistakes to avoid
- Review all topics thoroughly before exams
- Pay attention to formulas and key definitions
- Practice with examples provided in each section
- Don't memorize without understanding the underlying concepts
## Glossary
| Term | Definition |
|---|---|
| Dataframe | A two-dimensional data structure in R, similar to a table, where columns can contain different data types. |
| Vector | A one-dimensional array in R that can hold a sequence of elements of the same basic type, such as numbers, characters, or logical values. |
| Package | A collection of R functions, data, and compiled code that can be loaded into an R session to extend its functionality. |
| Working directory | The default location on your computer where R looks for files to load and saves files to by default. |
| Factor | A data structure in R used to store categorical data, where values are treated as categories or labels. |
| Boxplot | A graphical representation that displays the distribution of data through their quartiles, with outliers often plotted as individual points. |
| Histogram | A graphical representation of the distribution of numerical data, where the data is binned, and the frequency of data points in each bin is shown as bars. |
| QQ plot (Quantile-Quantile plot) | A graphical tool used to assess whether a dataset follows a certain distribution, typically comparing the quantiles of the sample data against the quantiles of a theoretical distribution. |
| Null hypothesis (H0) | A statement that there is no significant difference or relationship between variables or groups, serving as a baseline for statistical testing. |
| Alternative hypothesis (HA) | A statement that contradicts the null hypothesis, suggesting there is a significant difference or relationship between variables or groups. |
| P-value | The probability of obtaining test results at least as extreme as the results actually observed, assuming that the null hypothesis is true. |
| Significance level (alpha) | A threshold used in hypothesis testing to determine whether to reject the null hypothesis. Commonly set at 0.05. |
| Correlation coefficient | A statistical measure that indicates the strength and direction of a linear relationship between two variables, ranging from -1 (perfect negative correlation) to +1 (perfect positive correlation). |
| Linear regression | A statistical method used to model the relationship between a dependent variable and one or more independent variables by fitting a linear equation to the observed data. |
| Residuals | The difference between an observed value and the value predicted by a statistical model, often used to assess the model's fit. |
| T-test | A statistical hypothesis test used to determine if there is a significant difference between the means of two groups or between a sample mean and a population mean. |
| Paired t-test | A specific type of t-test used when the observations are paired or related, such as measurements taken from the same subject before and after an intervention. |
| ANOVA (Analysis of Variance) | A statistical test used to compare the means of three or more groups to determine if there are any statistically significant differences among them. |
| Tukey's HSD (Honestly Significant Difference) | A post-hoc test used after ANOVA to determine which specific pairs of group means are significantly different from each other. |
| Bonferroni correction | A method used to control the family-wise error rate when performing multiple statistical tests, by adjusting the significance level for each test. |
| False Discovery Rate (FDR) | The expected proportion of 'discoveries' (i.e., rejected null hypotheses) that are actually false positives. |
| Standard Error of the Mean (SEM) | A measure of the dispersion of sample means around the population mean, calculated as the sample standard deviation divided by the square root of the sample size. |
| Confidence Interval (CI) | A range of values, derived from sample statistics, that is likely to contain the value of an unknown population parameter with a certain level of confidence (e.g., 95% CI). |
| Effect size | A measure of the magnitude of a phenomenon, indicating the strength of the relationship or difference between variables, independent of sample size. |
| Type I error (False Positive) | The error of rejecting the null hypothesis when it is actually true. |
| Type II error (False Negative) | The error of failing to reject the null hypothesis when it is actually false. |
| Power | The probability of correctly rejecting the null hypothesis when it is false, essentially the ability of a test to detect a true effect. |
| Observational study | A study where researchers observe subjects and measure variables of interest without assigning treatments or interventions. |
| Experimental study | A study where researchers manipulate one or more variables (independent variables) and observe their effect on a dependent variable, while controlling other factors. |
| Independent variable | The variable that is manipulated or changed by the researcher in an experiment to observe its effect on the dependent variable. |
| Dependent variable | The variable that is measured in an experiment to see if it is affected by changes in the independent variable. |
| Confounding variable | An extraneous variable that can influence both the independent and dependent variables, potentially distorting the observed relationship. |
| Technical replicates | Multiple measurements taken from the same biological sample to assess the precision and reliability of the experimental technique. |
| Biological replicates | Independent samples from different biological sources that are subjected to the same experimental conditions to account for natural biological variability. |
| Negative control | A group or condition in an experiment where no effect is expected, used as a baseline for comparison. |
| Positive control | A group or condition in an experiment where an effect is known to occur, used to validate the experimental setup and confirm that the system is responsive. |
| Descriptive statistics | Statistical methods used to summarize and describe the main features of a dataset, such as mean, median, and standard deviation. |
| Inferential statistics | Statistical methods used to draw conclusions or make predictions about a population based on a sample of data. |
| Binomial distribution | A probability distribution that describes the number of successes in a fixed number of independent Bernoulli trials, each with the same probability of success. |
| Normal distribution | A continuous probability distribution characterized by its bell-shaped curve, where data is symmetrically distributed around the mean. |
| Hypothesis testing | A statistical method used to determine whether there is enough evidence in a sample of data to infer that there is a significant difference or relationship in the population. |
| Randomization | The process of assigning subjects to treatment groups by chance, to minimize bias and ensure that groups are comparable. |
| Blinding | A procedure in clinical trials where participants (and sometimes researchers) are unaware of which treatment group participants have been assigned to, to prevent bias. |
| Placebo effect | A phenomenon where a participant's belief in a treatment can lead to a perceived or actual improvement in their condition, even if the treatment is inert. |
| Questionable Research Practices (QRPs) | Actions such as cherry-picking data or p-hacking that can lead to biased results or misleading conclusions, even if not outright fabrication or falsification. |
| Publication bias | The tendency for studies with statistically significant results to be more likely to be published than studies with non-significant results. |
| Sampling error | The difference between a sample statistic and the corresponding population parameter, due to the random nature of sampling. |
| Bias | A systematic error that leads to a distortion of results, causing them to deviate from the true value. |
| Mean | The average of a set of numbers, calculated by summing all values and dividing by the count of values. |
| Median | The middle value in a dataset that has been ordered from least to greatest. |
| Standard deviation | A measure of the amount of variation or dispersion in a set of data values, indicating how spread out the data is from the mean. |
| Correlation | A statistical measure that describes the extent to which two variables change together. |
| Regression | A statistical technique used to estimate the relationship between a dependent variable and one or more independent variables. |
| F-statistic | A statistic used in ANOVA and regression analysis that measures the ratio of variance between groups to variance within groups. |
| Sum of squares | A measure of the total variability in a dataset, calculated as the sum of the squared differences between each data point and the mean. |
| Degrees of freedom | The number of independent values that can be freely assigned when estimating a parameter. |
| Family-wise error rate (FWER) | The probability of making at least one Type I error (false positive) when performing multiple hypothesis tests. |
| Replicates | Repetitions of an experiment or measurement, used to assess variability and increase the reliability of results. |
| Blocking | A technique in experimental design where experimental units are grouped into homogeneous blocks to reduce variability and improve the precision of treatment comparisons. |
| Control group | A group in an experiment that does not receive the treatment being tested, serving as a baseline for comparison. |
| Bar plot | A chart that displays categorical data with rectangular bars with heights or lengths proportional to the values that they represent. |
| Continuous data | Data that can take any value within a given range, such as height or temperature. |
| Discrete data | Data that can only take specific, distinct values, often integers, such as the number of heads in coin flips. |