# Introduction to biostatistics and data variables
This section provides a foundational understanding of statistics and biostatistics, outlining their purpose and the critical importance of classifying data variables for effective analysis [2](#page=2) [3](#page=3).
### 1.1 Statistics and biostatistics
#### 1.1.1 What is statistics?
Statistics is the scientific discipline focused on developing and utilizing methods for collecting, analyzing, interpreting, and presenting data [2](#page=2).
#### 1.1.2 What is biostatistics?
Biostatistics applies statistical principles specifically within the domains of medicine, public health, and biology [2](#page=2).
#### 1.1.3 Utility of studying biostatistics
Studying biostatistics is valuable for several reasons:
* Designing and analyzing research studies [2](#page=2).
* Describing and summarizing collected data [2](#page=2).
* Analyzing data to generate scientific evidence supporting a hypothesis [2](#page=2).
* Determining if an observation is statistically significant or merely due to chance [2](#page=2).
* Understanding and critically evaluating published scientific research [2](#page=2).
* It forms a fundamental component of fields like clinical trials and epidemiological studies [2](#page=2).
#### 1.1.4 The statistical analysis journey
The process of statistical analysis typically involves the following steps:
* Formulating a research question from an initial research idea [2](#page=2).
* Selecting an appropriate study design and a suitable sample [2](#page=2).
* Conducting the study and gathering data [2](#page=2).
* Analyzing the data using the correct statistical method [2](#page=2).
* Obtaining and interpreting the p-value [2](#page=2).
* Drawing a conclusion or answering the research question [2](#page=2).
### 1.2 Data variables
A data variable is defined as "something that varies" or differs among individuals or groups. These are the elements about which data is collected. Examples include sex, age, weight, marital status, and satisfaction rate [3](#page=3).
#### 1.2.1 Importance of variable classification
Recognizing the type of each data variable is crucial for several reasons:
* **Data summarization:** The method used for summarization (e.g., mean with standard deviation versus frequency with percentage) depends on the variable type [3](#page=3).
* **Graphical presentation:** The choice of appropriate graph for data visualization is dictated by the variable type [3](#page=3).
* **Data analysis:** Selecting suitable statistical tests is contingent upon the type of data variables involved [3](#page=3).
#### 1.2.2 General classification of data variables
Data variables are generally classified into two main types [3](#page=3):
* **A. Categorical variables:** These are further classified as nominal or ordinal [3](#page=3).
* **B. Numerical variables:** These are further classified as discrete or continuous [3](#page=3).
### 1.3 Categorical variables
Categorical variables, also known as qualitative data, do not possess a unit of measurement. They consist of distinct categories, and individuals are assigned to one of these categories [4](#page=4).
**Examples of categorical variables:**
* Satisfaction status (e.g., satisfied, neutral, not satisfied) [4](#page=4).
* Sex (e.g., male, female) [4](#page=4).
* Nationality (e.g., listing all countries) [4](#page=4).
* Agreement level (e.g., strongly disagree, disagree, undecided, agree, strongly agree) [4](#page=4).
> **Tip:** Categorical variables can sometimes be coded with numbers (e.g., 1 for female, 2 for male). Even when represented by numbers, they remain categories, and the numbers function solely as codes, not as actual numerical values [4](#page=4).
#### 1.3.1 Types of categorical variables
Categorical variables are sub-classified into nominal and ordinal types [5](#page=5).
##### 1.3.1.1 Nominal variables
Nominal variables are categorical variables that have no intrinsic order or ranking. The order in which these categories are presented is arbitrary [5](#page=5).
**Examples of nominal variables:**
* Sex (male, female) can be listed in any order [5](#page=5).
* Blood groups (A, B, AB, O) can be ordered in various ways [5](#page=5).
* Nationality cannot be inherently ordered [5](#page=5).
> **Tip:** A nominal variable with only two categories (e.g., sex, yes/no answers, disease status) is termed a dichotomous or binomial variable [5](#page=5).
##### 1.3.1.2 Ordinal variables
Ordinal variables are categorical variables that possess an order or ranking, and this order is meaningful [5](#page=5).
**Examples of ordinal variables:**
* BMI status (e.g., underweight, normal, overweight, obese, extremely obese) [5](#page=5).
* Agreement level (e.g., strongly disagree, disagree, undecided, agree, strongly agree) [5](#page=5).
### 1.4 Numerical variables
Numerical variables are those that are measured or counted, are represented by numbers, and have a unit of measurement [6](#page=6).
**Examples of numerical variables:**
* Waist circumference (in centimeters) [6](#page=6).
* Weight (in kilograms) [6](#page=6).
* Blood glucose level (in mg/dL) [6](#page=6).
* Number of children in a family [6](#page=6).
Numerical variables are classified as either discrete or continuous [6](#page=6).
#### 1.4.1 Discrete variables
Discrete variables can only take on integer values (whole numbers) without decimals, such as 0, 1, 2, 3, etc. They typically represent counts of something [6](#page=6).
**Examples of discrete variables:**
* Number of children in a family [6](#page=6).
* Number of stents inserted during a procedure [6](#page=6).
* Number of patient visits to a hospital [6](#page=6).
The unit of measurement in these cases indicates what is being counted (e.g., child, stent, visit) [6](#page=6).
#### 1.4.2 Continuous variables
Continuous variables can assume any real numerical value, including decimals (e.g., 14.55, 48.8, 178.2). They involve measurement and are associated with measurement units [7](#page=7).
**Examples of continuous variables:**
* Weight (in kilograms) [7](#page=7).
* Height (in centimeters) [7](#page=7).
* Blood glucose level (in mg/dL) [7](#page=7).
### 1.5 Differentiating data variable types
A systematic approach can be used to distinguish between different types of data variables [7](#page=7):
**Step 1: Check for a unit of measurement.**
* If a unit of measurement is absent, the variable is **categorical** [7](#page=7).
* If a unit of measurement is present, the variable is **numerical** [7](#page=7).
**Step 2: Further classify based on the initial determination.**
* **For categorical variables:**
* **Is there an order?**
* If No, it is **nominal** [7](#page=7).
* If Yes, it is **ordinal** [7](#page=7).
* **For numerical variables:**
* **Is it counted or measured?**
* If counted, it is **discrete** [7](#page=7).
* If measured, it is **continuous** [7](#page=7).
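As a quick illustration, the two-step decision rules above can be expressed as a small helper function. This is a sketch for intuition only; the function and its boolean inputs are hypothetical, not from the source.

```python
def classify_variable(has_unit: bool, is_ordered: bool = False,
                      is_counted: bool = False) -> str:
    """Apply the two-step rules: unit? -> categorical/numerical, then subtype."""
    if not has_unit:                                   # Step 1: no unit -> categorical
        return "ordinal" if is_ordered else "nominal"  # Step 2: is there an order?
    return "discrete" if is_counted else "continuous"  # Step 2: counted or measured?

print(classify_variable(has_unit=False, is_ordered=True))  # BMI status -> ordinal
print(classify_variable(has_unit=True, is_counted=True))   # number of children -> discrete
print(classify_variable(has_unit=True))                    # weight in kg -> continuous
```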
#### 1.5.1 Additional classifications of numerical variables
Some texts further divide numerical data into interval and ratio variables [8](#page=8).
* **Ratio variables:** Possess a true zero point, signifying complete absence. For example, zero weight means no weight, and 30 kg is twice 15 kg [8](#page=8).
* **Interval variables:** Lack a true zero. For instance, 0 degrees Celsius does not mean absence of heat, and 30 degrees Celsius is not twice as hot as 15 degrees Celsius [8](#page=8).
> **Tip:** Ordinal variables with many levels (e.g., a pain score on a 10-point scale) can often be treated as discrete variables in statistical analysis [8](#page=8).
> **Tip:** Continuous variables are sometimes recorded as discrete if they are measured to a certain precision. For example, age is often reported in whole years rather than exact age [8](#page=8).
### 1.6 Levels of data measurement and conversion
It is possible to convert data variables to a less precise type, but not vice versa. The hierarchy of data types, from most to least precise, is [9](#page=9):
`numerical continuous → numerical discrete → ordinal → nominal` [9](#page=9).
* Age, a numerical variable, can be converted to an ordinal variable by grouping it into age categories (e.g., young, middle-aged, old) [9](#page=9).
* These age categories (ordinal) can then be further simplified into a nominal variable with two levels (e.g., young, old) [9](#page=9).
* However, if data is collected in a categorical format, it cannot be transformed back into a numerical format [9](#page=9).
> **Tip:** Whenever feasible, collect data at the highest level of precision (numerical continuous or numerical discrete) because it offers more detail and can always be categorized later if required [9](#page=9).
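As an illustration of this hierarchy, the age example above can be reproduced with pandas' `pd.cut`. This is a minimal sketch; the cut points and labels are illustrative assumptions, not values from the source.

```python
import pandas as pd

ages = pd.Series([23, 35, 47, 52, 61, 70, 18, 44])   # numerical (years)

# Numerical -> ordinal: group into ordered age categories (illustrative cut points)
age_group = pd.cut(ages, bins=[0, 40, 60, 120],
                   labels=["young", "middle-aged", "old"], ordered=True)

# Ordinal -> nominal with two levels: collapse the categories further
young_old = age_group.map({"young": "young", "middle-aged": "old", "old": "old"})

print(age_group.value_counts().sort_index())
print(young_old.value_counts())
# The reverse direction is impossible: exact ages cannot be recovered from categories.
```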
### 1.7 Explanatory and response variables
When investigating a potential relationship where one variable is hypothesized to influence another, variables are termed explanatory and response variables. For instance, if BMI is thought to affect quality-of-life score, BMI is the explanatory variable, and quality-of-life score is the response variable [10](#page=10).
* The **explanatory variable** is also known as the independent variable or predictor variable [10](#page=10).
* The **response variable** is also known as the dependent variable or outcome variable [10](#page=10).
**Summary of variable types:**
* Data variables are classified as categorical or numerical based on the presence of a unit of measurement [10](#page=10).
* Categorical variables lack a unit of measurement and are either nominal (no intrinsic order) or ordinal (with meaningful order) [10](#page=10).
* Categorical variables with two levels are called binomial variables [10](#page=10).
* Numerical variables are measured or counted and are either continuous (any real value) or discrete (integer values) [10](#page=10).
---
# Data entry, exploration, and descriptive statistics
This section covers the essential steps of preparing and summarizing data for statistical analysis, from initial data entry to the calculation and interpretation of descriptive statistics.
### 2.1 Data entry
Effective data entry is crucial for ensuring the accuracy and usability of data for analysis. The primary goal is to arrange data in a spreadsheet format with specific characteristics for clarity and software compatibility [11](#page=11).
#### 2.1.1 Spreadsheet structure
A well-organized datasheet should follow these principles:
* **Columns represent variables:** Each column should contain data for a single variable. If a variable is measured multiple times (e.g., before and after an experiment), each measurement should occupy a separate column. Similarly, if a variable has components (e.g., blood pressure with systolic and diastolic), each component needs its own column [11](#page=11).
* **Uniform units:** All data within a single column must use the same unit of measurement. For instance, height should consistently be in meters or centimeters, and age in years or months [11](#page=11).
* **Rows represent cases:** Each row should represent a single unit of observation, such as a patient, animal, or location [11](#page=11).
* **Single data point per cell:** Each cell in the spreadsheet should contain only one data point, not combined values like systolic and diastolic blood pressure together [11](#page=11).
* **Numerical coding for categorical data:** Nominal and ordinal variables are best entered using numerical codes instead of text. For example, "Male" can be coded as 1 and "Female" as 2. A codebook detailing these numerical codes and their corresponding values should be maintained, ideally in a separate sheet within the same file [11](#page=11) [12](#page=12).
#### 2.1.2 Coding categorical data
Using numerical codes for categorical data simplifies entry, reduces typing errors, and enhances compatibility with statistical software. Recommended coding schemes include [12](#page=12):
* **Severity scales:** Mild = 1, Moderate = 2, Severe = 3 [12](#page=12).
* **Pain scales:** No pain = 0, Mild pain = 1, Moderate pain = 2, Severe pain = 3 [12](#page=12).
* **Binary variables:** Yes = 1, No = 0 [12](#page=12).
For questions allowing multiple answers, a separate column for each choice should be used, coded as 1 for "Yes" and 0 for "No". If a variable has open-ended answers or a very large number of possible responses, these answers must be evaluated and categorized into a limited number of groups for statistical analysis [12](#page=12) [13](#page=13).
#### 2.1.3 Tips for numerical data entry
* Be precise with decimal places [13](#page=13).
* Enter numbers as digits, not words [13](#page=13).
* Maintain consistent units (e.g., all in kilograms or all in pounds) [13](#page=13).
* Do not include units in the data cells [13](#page=13).
* Record basic measurements (e.g., weight, height) and calculate derived variables (e.g., BMI) later [13](#page=13).
* Collect exact values (e.g., exact age) rather than categorized ranges [13](#page=13).
* Ensure each cell contains only one data element (e.g., gestational age in days or weeks, not both) [13](#page=13).
#### 2.1.4 Coding missing data
Missing data should be coded using impossible values that cannot occur as valid data points for that variable. This distinguishes missing values from potential data entry errors. Examples include [13](#page=13):
* Binary variables (1, 0): Use 9 [13](#page=13).
* Categorical variables with three categories (1, 2, 3): Use 9 [13](#page=13).
* Age in years: Use 99 [13](#page=13).
* Weight in kilograms: Use 999 [13](#page=13).
It's important to note that "Refused to answer" and "Not applicable" are distinct from missing data and should be assigned different codes (e.g., 998, 997). Crucially, these missing data codes must be designated as "missing" within the statistical software to prevent them from being included in analyses inappropriately [13](#page=13).
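A minimal pandas sketch of this workflow, converting the impossible-value codes above into proper missing values; the DataFrame and its column names are hypothetical:

```python
import numpy as np
import pandas as pd

df = pd.DataFrame({
    "smoker":   [1, 0, 9, 1],       # binary (1, 0): 9 = missing
    "severity": [1, 3, 2, 9],       # three categories (1, 2, 3): 9 = missing
    "age":      [34, 99, 52, 41],   # years: 99 = missing
    "weight":   [70, 82, 999, 65],  # kilograms: 999 = missing
})

# Designate the impossible-value codes as missing (NaN) before any analysis
codes = {"smoker": [9], "severity": [9], "age": [99], "weight": [999]}
for col, missing_codes in codes.items():
    df[col] = df[col].replace(missing_codes, np.nan)

print(df.isna().sum())  # count of missing values per variable
```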
### 2.2 Exploring data for errors
Before conducting statistical analysis, it's essential to explore the dataset for potential errors. Common techniques include [14](#page=14):
* **Checking minimum and maximum values:** Identify any extreme values that appear incorrect or inconsistent with other data [14](#page=14).
* **Frequency distribution for categorical variables:** Examine the counts and categories to detect typing mistakes or unusual codes [14](#page=14).
* **Checking missing values:** Verify if missing data is genuinely unavailable or was overlooked during entry [14](#page=14).
* **Checking data consistency:** Ensure logical relationships between variables are maintained (e.g., a male cannot be pregnant, disease duration cannot exceed age, diastolic blood pressure cannot be greater than systolic blood pressure) [14](#page=14).
* **Graphical exploration:** Use tools like histograms or boxplots for single numerical variables, and scatterplots for relationships between two numerical variables, to visually identify errors [14](#page=14).
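Several of these checks translate directly into short pandas commands. A sketch with hypothetical data containing deliberate errors:

```python
import pandas as pd

df = pd.DataFrame({
    "sex": [1, 2, 2, 1, 7],         # 7 is a suspicious, undefined code
    "age": [34, 29, 300, 41, 56],   # 300 is an impossible age
    "sbp": [120, 135, 110, 140, 95],
    "dbp": [80, 90, 115, 85, 60],
})

print(df[["age", "sbp", "dbp"]].agg(["min", "max"]))  # extreme values
print(df["sex"].value_counts())                       # unexpected codes show up here
print(df.isna().sum())                                # missing values per column

# Consistency check: diastolic must not exceed systolic
print(df[df["dbp"] > df["sbp"]])                      # rows violating the rule
```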
#### 2.2.1 Dealing with missing data
Missing data can reduce statistical power and introduce bias. Several approaches exist [14](#page=14):
* **Do nothing:** Proceed with analysis, allowing the software to ignore missing values [14](#page=14).
* **List-wise deletion (complete case analysis):** Remove entire cases that have missing data. This is often applied to participants with substantial missing information or those who completed less than a certain percentage of a questionnaire [14](#page=14) [15](#page=15).
* **Last observation carried forward (LOCF):** In longitudinal studies, the last recorded value is used to fill subsequent missing data points [15](#page=15).
* **Mean imputation:** Replace missing values with the mean of the variable [15](#page=15).
* **Regression imputation:** Use a regression model based on available data to estimate missing values [15](#page=15).
> **Tip:** Always document the method used for handling missing data in your analysis.
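Two of the simpler approaches above, LOCF and mean imputation, look like this in pandas. This is a sketch with hypothetical data, not an endorsement of either method:

```python
import numpy as np
import pandas as pd

visits = pd.DataFrame({
    "week1": [120.0, 135.0, 110.0],
    "week2": [118.0, np.nan, 112.0],
    "week3": [np.nan, np.nan, 108.0],
})

# Last observation carried forward: fill along each row (axis=1)
locf = visits.ffill(axis=1)

# Mean imputation: replace missing values with each column's mean
mean_imputed = visits.fillna(visits.mean())

print(locf)
print(mean_imputed)
```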
#### 2.2.2 Summary of data entry recommendations
* Variables should be in columns, and cases in rows [15](#page=15).
* Each cell must contain a single data point [15](#page=15).
* Units of measurement must be consistent within each variable [15](#page=15).
* Use codes for categorical variables and missing values [15](#page=15).
* Always check data for potential errors [15](#page=15).
### 2.3 Descriptive statistics
Descriptive statistics are used to summarize and present data in a meaningful way, either numerically or graphically. They are fundamental in research for describing study subjects and in everyday life for reporting various metrics. The method of description depends on the type of variable [16](#page=16).
#### 2.3.1 Descriptive statistics for categorical variables
Categorical variables (e.g., sex, smoking status, disease severity) are described using:
* **Frequencies (numbers):** The count of individuals within each category [16](#page=16).
* **Relative frequencies (percentages):** The proportion of individuals in each category, calculated as (frequency / total number) \* 100 [16](#page=16).
> **Example:** If out of 200 participants, 120 are male and 80 are female:
> Males: 120 (60%)
> Females: 80 (40%)
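The same frequency and percentage summary can be computed with pandas' `value_counts`; a minimal sketch reproducing the example above:

```python
import pandas as pd

sex = pd.Series(["male"] * 120 + ["female"] * 80)

counts = sex.value_counts()                        # frequencies
percents = sex.value_counts(normalize=True) * 100  # relative frequencies

print(pd.DataFrame({"n": counts, "%": percents.round(1)}))
# male      120  60.0
# female     80  40.0
```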
#### 2.3.2 Descriptive statistics for numerical variables
Numerical variables are typically described using measures of central tendency (to represent the center of the data) and measures of dispersion (to represent the spread or variability of the data) [17](#page=17).
##### 2.3.2.1 Measures of central tendency
These statistics indicate the typical value in a dataset.
* **Mean:** The sum of all observed values divided by the number of observations. It is also known as the average or arithmetic mean [17](#page=17).
* Formula: $\bar{x} = \frac{\sum_{i=1}^{n} x_i}{n}$
* **Sensitivity to outliers:** The mean is highly influenced by extreme values [17](#page=17) [18](#page=18).
> **Example:** For ages 7, 5, 6, 8, 2, 9, 3, the mean is $\frac{7+5+6+8+2+9+3}{7} = \frac{40}{7} \approx 5.71$ years. If 64 is added, the mean becomes $\frac{104}{8} = 13$ years, showing a significant shift [17](#page=17) [18](#page=18).
* **Median:** The middle value in a dataset when the data is ordered from smallest to largest. Half of the data points are above the median, and half are below it [18](#page=18).
* **Calculation:**
1. Order the data from smallest to largest [18](#page=18).
2. If the number of observations ($n$) is odd, the median is the middle value.
3. If $n$ is even, the median is the average of the two middle values [19](#page=19).
* **Robustness to outliers:** The median is not significantly affected by extreme values [18](#page=18) [19](#page=19).
> **Example:** For ordered ages 2, 3, 5, 6, 7, 8, 9, the median is 6 years. With age 64 added (2, 3, 5, 6, 7, 8, 9, 64), the median is the average of the two middle values: $\frac{6+7}{2} = 6.5$ years [18](#page=18) [19](#page=19).
* **Mode:** The value that occurs most frequently in the dataset. It can be used for both numerical and categorical variables. A dataset can have one mode (unimodal), multiple modes (bimodal if two, multimodal if more), or no mode. The mode is less commonly used in scientific research for numerical data [20](#page=20).
| Measure | Advantages | Disadvantages |
| :------ | :--------------------------------------- | :-------------------------------------------- |
| Mean | Uses all data values, algebraically defined | Distorted by extreme/skewed data |
| Median | Not distorted by extreme/skewed data | Ignores most information, not algebraically defined |
| Mode | Easily determined for categorical data | Ignores most information, not algebraically defined |
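Python's standard `statistics` module computes all three measures. The sketch below reuses this chapter's ages example to show the mean's sensitivity to an outlier and the median's robustness:

```python
import statistics

ages = [7, 5, 6, 8, 2, 9, 3]
print(statistics.mean(ages))    # 40 / 7 = 5.714... years
print(statistics.median(ages))  # 6 years

ages_with_outlier = ages + [64]
print(statistics.mean(ages_with_outlier))    # 13.0 -- pulled up by the outlier
print(statistics.median(ages_with_outlier))  # 6.5  -- barely moves

print(statistics.mode(["A", "B", "A", "O"])) # 'A' -- mode works for categories too
```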
##### 2.3.2.2 The five-number summary and percentiles
The five-number summary divides ordered data into four quarters and consists of five key values:
* **Minimum:** The smallest value in the dataset [20](#page=20).
* **First Quartile (Q1):** The 25th percentile; 25% of the data falls below this value. It is the median of the lower half of the data [20](#page=20).
* **Median (Q2):** The 50th percentile; 50% of the data falls below this value [20](#page=20).
* **Third Quartile (Q3):** The 75th percentile; 75% of the data falls below this value. It is the median of the upper half of the data [20](#page=20).
* **Maximum:** The largest value in the dataset [20](#page=20).
> **Example:** For the data: 8, 10, 10, 10, 12, 14, 15, 15, 18, 23, 25, 27
> Minimum: 8
> Q1: 10
> Median: $\frac{14+15}{2} = 14.5$
> Q3: $\frac{18+23}{2} = 20.5$ (Note: the source document reports Q3 = 21.75 for this dataset, which matches the $(n+1)p$ position method used by some statistical software. Quartile conventions differ slightly between implementations; for exams, follow the principle that Q1 is the median of the lower half and Q3 the median of the upper half, or the specific definition your textbook provides.)
> Maximum: 27
> The five-number summary can be graphically represented by a boxplot [21](#page=21).
**Percentiles:** Data is divided into 100 equal parts. The $k$-th percentile is the value below which $k$% of observations lie. The 25th, 50th, and 75th percentiles correspond to Q1, the median, and Q3, respectively. Percentiles are useful for comparing scores (e.g., test performance) and defining normal ranges in medicine (e.g., 5th to 95th percentiles for growth charts) [22](#page=22).
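Quartile values depend on the interpolation convention, which is why the source reports Q3 = 21.75 where the "median of the upper half" rule gives 20.5. A NumPy sketch showing both conventions (the `method` argument assumes NumPy ≥ 1.22):

```python
import numpy as np

data = [8, 10, 10, 10, 12, 14, 15, 15, 18, 23, 25, 27]

print(min(data), np.median(data), max(data))           # 8, 14.5, 27

# Quartiles depend on the convention used:
print(np.percentile(data, [25, 75]))                   # default 'linear': [10.   19.25]
print(np.percentile(data, [25, 75], method="weibull")) # (n+1)p rule:     [10.   21.75]
```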
##### 2.3.2.3 Measures of dispersion
These statistics describe the spread or variability of data.
* **Range:** The difference between the maximum and minimum values in a dataset [23](#page=23).
* Formula: Range = Maximum value - Minimum value
* **Sensitivity to outliers:** The range is heavily affected by extreme values [23](#page=23).
* Sometimes reported as minimum and maximum values (e.g., range: 8, 27) instead of a single difference [23](#page=23).
> **Example:** For the ages 8, 10, 10, 10, 12, 14, 15, 15, 18, 23, 25, 27, the range is $27 - 8 = 19$ years [23](#page=23).
* **Inter-quartile range (IQR):** The difference between the third quartile (Q3) and the first quartile (Q1) [23](#page=23).
* Formula: $IQR = Q3 - Q1$
* Represents the spread of the middle 50% of the data [23](#page=23).
* **Robustness to outliers:** Not affected by extreme values as it doesn't use the minimum or maximum [23](#page=23).
> **Example:** For the data 8, 10, 10, 10, 12, 14, 15, 15, 18, 23, 25, 27, with Q1 = 10 and Q3 = 21.75 (the source's quartile values for this dataset; see the note in the five-number summary above): $IQR = 21.75 - 10 = 11.75$ [23](#page=23).
* **Variance ($s^2$):** A measure of spread that considers all data points. It represents the average squared distance of data points from the mean [24](#page=24).
* **Steps to calculate:**
1. Calculate the mean ($\bar{x}$) [24](#page=24).
2. Calculate the squared difference between each data point ($x_i$) and the mean: $(x_i - \bar{x})^2$ [24](#page=24).
3. Sum all these squared differences: $\sum (x_i - \bar{x})^2$ [24](#page=24).
4. Divide the sum by the number of observations minus 1 ($n-1$) for sample variance: $s^2 = \frac{\sum (x_i - \bar{x})^2}{n-1}$ [24](#page=24).
* **Units:** Variance is in squared units (e.g., meters squared if the data is in meters), making interpretation difficult [24](#page=24) [25](#page=25).
> **Example:** For ages 7, 5, 6, 8, 4, 9, 3 (mean = 6):
> Squared differences: $(7-6)^2=1, (5-6)^2=1, (6-6)^2=0, (8-6)^2=4, (4-6)^2=4, (9-6)^2=9, (3-6)^2=9$.
> Sum of squared differences = $1+1+0+4+4+9+9 = 28$.
> Variance $s^2 = \frac{28}{7-1} = \frac{28}{6} \approx 4.67$ years$^2$ [24](#page=24).
* **Standard deviation (s):** The square root of the variance. It is a measure of spread that represents the average distance of data values from their mean and has the same units as the original data [25](#page=25).
* Formula: $s = \sqrt{s^2}$
* A larger standard deviation indicates greater spread, while a smaller one indicates data points are clustered closely around the mean [25](#page=25).
> **Example:** For the previous ages, the standard deviation is $s = \sqrt{4.67} \approx 2.16$ years [25](#page=25).
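The same worked example with the standard `statistics` module, whose `variance` and `stdev` use the sample ($n-1$) denominator shown above:

```python
import statistics

ages = [7, 5, 6, 8, 4, 9, 3]

print(statistics.mean(ages))      # 6
print(statistics.variance(ages))  # 28 / 6 = 4.666...  (sample variance, n - 1)
print(statistics.stdev(ages))     # sqrt(28 / 6) = 2.160...  (same units as the data)
```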
> **Tip:** When describing numerical variables, always report a measure of central tendency along with a measure of dispersion.
#### 2.3.3 Combining measures for numerical variables
When summarizing a numerical variable, it's standard practice to present two statistics: one for central tendency and one for dispersion [26](#page=26).
* For **normally distributed data**, use the mean and standard deviation [26](#page=26).
* For **non-normally distributed data** (or when outliers are present), use the median and inter-quartile range (IQR) [26](#page=26) [33](#page=33).
> **Example of a descriptive statistics table:**
> | Baseline characteristic | Group A (n) | Group B (n) | Group C (n) |
> | :-------------------- | :------------ | :------------ | :------------ |
> | **Gender** | | | |
> | Female | 25 | 20 | 23 |
> | Male | 25 | 30 | 27 |
> | **Marital status** | | | |
> | Single | 13 | 11 | 17 |
> | Married | 35 | 38 | 28 |
> | Divorced or widowed | 1 | 1 | 4 |
> | **Age, mean (SD)** | 30.3 (12.4) | 29.4 (11.6) | 32.1 (11.9) |
#### 2.3.4 Coefficient of variation (CV)
The CV expresses the standard deviation as a proportion of the mean, multiplied by 100 [27](#page=27).
* Formula: $CV = \left(\frac{\text{Standard Deviation}}{\text{Mean}}\right) \times 100\%$
* It helps compare variability between different measures or datasets with different means, by controlling for the mean's influence [27](#page=27).
> **Example:**
> PHQ measure: Mean=7.5, SD=3.7. CV = (3.7 / 7.5) \* 100% = 49.3%
> GAD7 measure: Mean=6, SD=3.5. CV = (3.5 / 6) \* 100% = 58.3%
> GAD7 shows higher variability when the mean is considered [27](#page=27).
#### 2.3.5 Weighted mean
The weighted mean accounts for different weights or frequencies of observations when calculating an average. It is used when observations do not have equal importance or sample size [27](#page=27).
* **Calculation:** Sum of (value \* weight) divided by the sum of weights.
* Formula: $\bar{x}_w = \frac{\sum_{i=1}^{n} w_i x_i}{\sum_{i=1}^{n} w_i}$
* This is often used when combining averages from groups of different sizes [27](#page=27) [28](#page=28).
> **Example:** A student's final grade based on two assignments (15% each) and two quizzes (30% and 40%).
> Marks: Assignment 1=70, Assignment 2=85, Quiz 1=80, Quiz 2=90.
> Weights: 0.15, 0.15, 0.30, 0.40.
> Final Grade = (70 \* 0.15) + (85 \* 0.15) + (80 \* 0.30) + (90 \* 0.40)
> Final Grade = 10.5 + 12.75 + 24 + 36 = 83.25 [28](#page=28).
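Both calculations reduce to a line or two of code. A sketch reproducing the CV and final-grade examples above (`np.average` handles the weighting):

```python
import numpy as np

# Coefficient of variation: SD as a percentage of the mean
print(3.7 / 7.5 * 100)  # 49.3  (PHQ)
print(3.5 / 6.0 * 100)  # 58.3  (GAD7: higher relative variability)

# Weighted mean: final grade from two assignments and two quizzes
marks = [70, 85, 80, 90]
weights = [0.15, 0.15, 0.30, 0.40]
print(np.average(marks, weights=weights))  # 83.25
```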
#### 2.3.6 Understanding the normal distribution
The normal distribution, or Gaussian distribution, is a common probability distribution characterized by a symmetrical, bell-shaped curve [29](#page=29).
* **Characteristics:**
* Symmetric around the mean [29](#page=29).
* Mean, median, and mode are approximately equal [29](#page=29).
* Denser in the center and less dense in the tails [29](#page=29).
* 50% of values are below the mean, and 50% are above [29](#page=29).
* Defined by its mean ($\mu$) and standard deviation ($\sigma$) [29](#page=29).
* **Empirical Rule (68-95-99.7 rule):**
* Approximately 68% of data falls within one standard deviation of the mean ($\mu \pm 1\sigma$) [29](#page=29).
* Approximately 95% falls within two standard deviations ($\mu \pm 2\sigma$) [29](#page=29).
* Approximately 99.7% falls within three standard deviations ($\mu \pm 3\sigma$) [29](#page=29).
> **Tip:** Normal distributions are common in biological measurements like height and blood pressure [29](#page=29).
**Standard Normal Distribution:** A normal distribution with a mean ($\mu$) of 0 and a standard deviation ($\sigma$) of 1 is called the standard normal distribution. Any normal distribution can be converted to a standard normal distribution using the z-score formula [31](#page=31):
* Formula: $z = \frac{x - \mu}{\sigma}$
* Where $z$ is the standardized score, $x$ is the original value, $\mu$ is the mean, and $\sigma$ is the standard deviation [31](#page=31).
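The empirical rule and the z-score transformation can be checked numerically with SciPy's `scipy.stats.norm`. A sketch; the height numbers in the z-score example are illustrative assumptions:

```python
from scipy.stats import norm

# Empirical rule: probability mass within k standard deviations of the mean
for k in (1, 2, 3):
    print(k, norm.cdf(k) - norm.cdf(-k))  # ~0.683, ~0.954, ~0.997

# z-score for x = 190 with mu = 175 and sigma = 10 (illustrative numbers)
z = (190 - 175) / 10
print(z, norm.cdf(z))  # z = 1.5; about 93.3% of values lie below x
```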
#### 2.3.7 Non-normally distributed data
Data can be **skewed**, meaning it has a long tail on one side:
* **Positive skew (skewed to the right):** The tail is on the right side. The mean is typically pulled towards the tail (higher than the median) [33](#page=33).
* **Negative skew (skewed to the left):** The tail is on the left side. The mean is typically pulled towards the tail (lower than the median) [33](#page=33).
> **Note:** For non-normally distributed data, the median and IQR are preferred descriptive statistics over the mean and standard deviation due to the influence of extreme values in skewed distributions [33](#page=33).
#### 2.3.8 Summary of descriptive statistics
* **Categorical variables:** Use frequencies and percentages [33](#page=33).
* **Numerical variables:** Use one measure of central tendency and one measure of dispersion [33](#page=33).
* **Normally distributed data:** Mean and standard deviation [33](#page=33).
* **Non-normally distributed data:** Median and IQR [33](#page=33).
* Be aware that mean and standard deviation are sensitive to extreme values [33](#page=33).
---
# Data presentation and hypothesis testing
This section outlines effective methods for presenting data using tables and graphs and introduces the fundamental principles of hypothesis testing [34](#page=34).
### 3.1 Tabular presentation of data
Tables are crucial for presenting data in a clear and understandable manner. The method of presentation depends on the type of variable [34](#page=34).
#### 3.1.1 Nominal variables
Nominal variables, which lack an inherent order, can be presented using frequencies (counts) or relative frequencies (percentages) [34](#page=34).
* **Frequencies:** This involves listing the number of individuals in each category. Categories can be arranged alphabetically or by frequency for better readability [34](#page=34).
* **Relative frequencies (percentages):** Calculated by dividing the frequency of a category by the total number of individuals and multiplying by 100. This provides a more intuitive understanding of proportions [34](#page=34).
The formula for relative frequency is:
$$ \text{Relative frequency} = \frac{\text{Frequency of category}}{\text{Total frequency}} \times 100 $$
> **Example:** For Saudi nationals with a frequency of 55 out of 180 participants, the relative frequency is $\frac{55}{180} \times 100 \approx 30.6\%$ [34](#page=34) [35](#page=35).
#### 3.1.2 Ordinal variables
Ordinal variables have a natural order, which must be preserved in tabular presentations [35](#page=35).
* **Frequencies and relative frequencies:** Similar to nominal variables, ordinal data can be presented as counts or percentages [35](#page=35).
* **Cumulative relative frequencies:** This method leverages the ordered nature of ordinal variables. The cumulative relative frequency at a given level is the sum of its relative frequency and all preceding relative frequencies [36](#page=36).
> **Example:** If the cumulative relative frequency for "Satisfied" is 70.0%, it means 70.0% of individuals are either "Very satisfied" or "Satisfied" [36](#page=36).
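A pandas sketch of cumulative relative frequencies for an ordered variable; the satisfaction counts are hypothetical but chosen to reproduce the 70% figure above:

```python
import pandas as pd

levels = ["Very satisfied", "Satisfied", "Neutral", "Not satisfied"]
responses = pd.Series(
    pd.Categorical(["Very satisfied"] * 40 + ["Satisfied"] * 30
                   + ["Neutral"] * 20 + ["Not satisfied"] * 10,
                   categories=levels, ordered=True)
)

rel = responses.value_counts(normalize=True, sort=False) * 100  # relative frequencies
print(rel.cumsum())  # cumulative: 40.0, 70.0, 90.0, 100.0
```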
#### 3.1.3 Numerical discrete variables
If a numerical discrete variable has a limited number of levels, it can be presented using frequencies, relative frequencies, and cumulative relative frequencies, much like ordinal variables [36](#page=36).
> **Example:** The number of children in a family can be presented with frequencies and cumulative relative frequencies, showing that 74.6% of families have two children or fewer [36](#page=36).
#### 3.1.4 Numerical continuous variables
For continuous numerical variables, grouping into intervals of equal width is necessary to create meaningful tables. Frequencies, relative frequencies, and cumulative relative frequencies are then calculated for these groups [37](#page=37).
* **Grouped data:** Continuous data like birth weight can be grouped into ranges (e.g., 2000-2499 grams) [37](#page=37).
* **Open-ended groups:** Sometimes, the first or last groups are defined as "less than a specific value" or "greater than a specific value" to handle data at the extremes [37](#page=37).
#### 3.1.5 Two categorical variables
Presenting two categorical variables together is achieved using a two-way table, also known as cross-tabulation [38](#page=38).
* **Cross-tabulation:** This table displays the frequencies of the joint occurrence of categories from two variables. It allows for the calculation of marginal totals (row and column sums) and cell counts [38](#page=38).
* **Percentages:** Tables can be enhanced by including percentages calculated by row or by column, providing insights into conditional relationships between the variables [38](#page=38) [39](#page=39).
> **Example:** A two-way table showing disease status by sex can reveal the proportion of males within the diseased group or the proportion of diseased individuals among males [38](#page=38).
#### 3.1.6 Three categorical variables
Three categorical variables can be presented in a three-way table. The arrangement of variables can be altered to highlight specific relationships [40](#page=40).
### 3.2 Graphical presentation of data
Appropriate graphs enhance data understanding and clarity [41](#page=41).
#### 3.2.1 Nominal variables
* **Pie chart:** Represents the whole as a circle divided into sectors, where the area of each sector corresponds to the frequency of a category [41](#page=41).
> **Tip:** Pie charts are less common in scientific papers due to limitations, especially with binary variables or many categories [41](#page=41).
* **Bar graph:** A versatile graph for categorical variables, which can be vertical or horizontal, displaying frequencies or percentages. Categories can be ordered by frequency for better visual appeal [41](#page=41) [42](#page=42).
#### 3.2.2 Ordinal variables
* **Pie chart:** Can be used for ordinal variables, similar to nominal ones [43](#page=43).
* **Bar graph:** Often the preferred method for ordinal variables, maintaining the natural order of categories [43](#page=43).
* **Stacked bar plot:** Useful for Likert scale data, allowing comparison of opinions across different groups or questions [44](#page=44).
#### 3.2.3 Two categorical variables
* **Bar plot:** Side-by-side or segmented (stacked) bar plots are suitable for visualizing the relationship between two categorical variables [45](#page=45).
#### 3.2.4 Numerical variables
* **Histogram:** Similar to a bar chart but with no gaps between bars, indicating a continuous variable. Each bar represents a range of values, and its height reflects the frequency within that range [46](#page=46).
* **Box plot (Box and whisker plot):** Summarizes numerical data using the five-number summary: minimum, first quartile (Q1), median, third quartile (Q3), and maximum [46](#page=46).
* The box displays the interquartile range (IQR = Q3 - Q1), with the median as a line inside [46](#page=46).
* Whiskers extend to the minimum and maximum values within 1.5 times the IQR from the quartiles [47](#page=47).
* Outliers are data points falling outside this range [47](#page=47).
* **Side-by-side box plots** are used to compare the distribution of a numerical variable across different groups [47](#page=47).
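A matplotlib sketch of the two standard displays for a single numerical variable, reusing the twelve-value dataset from the descriptive statistics section (`whis=1.5` is matplotlib's default whisker rule):

```python
import matplotlib.pyplot as plt

data = [8, 10, 10, 10, 12, 14, 15, 15, 18, 23, 25, 27]

fig, (ax1, ax2) = plt.subplots(1, 2, figsize=(8, 3))
ax1.hist(data, bins=5)        # histogram: bars touch, x-axis is continuous
ax1.set_title("Histogram")
ax2.boxplot(data, whis=1.5)   # box = IQR with median line; whiskers within 1.5 * IQR
ax2.set_title("Box plot")
plt.show()
```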
#### 3.2.5 Two numerical variables
* **Scatter plot:** Used to visualize the relationship between two numerical or ordinal variables. Each point represents an individual case, plotted against the values of the two variables on the horizontal and vertical axes [48](#page=48).
#### 3.2.6 Summary of graph selection
* One categorical variable: Bar chart or pie chart [48](#page=48).
* One numerical variable: Histogram or box plot [48](#page=48).
* Two categorical variables: Side-by-side or stacked bar charts [48](#page=48).
* Two numerical variables: Scatter plot [48](#page=48).
* One numerical and one categorical variable: Side-by-side box plot [48](#page=48).
### 3.3 Hypothesis testing
Hypothesis testing is a statistical method used to make decisions about a research question based on data [49](#page=49).
#### 3.3.1 The research question
A research question should be specific, answerable, novel, and relevant to medical knowledge [49](#page=49).
#### 3.3.2 Steps for hypothesis testing
1. Define the null and alternative hypotheses [49](#page=49).
2. Choose the level of significance [49](#page=49).
3. Select an appropriate statistical test and compute the test statistic [49](#page=49).
4. Compute the p-value [49](#page=49).
5. Compare the p-value to the level of significance to decide whether to reject the null hypothesis [49](#page=49).
6. Draw a conclusion [49](#page=49).
#### 3.3.3 The null and alternative hypotheses
* **Null hypothesis ($H_0$)**: Represents the currently accepted belief or idea, stating that there is no difference, no association, or nothing is happening. The researcher may doubt its truth [50](#page=50).
* **Alternative hypothesis ($H_1$ or $H_a$)**: Represents the researcher's idea, suggesting that something is happening, there is a difference, or there is an association. The researcher believes this to be true and aims to prove it [50](#page=50).
These hypotheses are mutually exclusive; only one can be true [50](#page=50).
> **Example:**
> * **Research Question:** Is there a difference in exam scores between males and females?
> * $H_0$: Mean score of males = Mean score of females (or Mean difference = 0) [50](#page=50).
> * $H_1$: Mean score of males ≠ Mean score of females (or Mean difference ≠ 0) [50](#page=50).
After data analysis, a decision is made regarding the null hypothesis: either to "fail to reject" it (implying no sufficient evidence for the alternative) or to "reject" it (implying support for the alternative hypothesis) [51](#page=51).
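Carrying the exam-score example into code: a two-sided independent-samples t-test with SciPy, one common test for comparing two means. The scores are simulated, so the resulting p-value is illustrative only:

```python
import numpy as np
from scipy.stats import ttest_ind

rng = np.random.default_rng(0)
males = rng.normal(loc=75, scale=10, size=50)    # simulated exam scores
females = rng.normal(loc=78, scale=10, size=50)

t_stat, p_value = ttest_ind(males, females)      # H0: equal means (two-sided)

alpha = 0.05
if p_value < alpha:
    print(f"p = {p_value:.3f} < {alpha}: reject H0")
else:
    print(f"p = {p_value:.3f} >= {alpha}: fail to reject H0")
```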
#### 3.3.4 One-tailed and two-tailed tests
The type of test depends on the alternative hypothesis [51](#page=51).
* **Two-tailed tests:** The alternative hypothesis allows for a difference in either direction (e.g., drug A is not equal to drug B). The rejection region is split between both tails of the distribution [51](#page=51) [52](#page=52).
* Example $H_1$: drug A $\neq$ drug B [51](#page=51).
* **One-tailed tests:** The alternative hypothesis specifies a particular direction for the difference (e.g., drug A is better than drug B). The rejection region is located in only one tail of the distribution [51](#page=51) [52](#page=52).
* Example $H_1$: drug A $>$ drug B [51](#page=51).
Two-tailed tests are generally preferred unless there is a strong a priori justification for a one-tailed test [52](#page=52).
#### 3.3.5 Type I and Type II errors
Errors can occur during hypothesis testing [54](#page=54).
* **Type I error (False positive, $\alpha$)**: Rejecting a true null hypothesis. This means concluding a difference or effect exists when it does not [54](#page=54) [55](#page=55).
* The probability of a Type I error is the level of significance, typically set at $\alpha = 0.05$ (5%) or more conservatively at $\alpha = 0.01$ (1%) [55](#page=55) [56](#page=56).
* **Type II error (False negative, $\beta$)**: Failing to reject a false null hypothesis. This means concluding no difference or effect exists when one actually does [55](#page=55) [56](#page=56).
* The probability of a Type II error is typically set around $\beta = 0.2$ (20%) [56](#page=56).
> **Tip:** Type I errors are generally considered more serious as they lead to false positive conclusions, potentially misinterpreting drug effectiveness or risk factors [55](#page=55).
The probabilities of Type I and Type II errors are inversely related [56](#page=56).
#### 3.3.6 Level of significance
The **level of significance ($\alpha$)** is the maximum acceptable probability of committing a Type I error. A smaller $\alpha$ reduces the risk of a Type I error but increases the risk of a Type II error. The choice of $\alpha$ depends on the consequences of making a Type I error. Common values are 0.05 and 0.01 [56](#page=56).
---
# P-values, confidence intervals, and epidemiological measures
This section delves into the interpretation of p-values for statistical significance, explores the concept and application of confidence intervals, and introduces key epidemiological measures like incidence and prevalence [57](#page=57).
### 4.1 P-values and statistical significance
The p-value, standing for probability, quantifies the likelihood of observing the obtained results, or more extreme results, if the null hypothesis were true. It is a measure of the strength of evidence against the null hypothesis. A p-value is always between 0 and 1 [58](#page=58).
A commonly used significance level, denoted as $\alpha$ (alpha), is 0.05 or 5% [58](#page=58).
* **Decision Rule:**
* If the observed p-value is less than $\alpha$ ($p < \alpha$), the null hypothesis is rejected, indicating statistical significance [58](#page=58).
* If the observed p-value is greater than or equal to $\alpha$ ($p \ge \alpha$), the null hypothesis is not rejected, indicating a lack of statistical significance [58](#page=58).
#### 4.1.1 Interpreting p-values
* A statistically significant result ($p < 0.05$) suggests that the observed data is unlikely to have occurred by chance alone if the null hypothesis were true. This provides evidence against the null hypothesis [60](#page=60).
* A non-statistically significant result ($p \ge 0.05$) indicates that the observed data is consistent with the null hypothesis and could plausibly have occurred by chance. This does not prove the null hypothesis true; it only means there is insufficient evidence to reject it [60](#page=60).
#### 4.1.2 Reporting p-values
It is crucial to report the actual p-value rather than simply stating "P<0.05" or "P≥0.05". If a statistical program outputs a very small p-value, such as 0.000, it should be reported as $p < 0.001$. Similarly, if the p-value is very close to 1, it should be reported as $p > 0.999$ [59](#page=59).
#### 4.1.3 Clinical significance versus statistical significance
While statistical significance indicates whether an observed effect is likely due to chance, clinical significance relates to the practical importance of the effect in a real-world context [61](#page=61).
* A very large sample size can lead to statistically significant results even for small differences that lack clinical importance [61](#page=61).
* Conversely, a small sample size might result in a non-statistically significant finding (due to low study power) even if the observed difference is clinically important [61](#page=61).
> **Tip:** Always consider both statistical and clinical significance when interpreting study findings.
### 4.2 Confidence intervals
A confidence interval (CI) is a range of values that is likely to contain the true population parameter. It provides a measure of uncertainty around an estimate derived from a sample [63](#page=63) [72](#page=72).
#### 4.2.1 Interpretation of confidence intervals
* **Common interpretation (less precise):** For a 95% CI, it means that we are 95% confident that the true population parameter lies within the calculated range [64](#page=64).
* **Scientifically precise interpretation:** If the same study procedure were repeated an infinite number of times, 95% of the constructed confidence intervals would contain the true population parameter. In practice, a study is usually conducted only once [64](#page=64).
#### 4.2.2 Factors affecting confidence intervals
* **Sample size:** A larger sample size leads to a narrower confidence interval, indicating a more precise estimate. A smaller sample size results in a wider confidence interval [65](#page=65).
* **Confidence level:** A higher confidence level (e.g., 99% compared to 95%) results in a wider confidence interval, reflecting a greater certainty that the true parameter is captured. A lower confidence level leads to a narrower interval but with less certainty [66](#page=66).
* **Variability in the data (Standard Error):** Higher variability, indicated by a larger standard error (SE), results in a wider confidence interval. A smaller standard error leads to a narrower interval [72](#page=72).
#### 4.2.3 Confidence intervals for different parameters
* **Single mean:** The 95% CI for a mean is calculated as:
$$ \text{Sample mean} \pm (1.96 \times \text{SE}) $$ [72](#page=72).
where SE is the standard error of the mean [72](#page=72).
* **Proportion:** A CI can also be calculated for a proportion, indicating a range for the true population proportion. For example, a 95% CI for smoking prevalence might be 10% to 14% [66](#page=66).
* **Difference between two means:** If the CI for the difference between two means includes zero, it suggests there is no statistically significant difference between the population means. If the CI does not include zero, the difference is considered statistically significant [67](#page=67).
* **Ratios (Risk Ratios - RR, Odds Ratios - OR):** For ratios like RR and OR, the confidence interval is interpreted by checking if it contains one. If the CI contains 1, there is no significant difference in risk or odds between the groups. If it does not contain 1, the difference is significant [69](#page=69).
> **Tip:** It is generally recommended to report the confidence interval alongside the p-value for a more complete interpretation of findings [73](#page=73).
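A sketch of the single-mean formula above computed from raw data; the blood pressure values are hypothetical, and SE is estimated as $s/\sqrt{n}$:

```python
import math
import statistics

data = [120, 135, 110, 140, 95, 128, 132, 118, 125, 130]  # hypothetical SBP readings

mean = statistics.mean(data)
se = statistics.stdev(data) / math.sqrt(len(data))  # standard error of the mean

lower, upper = mean - 1.96 * se, mean + 1.96 * se
print(f"95% CI for the mean: {lower:.1f} to {upper:.1f}")

# For a difference between two means, check whether the CI contains 0;
# for a risk ratio or odds ratio, check whether it contains 1.
```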
### 4.3 Epidemiological measures: Incidence and prevalence
Incidence and prevalence are fundamental epidemiological measures used to describe the occurrence of diseases and health conditions in populations [74](#page=74).
#### 4.3.1 Incidence
Incidence refers to the occurrence of *new* cases of a disease or health condition within a specific population during a defined period [74](#page=74).
* **Incidence proportion (Cumulative Incidence):** This is the proportion of a disease-free population at the start of a period that develops the disease during that period. It is also known as risk [74](#page=74).
$$ \text{Incidence proportion} = \frac{\text{number of new cases during a specific period}}{\text{population free of disease at the beginning of the period}} $$ [74](#page=74).
It ranges from 0 to 1 (0% to 100%) and is unitless [74](#page=74).
* **Incidence rate (Person-time rate):** This is used when individuals have different follow-up times. It accounts for the total time contributed by all individuals in the population at risk.
$$ \text{Incidence rate} = \frac{\text{number of new cases during a specific period}}{\text{sum of follow-up times for all persons}} $$ [74](#page=74).
The denominator is typically expressed in person-time units, such as person-years [74](#page=74).
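Worked numerically with a small hypothetical cohort:

```python
# Hypothetical cohort: 200 disease-free people followed for up to 2 years
new_cases = 18
population_at_risk = 200
incidence_proportion = new_cases / population_at_risk  # 0.09 -> 9% risk

# Incidence rate: follow-up differs per person, so sum person-time
person_years = 160 * 2 + 40 * 1   # 160 completed 2 years; 40 left after 1 year
incidence_rate = new_cases / person_years              # 0.05 cases per person-year
print(incidence_proportion, incidence_rate)
```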
#### 4.3.2 Prevalence
Prevalence measures the proportion of individuals in a population who have a specific disease or health condition at a *single point in time* [76](#page=76).
* **Point prevalence:**
$$ \text{Prevalence} = \frac{\text{number of new and existing cases at a specific point in time}}{\text{size of the population at that time point}} $$ [76](#page=76).
Prevalence is a proportion, ranges from 0 to 1 (0% to 100%), and is unitless [76](#page=76).
* **Period prevalence:** This is the proportion of a population that has a disease at any time during an observation period, including pre-existing cases that persist into the period [76](#page=76).
#### 4.3.3 Relationship between incidence and prevalence
Prevalence is influenced by both incidence and the duration of the disease [77](#page=77).
* **Higher incidence** leads to **higher prevalence**.
* **Longer disease duration** (due to slower cure or lower mortality) leads to **higher prevalence**.
* **Faster cure** or **higher mortality** leads to **lower prevalence**.
> **Tip:** Incidence measures the rate of *new* cases, while prevalence measures the *burden* of disease (new and existing cases) at a given time.
**Summary Table:**
| Measure | Type | Range | Numerator | Denominator | Unit |
| :---------------------- | :---------- | :---------------------------------- | :------------------------- | :----------------------- | :-------- |
| Incidence proportion | Proportion | 0-1 (0-100%) | New cases | Population at risk | Unitless |
| Incidence rate | Rate | 0 - $\infty$ | New cases | Person-time at risk | 1/Time |
| Prevalence (Point) | Proportion | 0-1 (0-100%) | New and existing cases | Total population | Unitless |
| Prevalence (Period) | Proportion | 0-1 (0-100%) | Cases during the period | Total population | Unitless |
---
## Common mistakes to avoid
- Skipping topics during review instead of covering them all before exams
- Overlooking formulas and key definitions
- Skipping the worked examples provided in each section
- Memorizing without understanding the underlying concepts
## Glossary
| Term | Definition |
|------|------------|
| Statistics | The science concerned with developing and studying methods for collecting, analyzing, interpreting, and presenting data. |
| Biostatistics | The application of statistical principles in the fields of medicine, public health, and biology. |
| Data variable | Something that varies or differs from person to person or group to group; these are the items for which data is collected. |
| Categorical variables | Variables that are qualitative in nature and can be classified into categories; they do not have a unit of measurement. |
| Nominal variables | Categorical variables that have no intrinsic order; categories can be arranged in any sequence. |
| Ordinal variables | Categorical variables that have an order, and this order has a meaningful interpretation. |
| Numerical variables | Variables that are measured or counted and are presented in numbers; they have a measurement unit. |
| Discrete variables | Numerical variables that can only take integer values (no decimals) and usually represent a count of something. |
| Continuous variables | Numerical variables that can take any real numerical value, including decimals, and involve measurement. |
| Dichotomous variable (Binomial variable) | A categorical variable with only two categories, such as sex (male/female) or disease status (diseased/not diseased). |
| Explanatory variable | A variable that is thought to affect or predict another variable; also known as an independent or predictor variable. |
| Response variable | A variable that is affected by or depends on another variable; also known as a dependent or outcome variable. |
| Data entry | The process of preparing collected data into a suitable computer file, typically arranged in a spreadsheet format. |
| Missing data | Values that are absent for a variable in a dataset, which can occur for various reasons and require specific handling. |
| Descriptive statistics | Statistical methods used to numerically describe and summarize data, including measures of central tendency and dispersion. |
| Frequencies | The number of times each category or value appears in a dataset. |
| Relative frequencies (Percentages) | The proportion of individuals in each category, expressed as a percentage of the total. |
| Measures of central tendency | Statistics that describe the center or typical value of a dataset, such as the mean, median, and mode. |
| Mean | The sum of all observed values divided by the number of observations; also known as the average. |
| Median | The middle value in a dataset when arranged in order; half of the data points are above it, and half are below it. |
| Mode | The value that occurs most frequently in a dataset. |
| Measures of dispersion | Statistics that describe the spread or variability of data, such as range, inter-quartile range, variance, and standard deviation. |
| Range | The difference between the largest and smallest values in a dataset. |
| Inter-quartile range (IQR) | The difference between the third quartile (Q3) and the first quartile (Q1), representing the middle 50% of the data. |
| Variance ($s^2$) | A measure of spread that represents the average of the squared differences from the mean; it is in square units. |
| Standard deviation (s) | The square root of the variance, representing the average distance of data values from their mean; it has the same units as the data. |
| Coefficient of variation (CV) | A measure that expresses the standard deviation as a proportion of the mean, used to compare variability across different scales. |
| Weighted mean | A type of average where each data point contributes differently to the final average, based on assigned weights. |
| Normal distribution | A symmetrical probability distribution characterized by a bell shape, where the mean, median, and mode are equal, and data clusters around the mean. |
| Standard normal distribution | A normal distribution with a mean of 0 and a standard deviation of 1. |
| Z-score | A standardized score that indicates the number of standard deviations a data point is from the mean; calculated as $z = (x - \mu) / \sigma$. |
| Skewed data | Data that is not symmetrically distributed around the mean; it has a long tail on one side. |
| Positive skew (Right skew) | A distribution where the long tail is on the right side, meaning there are more high values or outliers. |
| Negative skew (Left skew) | A distribution where the long tail is on the left side, meaning there are more low values or outliers. |
| Cross-tabulation (Two-way table) | A table that displays the frequencies or percentages of two categorical variables simultaneously, showing their relationship. |
| Histogram | A graphical representation of the distribution of numerical data, where bars represent the frequency of data within specified intervals. |
| Box plot (Box and whisker plot) | A graphical display that summarizes numerical data using the five-number summary (minimum, Q1, median, Q3, maximum) and highlights outliers. |
| Scatter plot | A graphical representation used to display the relationship between two numerical variables, with each point representing a case. |
| Hypothesis testing | A statistical method used to make decisions about a population based on sample data, involving formulating and testing hypotheses. |
| Null hypothesis ($H_0$) | A statement of no effect or no difference, which the researcher aims to disprove. |
| Alternative hypothesis ($H_1$ or $H_a$) | A statement that contradicts the null hypothesis, representing the researcher's claim or idea. |
| One-tailed test | A statistical test where the alternative hypothesis specifies a direction (greater than or less than). |
| Two-tailed test | A statistical test where the alternative hypothesis does not specify a direction (not equal to). |
| Type I error (False positive) | The error of rejecting a true null hypothesis. The probability of this error is denoted by $\alpha$. |
| Type II error (False negative) | The error of failing to reject a false null hypothesis. The probability of this error is denoted by $\beta$. |
| Level of significance ($\alpha$) | The maximum allowed probability of committing a Type I error. |
| Power of a test | The probability of correctly rejecting a false null hypothesis ($1-\beta$). |
| P-value | The probability of obtaining observed results, or more extreme results, if the null hypothesis were true. Used to decide whether to reject the null hypothesis. |
| Statistical significance | A result is considered statistically significant if the p-value is less than the chosen level of significance (typically $\alpha = 0.05$). |
| Clinical significance | Whether a statistically significant finding has practical importance or relevance in a clinical setting. |
| Confidence Interval (CI) | A range of values, calculated from sample data, that is likely to contain the true population parameter with a certain level of confidence. |
| Standard error (SE) | The standard deviation of the sampling distribution of a statistic, typically the standard deviation of sample means. |
| Incidence | The occurrence of new cases of a disease or health condition in a population over a specific period. |
| Incidence proportion (Cumulative incidence) | The proportion of a population at risk that develops a disease during a specific period. It is equivalent to risk. |
| Incidence rate (Person-time rate) | The rate at which new cases occur over a period of time, taking into account varying follow-up times for individuals, expressed per person-time. |
| Prevalence | The percentage of people in a population who have a disease or health condition at a specific point in time (point prevalence) or during an observation period (period prevalence). |
| Odds Ratio (OR) | A measure of association between an exposure and an outcome, calculated as the ratio of the odds of the outcome in the exposed group to the odds of the outcome in the unexposed group. |
| Risk Ratio (RR) | A measure of association between an exposure and an outcome, calculated as the ratio of the risk of the outcome in the exposed group to the risk of the outcome in the unexposed group. |