As discussed in the previous papers, data collected in clinical research are used to validate a clinical statement (support or refute a hypothesis). To proceed with hypothesis testing, we first need to understand the nature of the data we are working with. These data can be paired or unpaired.

Here is an example to illustrate these two terms. Imagine we are studying patients with knee pain who are being treated with a new injection procedure. We term them the case group. To help a clinician understand the effectiveness of this new procedure, we need to compare visual analog scale (VAS) scores for pain reported by the patients before and after the procedure. We also need to compare these data with those of an existing treatment to understand whether the new treatment has a better, worse, or similar effect on patient pain outcomes. For this, we need another group of patients receiving the existing treatment. We term them the control group. The data we get from the case group before and after the procedure are called paired data. When we compare data from the control group with those from the case group, the data are called unpaired.

In the prior paper, we discussed testing independent samples, that is, unpaired data. Now, we will consider testing paired samples. Paired samples are related (e.g., before and after measurements). Because each subject acts as their own control, influences from factors like age or sex are minimized, improving statistical precision. As discussed earlier, the null hypothesis assumes that there is no difference between the two groups under comparison; in this case, the difference between the paired samples is 0. The alternate hypothesis assumes that the difference between the paired samples is not 0.
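In symbols, the two hypotheses for paired data can be written compactly. The notation here is an assumption introduced for illustration and is not taken from the earlier papers in the series: $d_i$ is the within-patient difference for patient $i$, and $\mu_d$ is the population mean of those differences.

```latex
d_i = x_{i,\text{before}} - x_{i,\text{after}}, \qquad
H_0:\ \mu_d = 0, \qquad
H_1:\ \mu_d \neq 0
```

For the nonparametric tests, the same statement is usually made about the median of the differences rather than the mean.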
Scenarios of paired data:
Blood pressure in both arms of the same patient measured during the same visit.
Crossover trials where the same patient receives both a placebo and an active drug at different time points in the same study.
Pre- and post-study: comparison of outcomes before and after an intervention in the same patient.

Once we determine whether we are working with paired or unpaired data, the next step is to check for the normality of data distribution. Data distribution, variance, and statistical tests used for unpaired data (also called independent samples) were discussed in a previous paper in this series.1-3 In this review, we focus on statistical tests used to analyze paired data. With parametric data, we use the paired t test, and with nonparametric data, we can use the Wilcoxon signed-rank test or the sign test. We will delve into each of these tests in the following discussion.

Paired t Test

The paired-sample t test is used to analyze paired data. Observations are collected in pairs and are not independent. The test determines whether the mean difference between paired observations is significantly different from 0. The paired t test assumes that the differences between paired observations are normally distributed (along a bell-shaped curve) and that the sample size is large enough (usually considered more than 30).

The steps to obtain a P value are similar to those for independent samples: calculating the t-statistic and the degrees of freedom. The first step is to calculate the difference between the paired observations for each subject or unit. Then, the mean and standard deviation of these differences are computed, and a t-statistic is determined from them. Statistical software can do this for us. Finally, the t-statistic and degrees of freedom are compared with a critical value from the t-distribution to determine statistical significance. Again, statistical software can be used to perform this step.
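The steps above can be sketched in a few lines of Python. This is a minimal illustration and not the software used for the analyses in this paper. The VAS scores are invented for the example, the function name `paired_t_statistic` is hypothetical, and the critical value 2.262 (two-sided, significance level 0.05, 9 degrees of freedom) is taken from a standard t table. The statistic is computed as the mean of the differences divided by their standard error, with n - 1 degrees of freedom.

```python
import math
import statistics


def paired_t_statistic(before, after):
    """Return (t, degrees_of_freedom, mean_diff, sd_diff) for paired samples."""
    if len(before) != len(after):
        raise ValueError("paired samples must have the same length")
    # Step 1: one difference per subject (before minus after).
    diffs = [b - a for b, a in zip(before, after)]
    n = len(diffs)
    # Step 2: mean and sample standard deviation of the differences.
    mean_diff = statistics.mean(diffs)
    sd_diff = statistics.stdev(diffs)  # uses the n - 1 denominator
    # Step 3: t-statistic = mean difference / standard error of the mean.
    t = mean_diff / (sd_diff / math.sqrt(n))
    return t, n - 1, mean_diff, sd_diff


# Hypothetical VAS pain scores for 10 patients (illustration only).
vas_before = [7, 8, 6, 9, 7, 8, 6, 7, 9, 8]
vas_after = [5, 6, 5, 6, 6, 5, 4, 6, 7, 5]

t, df, mean_diff, sd_diff = paired_t_statistic(vas_before, vas_after)

# Assumed critical value from a standard t table: two-sided, alpha = 0.05, df = 9.
T_CRITICAL_DF9 = 2.262
reject_null = abs(t) > T_CRITICAL_DF9

print(f"mean difference = {mean_diff:.2f}, SD = {sd_diff:.3f}")
print(f"t = {t:.3f} on {df} degrees of freedom, reject H0: {reject_null}")
```

In practice, a statistics package would report the exact P value directly instead of relying on a tabulated critical value; the hand calculation is shown only to make each step visible.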
The output indicates whether there is a significant difference between the paired observations. If the calculated t-statistic falls within the critical region, the null hypothesis is rejected, indicating that there is a significant difference between the paired means.

The paired-sample t test can also be used to construct confidence intervals for the true mean difference between paired observations. Confidence intervals provide a range of values within which the true population parameter is likely to lie, with a specified level of confidence. Why do we need these in medicine? A measurement such as blood pressure is never exactly the same as it was an hour or even 10 min earlier, so a single estimate is more informative when it is reported together with a range of plausible values. Confidence intervals also tell us how wide the variation in the groups is.

The limitation of the paired t test is that it relies on the assumption that the differences between paired observations follow a normal distribution, which might not always be true. Also, it might not work well with small sample sizes or when the paired differences do not follow a normal distribution (for example, when they are strongly skewed).

For example, consider our mammal sleep dataset, which presents 83 paired observations corresponding to the NREM sleep time and REM sleep time of mammals. The null hypothesis of interest is that there is no difference between the NREM sleep time and REM sleep time of the mammals, and the alternate hypothesis is that there is a difference between the NREM sleep time and REM sleep time of the mammals. The paired t test is performed for the paired data. The P value of the test is 2.2 × 10⁻¹⁶, which is less than the significance level of 0.05.
Thus, we reject the null hypothesis and conclude that there is a difference between the non-rapid eye movement (NREM) sleep time and rapid eye movement (REM) sleep time of the mammals (Tables 1–3).

Table 1: Test differences
Table 2: Tests detailed
Table 3: Test formulae

The histogram of the paired differences between NREM sleep and REM sleep, along with the mean and standard deviation, is given in Figure 1.

Figure 1: Histogram of paired differences

What should we do if the data are not normally distributed? This can also happen. As researchers, we have no way of controlling every detail of what subjects do, but we have tests that can address the problem posed by such variation.

Sign Test

The sign test is a statistical method used to compare paired-sample data with a nonparametric distribution. It determines whether the median of the differences between paired observations is 0. This contrasts with the t tests, which use the means of the data. If there is no difference between the paired observations, the median of the differences is 0. If there is a difference, the median is either greater than or less than 0, meaning that the output shows a direction.

Let us consider the example of VAS scores in patients with knee pain. A higher VAS score indicates a higher intensity of pain and vice versa. When we subtract the post-intervention VAS score from the pre-intervention VAS score, a positive difference shows improvement in pain and a negative difference shows worsening of pain (a difference of zero means no change in VAS score). The sign test, as the name implies, accounts only for this positive or negative sign and not for how far the difference is from zero. In other words, it does not account for the magnitude of change. Because it does not consider the magnitude of the change, the sign test is the least powerful of the three tests discussed here, but it is also the most useful when assumptions of normality and sample size are violated.
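A minimal sketch of the idea, using only the Python standard library, is shown here. The VAS scores are invented for illustration and the function name `sign_test` is hypothetical. Two choices are assumptions that the text does not specify: pairs with a difference of 0 are dropped, and the P value is two-sided, computed exactly from the binomial distribution with success probability 0.5.

```python
import math


def sign_test(before, after):
    """Exact two-sided sign test; returns (n_positive, n_negative, p_value)."""
    if len(before) != len(after):
        raise ValueError("paired samples must have the same length")
    diffs = [b - a for b, a in zip(before, after)]
    # Only the sign of each difference is kept; its size is ignored.
    n_pos = sum(d > 0 for d in diffs)
    n_neg = sum(d < 0 for d in diffs)
    n = n_pos + n_neg  # assumption: zero differences are discarded
    if n == 0:
        raise ValueError("all differences are zero")
    # Under H0, positive and negative signs are equally likely, so the
    # count of the rarer sign follows Binomial(n, 0.5).
    k = min(n_pos, n_neg)
    tail = sum(math.comb(n, i) for i in range(k + 1)) / 2 ** n
    p_value = min(1.0, 2 * tail)
    return n_pos, n_neg, p_value


# Hypothetical VAS pain scores for 12 patients (illustration only).
vas_before = [7, 8, 6, 9, 7, 8, 6, 7, 9, 8, 5, 6]
vas_after = [5, 6, 7, 6, 6, 5, 4, 7, 7, 5, 3, 4]

n_pos, n_neg, p_value = sign_test(vas_before, vas_after)
print(f"improved: {n_pos}, worsened: {n_neg}, P = {p_value:.4f}")
```

Note that a patient who improves by 1 point and a patient who improves by 3 points contribute equally here, which is exactly the loss of information that makes the sign test less powerful.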
The sign test is particularly useful when data do not meet the assumptions required for parametric tests like the t test, such as when the data are not normally distributed or when the sample size is small. It is often used in situations where the paired differences are ordinal data. In the sign test, instead of comparing the actual values of the paired observations, only the signs (positive or negative) of the differences between pairs are considered.

To conduct the sign test, the differences between the paired observations are calculated, and it is recorded whether each difference is positive, negative, or 0. Then, the numbers of positive and negative differences are determined. If the alternative hypothesis suggests that one condition is likely to produce higher values than the other, the test statistic is computed as the number of positive differences. The critical values for the sign test are determined from a discrete null distribution (the binomial distribution). If the test statistic falls within the critical region, there is evidence to reject the null hypothesis and support the alternative hypothesis. Simply put, the P value is significant.

Since the sign test does not consider the magnitude of the differences between paired observations, it may lead to loss of information and potentially reduce the test's ability to detect meaningful differences between groups.

Let us look at another example from our mammal sleep dataset. It presents 83 paired observations corresponding to the NREM sleep time and REM sleep time of mammals. The null hypothesis of interest is that there is no difference between the NREM sleep time and REM sleep time of the mammals, and the alternate hypothesis is that there is a difference between the NREM sleep time and REM sleep time of the mammals. The sign test is performed for the paired data. The P value of the test is 1.907 × 10⁻⁶, which is less than the significance level of 0.05.
Thus, we reject the null hypothesis and conclude that there is a difference between the NREM sleep time and REM sleep time of the mammals.

Wilcoxon Signed-Rank Test

The Wilcoxon signed-rank test is a nonparametric method used for comparing paired samples. Here, the differences between paired observations are considered. It integrates the fundamental concepts of the sign test, focusing on the signs of differences, and of the paired t test, assessing the magnitudes of differences. It does not assume that the differences between paired observations are normally distributed. The Wilcoxon signed-rank test is robust to outliers and is suitable for non-normally distributed data or when the sample size is too small (usually considered less than 30).

In the Wilcoxon signed-rank test, the differences between paired observations are calculated. The absolute values of these differences are then ranked, and the signs are retained to produce signed ranks. The sum of the positive ranks and the sum of the absolute values of the negative ranks are calculated. The smaller of these two sums is the W-statistic. If this value falls within the critical region, there is sufficient evidence to reject the null hypothesis and accept the alternative hypothesis, suggesting a significant difference between the paired observations.

The Wilcoxon signed-rank test may have lower statistical power compared to parametric tests, particularly with small sample sizes. Interpreting results from the Wilcoxon test can be more complex than with parametric tests because it does not directly offer effect size estimates, as the t test does. Moreover, it requires ordinal or interval data and is not appropriate for nominal data. However, this test is more appropriate when the assumptions of a paired t test, such as normality and large sample size, are not met.

Let us look at this using our mammal sleep dataset, which presents 83 paired observations corresponding to the NREM sleep time and REM sleep time of mammals.
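Setting the mammal sleep data aside for a moment, the ranking steps described above can be sketched on a small made-up example using only the Python standard library. The VAS scores (here on a 0 to 100 mm scale) are invented, and the function names `average_ranks` and `wilcoxon_w` are hypothetical. Two details are assumptions not stated in the text: pairs with a difference of 0 are dropped, and tied absolute differences receive the average of the ranks they span. The critical value of 8 (two-sided, significance level 0.05, 10 nonzero pairs) is taken from a standard Wilcoxon table.

```python
def average_ranks(values):
    """Rank values from 1 upward, giving tied values their average rank."""
    order = sorted(range(len(values)), key=lambda i: values[i])
    ranks = [0.0] * len(values)
    i = 0
    while i < len(order):
        j = i
        # Extend j over the run of values tied with the one at position i.
        while j + 1 < len(order) and values[order[j + 1]] == values[order[i]]:
            j += 1
        avg = (i + j) / 2 + 1  # mean of the 1-based positions i+1 .. j+1
        for k in range(i, j + 1):
            ranks[order[k]] = avg
        i = j + 1
    return ranks


def wilcoxon_w(before, after):
    """Return (sum_positive_ranks, sum_negative_ranks, W) for paired samples."""
    if len(before) != len(after):
        raise ValueError("paired samples must have the same length")
    # Assumption: pairs with no change (difference of 0) are discarded.
    diffs = [b - a for b, a in zip(before, after) if b != a]
    ranks = average_ranks([abs(d) for d in diffs])
    w_pos = sum(r for d, r in zip(diffs, ranks) if d > 0)
    w_neg = sum(r for d, r in zip(diffs, ranks) if d < 0)
    return w_pos, w_neg, min(w_pos, w_neg)


# Hypothetical VAS scores in mm for 10 patients (illustration only).
vas_before = [62, 45, 71, 58, 66, 40, 75, 53, 49, 80]
vas_after = [50, 50, 48, 50, 47, 43, 48, 38, 43, 49]

w_pos, w_neg, w = wilcoxon_w(vas_before, vas_after)

# Assumed critical value from a standard table: two-sided, alpha = 0.05, n = 10.
W_CRITICAL_N10 = 8
reject_null = w <= W_CRITICAL_N10

print(f"positive rank sum = {w_pos}, negative rank sum = {w_neg}, W = {w}")
print(f"reject H0: {reject_null}")
```

Unlike the sign test, a large worsening would receive a high rank and pull W upward, so the size of each change influences the result. For larger samples, software typically replaces the table lookup with an exact or normal-approximation P value.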
The null hypothesis of interest is that there is no difference between the NREM sleep time and REM sleep time of the mammals, and the alternate hypothesis is that there is a difference between the NREM sleep time and REM sleep time of the mammals. The Wilcoxon signed-rank test is performed for the paired data. The P value of the test is 1.139 × 10⁻¹¹, which is less than the significance level of 0.05. Thus, we reject the null hypothesis and conclude that there is a difference between the NREM sleep time and REM sleep time of the mammals.

Limitations of the analysis of paired data

Many studies in the life sciences use paired data designs such as pre- and post-intervention measurements to evaluate the effect of an intervention. While statistically efficient, these designs have certain limitations. First, interpreting changes over time can be challenging, as any observed differences may be due to unrelated temporal factors, not the intervention itself. Without a concurrent control group, it is hard to separate the true effect of treatment from natural recovery or external influences. Second, paired tests typically summarize the differences between paired values using a mean or median. If individual responses vary widely, with some patients improving and others worsening, the average difference may appear small, potentially underestimating the true variation. Finally, as with most group-level analyses, paired tests can obscure individual or subgroup-level variability. While not unique to paired methods, this limitation becomes especially relevant when treatment responses differ substantially between individuals.

Financial support and sponsorship
Nil.

Conflicts of interest
There are no conflicts of interest.
Nandakumar et al. (Thu,) studied this question.