How do you test for a significant difference between two sets of data?

Oftentimes we want to compare sets of samples. Such comparisons include whether wild-type samples have different expression compared to mutants, or whether healthy samples are different from disease samples in some measurable feature (blood count, gene expression, methylation of certain loci). Since there is variability in our measurements, we need to take that into account when comparing the sets of samples. We could simply subtract the means of the two samples, but given the variability introduced by sampling, we need at the very least to decide on a cutoff value for the difference of means; small differences of means can be explained by random chance due to sampling. That means we need to compare the difference we get to a value that is typical to get if the difference between the two group means were only due to sampling. If you followed the logic above, we have actually just introduced two core ideas of something called “hypothesis testing”, which is simply using statistics to determine the probability that a given hypothesis (e.g., that two sample sets are from the same population or not) is true. Formally, the expanded version of those two core ideas is as follows:

  1. Decide on a hypothesis to test, often called the “null hypothesis” (\(H_0\)). In our case, the hypothesis is that there is no difference between sets of samples. An “alternative hypothesis” (\(H_1\)) is that there is a difference between the samples.
  2. Decide on a statistic to test the truth of the null hypothesis.
  3. Calculate the statistic.
  4. Compare it to a reference value to establish significance, the P-value. Based on that, either reject or fail to reject the null hypothesis, \(H_0\).

There is one intuitive way to go about this. If we believe there are no differences between samples, that means the sample labels (test vs. control or healthy vs. disease) have no meaning. So, if we randomly assign labels to the samples and calculate the difference of the means, this creates a null distribution for \(H_0\) where we can compare the real difference and measure how unlikely it is to get such a value under the expectation of the null hypothesis. We can calculate all possible permutations to calculate the null distribution. However, sometimes that is not very feasible and the equivalent approach would be generating the null distribution by taking a smaller number of random samples with shuffled group membership.

Below, we are doing this process in R. We are first simulating two samples from two different distributions. These would be equivalent to gene expression measurements obtained under different conditions. Then, we calculate the difference in the means and do the randomization procedure to get a null distribution under the assumption that there is no difference between samples, \(H_0\). We then calculate how often we would get the original difference we calculated under the assumption that \(H_0\) is true. The resulting null distribution and the original value are shown in Figure 3.9.

set.seed(100)
gene1=rnorm(30,mean=4,sd=2)
gene2=rnorm(30,mean=2,sd=2)
org.diff=mean(gene1)-mean(gene2)
gene.df=data.frame(exp=c(gene1,gene2),
                  group=c( rep("test",30),rep("control",30) ) )


# null distribution: shuffle group membership by randomly splitting the
# 60 measurements into two groups of 30, and record the difference of means
exp.null <- do.call(rbind,
                    lapply(1:1000, function(x) {
                      idx <- sample(60, 30)
                      mean(gene.df[idx, "exp"]) - mean(gene.df[-idx, "exp"])
                    }))
p.val=sum(exp.null[,1]>org.diff)/length(exp.null[,1])
p.val

## [1] 0.001

After doing random permutations and getting a null distribution, it is possible to get a confidence interval for the distribution of the difference in means. This is simply the 2.5th and 97.5th percentiles of the null distribution, and it is directly related to the P-value calculation above.
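
For example, with the exp.null null distribution computed above, such an interval can be obtained with the quantile() function (a minimal sketch; the exact bounds will vary with the random shuffles):

# 2.5th and 97.5th percentiles of the permutation-based null distribution
quantile(exp.null[,1], probs=c(0.025,0.975))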

We can also test the difference between means using a t-test. Sometimes we will have too few data points in a sample to do a meaningful randomization test; in addition, randomization takes more time than a t-test. The t-test depends on the t distribution: the line of thought follows from the CLT, and we can show that differences in means are t distributed. There are a couple of variants of the t-test for this purpose. If we assume the population variances are equal, we can use the following version:

\[t = \frac{\bar {X}_1 - \bar{X}_2}{s_{X_1X_2} \cdot \sqrt{\frac{1}{n_1}+\frac{1}{n_2}}}\] where \[s_{X_1X_2} = \sqrt{\frac{(n_1-1)s_{X_1}^2+(n_2-1)s_{X_2}^2}{n_1+n_2-2}}\] In the first equation above, the quantity is t distributed with \(n_1+n_2-2\) degrees of freedom. We can calculate the quantity and then use software to look up the percentile of that value in the t distribution, which gives us our P-value. When we cannot assume equal variances, we use “Welch’s t-test”, which is the default t-test in R and also works well when variances and sample sizes are the same. For this test we calculate the following quantity:

\[t = \frac{\overline{X}_1 - \overline{X}_2}{s_{\overline{X}_1 - \overline{X}_2}}\] where \[s_{\overline{X}_1 - \overline{X}_2} = \sqrt{\frac{s_1^2 }{ n_1} + \frac{s_2^2 }{n_2}}\]

and the degrees of freedom are equal to

\[\mathrm{d.f.} = \frac{(s_1^2/n_1 + s_2^2/n_2)^2}{(s_1^2/n_1)^2/(n_1-1) + (s_2^2/n_2)^2/(n_2-1)} \]
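
To make these formulas concrete, below is a small hand calculation (a sketch using gene1 and gene2 from above, with base R’s var() and pt()); the resulting t statistic, degrees of freedom and P-value should match the Welch t-test output shown further below.

# Welch's t statistic, degrees of freedom and two-sided P-value by hand
n1=length(gene1); n2=length(gene2)
s1=var(gene1); s2=var(gene2)
se=sqrt(s1/n1 + s2/n2)                  # standard error of the difference in means
t.stat=(mean(gene1)-mean(gene2))/se
df=(s1/n1 + s2/n2)^2 / ((s1/n1)^2/(n1-1) + (s2/n2)^2/(n2-1)) # Welch-Satterthwaite d.f.
p.value=2*pt(-abs(t.stat), df)          # two-sided P-value from the t distribution
c(t=t.stat, df=df, p=p.value)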

Luckily, R does all those calculations for us. Below we show the use of the t.test() function in R, applied to the samples we simulated above.

# Welch's t-test
stats::t.test(gene1,gene2)

## 
##  Welch Two Sample t-test
## 
## data:  gene1 and gene2
## t = 3.7653, df = 47.552, p-value = 0.0004575
## alternative hypothesis: true difference in means is not equal to 0
## 95 percent confidence interval:
##  0.872397 2.872761
## sample estimates:
## mean of x mean of y 
##  4.057728  2.185149

# t-test with equal variance assumption
stats::t.test(gene1,gene2,var.equal=TRUE)

## 
##  Two Sample t-test
## 
## data:  gene1 and gene2
## t = 3.7653, df = 58, p-value = 0.0003905
## alternative hypothesis: true difference in means is not equal to 0
## 95 percent confidence interval:
##  0.8770753 2.8680832
## sample estimates:
## mean of x mean of y 
##  4.057728  2.185149

A final word on t-tests: they generally assume that the samples come from a normally distributed population; however, it has been shown that the t-test can tolerate deviations from normality, especially when the two distributions are moderately skewed in the same direction. This is due to the central limit theorem, which says that the means of samples will be distributed normally no matter the population distribution, provided the sample sizes are large.
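
As a small illustrative simulation (not from the original text), we can check this robustness by repeatedly applying the t-test to two samples drawn from the same skewed distribution; since the null hypothesis is true in every run, the fraction of P-values below 0.05 should stay close to 5% despite the non-normality.

# both groups come from the same skewed (exponential) distribution, so H0 is true
set.seed(10)
pvals=replicate(1000, t.test(rexp(30), rexp(30))$p.value)
mean(pvals < 0.05)   # should be roughly 0.05 if the t-test tolerates the skew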

We should think of hypothesis testing as a non-error-free method of making decisions. There will be times when we declare something significant and accept \(H_1\) but we will be wrong. These decisions are also called “false positives” or “false discoveries”, and are also known as “type I errors”. Similarly, we can fail to reject a hypothesis when we actually should. These cases are known as “false negatives”, also known as “type II errors”.

The ratio of true negatives to the sum of true negatives and false positives (\(\frac{TN}{FP+TN}\)) is known as specificity; we usually want to decrease the number of false positives and thereby get higher specificity. The ratio of true positives to the sum of true positives and false negatives (\(\frac{TP}{TP+FN}\)) is known as sensitivity; again, we usually want to decrease the number of false negatives and get higher sensitivity. Sensitivity is also known as the “power of a test” in the context of hypothesis testing. More powerful tests will be highly sensitive and will have fewer type II errors. For the t-test, the power is positively associated with sample size and effect size: the larger the sample size, the smaller the standard error, and looking for larger effect sizes will similarly increase the power.
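
As a tiny numerical sketch of these definitions (with made-up counts, purely for illustration):

# hypothetical outcome counts from a set of tests
TP=80; FP=20; TN=880; FN=20
TN/(FP+TN)   # specificity: fraction of true nulls we correctly did not reject
TP/(TP+FN)   # sensitivity (power): fraction of true alternatives we detected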

A general summary of these different decision combinations is given in the table below.

|  | \(H_0\) is TRUE (gene is NOT differentially expressed) | \(H_1\) is TRUE (gene is differentially expressed) |  |
|---|---|---|---|
| Accept \(H_0\) (claim that the gene is not differentially expressed) | True Negatives (TN) | False Negatives (FN), type II error | \(m_0\): number of truly null hypotheses |
| Reject \(H_0\) (claim that the gene is differentially expressed) | False Positives (FP), type I error | True Positives (TP) | \(m-m_0\): number of truly alternative hypotheses |

We expect to make more type I errors as the number of tests increases, which means we will reject the null hypothesis by mistake more often. For example, if we perform a test at the 5% significance level, there is a 5% chance of incorrectly rejecting the null hypothesis if the null hypothesis is true. However, if we perform 1000 tests where all null hypotheses are true, the average number of incorrect rejections is 50. And if we apply the rules of probability, there is almost a 100% chance that we will have at least one incorrect rejection.

There are multiple statistical techniques to prevent this from happening. These techniques generally push the P-values obtained from multiple tests to higher values; if the individual P-value is low enough, it survives this process. The simplest method is just to multiply the individual P-value (\(p_i\)) by the number of tests (\(m\)): \(m \cdot p_i\). This is called “Bonferroni correction”. However, this is too harsh if you have thousands of tests. Other methods have been developed to remedy this. Those methods rely on ranking the P-values and dividing \(m \cdot p_i\) by the rank \(i\): \(\frac{m \cdot p_i}{i}\), which is derived from the Benjamini–Hochberg procedure. This procedure was developed to control the “False Discovery Rate (FDR)”, which is the proportion of false positives among all significant tests. In practical terms, we get the “FDR-adjusted P-value” from the procedure described above. This gives us an estimate of the proportion of false discoveries for a given test. To elaborate, a P-value of 0.05 implies that 5% of all tests will be false positives, whereas an FDR-adjusted P-value of 0.05 implies that 5% of significant tests will be false positives. The FDR-adjusted P-values will result in a lower number of false positives.
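
To see the two corrections in action, here is a small sketch on a handful of made-up P-values, comparing the hand calculations with R’s built-in p.adjust() (note that p.adjust(method="BH") additionally enforces monotonicity across the ranked P-values, so it can differ slightly from the raw formula):

# made-up P-values, purely for illustration
pvals=c(0.0001, 0.003, 0.02, 0.04, 0.45)
m=length(pvals)
bonf.manual=pmin(1, m*pvals)               # Bonferroni: m * p_i, capped at 1
bh.manual=pmin(1, m*pvals/rank(pvals))     # Benjamini-Hochberg formula: m * p_i / i
cbind(bonf.manual, p.adjust(pvals,method="bonferroni"),
      bh.manual, p.adjust(pvals,method="BH"))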

One final method that is also popular is called the “q-value” method and is related to the method above. This procedure relies on estimating the proportion of true null hypotheses from the distribution of raw P-values and using that quantity to come up with what is called a “q-value”, which is also an FDR-adjusted P-value (Storey and Tibshirani 2003). That can be practically defined as “the proportion of significant features that turn out to be false leads.” A q-value of 0.01 would mean that 1% of the tests called significant at this level will be truly null on average. Within the genomics community, q-value and FDR-adjusted P-value are used synonymously, although they can be calculated differently.

In R, the base function p.adjust() implements most of the P-value correction methods described above. For the q-value, we can use the qvalue package from Bioconductor. Below we demonstrate how to use them on a set of example P-values (the hedenfalk P-values that ship with the qvalue package). The plot in Figure 3.10 shows that Bonferroni correction does a terrible job. FDR (BH) and the q-value approach are better, but the q-value approach is more permissive than FDR (BH).

library(qvalue)
data(hedenfalk)

# q-values from the qvalue package, plus base-R adjusted P-values for comparison
qvalues <- qvalue(hedenfalk$p)$qvalues
bonf.pval <- p.adjust(hedenfalk$p, method="bonferroni")
fdr.adj.pval <- p.adjust(hedenfalk$p, method="fdr")

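One possible way to produce a comparison plot along the lines of Figure 3.10, using the quantities computed above (the styling and layout of the original figure may differ):

# plot raw P-values against the three kinds of adjusted values
plot(hedenfalk$p, qvalues, pch=19, ylim=c(0,1),
     xlab="raw P-values", ylab="adjusted P-values")
points(hedenfalk$p, bonf.pval, pch=19, col="red")
points(hedenfalk$p, fdr.adj.pval, pch=19, col="blue")
legend("bottomright", legend=c("q-value","FDR (BH)","Bonferroni"),
       fill=c("black","blue","red"))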