When performing a test
for the detection of positive selection (i.e. with CodeML/PAML) on multiple branches one by
one, we need to correct your result to avoid false discovery
rate (there are many methods (see the wiki page:
http://en.wikipedia.org/wiki/False_discovery_rate).

One I like is the
QVALUE package, for R:

Launch R and install QVALUE by
typing:

source("http://bioconductor.org/biocLite.R")

biocLite("qvalue")

and load it:

library(qvalue)

We need to load the
list of p-value into R. This is a single with only p-values. The
order is really important, so we have in a separate file the ordered
names of the branch and/or gene tested.

0.00623359809654

0.148525646364

1.0

0.0682850741607

0.0322125346865

An file containing p-values from 760 gene families is provided
here as an example:

Once we have our file
of p-values, we can load it into R:

p <- scan("pvalues.list", na.strings=T)

First, have a look at the distribution:

which will display this histogram:

The distribution is bimodal, as expected: when the test is negative, we will have a
p-value = 1, and when the test is significant, the p-value is generally very
low.

Now, we can perform the qvalue
correction:

qobj <- qvalue(p, pi0.meth="bootstrap", fdr.level=0.05)

Two comments:

- The option “bootstrap” is really important, as the distribution is bimodal (see page 11 of the QVALUE manual).
- The
fdr.level=0.05 is a threshold which estimates that 5% of our tests could
be false-positive.

Finally, we write the output into a file:

write.qvalue(qobj, file="qvalues.list")

The output file has five columns:

- The p-values, in the same order as in the input file.
- The corresponding q-value.
- The pi0 value.
- The significance of the test, based on the threshold (0 or 1).
- The fdr threshold used to assign as significant.

"pvalue" "qvalue" "lfdr" "pi0" "significant" "fdr.level"

0.00623359809654 0.00598773686818557 0.0291702400633863 0.323886639676113 1 0.05

0.148525646364 0.0731203182099692 0.391218512376612 0.323886639676113 0 0.05

1 0.323886639676113 1 0.323886639676113 0 0.05

0.0682850741607 0.0386405371024297 0.206113388985345 0.323886639676113 1 0.05

0.0322125346865 0.0212580142290782 0.111050362300586 0.323886639676113 1 0.05

Please don't hesitate to contact me if you have any comments or questions.

Hi Romain,

ReplyDeleteI'm stuck on the concept of false discovery rate and how to prepare the p-values file to feed the Qvalue package.

Well, my issue is that I have run branch-site model for about 8000 genes for four lineages, lets call them species A, B, C and D, separately. I have results for p-values for each genes for each of those species.

Now when you say "The order is really important, so we have in a separate file the ordered names of the branch and/or gene tested."

How do I set the file and order the p-values to correct for the FDR four hypothesis multiple testing based on q-value?

Appreciate your help and in case it is not clear I may need to elaborate further.

Regards,

Edson

Hi Edson,

DeleteAll the p-values you got, you will have to put them in one single files, which will be like this:

Gene1_SpeciesA 0.00623359809654

Gene2_SpeciesA 0.148525646364

Gene3_SpeciesA 1.0

Gene1_SpeciesB 0.0682850741607

Gene2_SpeciesB 0.0322125346865

.

The order is not important that file, only to have one p-value associated to one test.

But the problem is that q-value work only with a list of p-values in a single column. So you will have to have a file with only p-values, such as:

0.00623359809654

0.148525646364

1.0

0.0682850741607

0.0322125346865

=> This is where the order is important, to know which p-value (and so q-value) corresponds to which test. You can then just append the file in LibreOffice (or Excel) to then have the q-value in front of the test.

I hope it is clear enough.

Romain

Dear Romain,

DeleteThanks very much for your help. Let me work on these ideas and hopefully I will get it through.

I just tried this and the qwrite function could not be found. It turns out in the latest version of the qvalue package they use a write.qvalue() function. This is the example they give:

ReplyDelete# write q-value object

qobj <- qvalue(p)

write.qvalue(qobj, file="myresults.txt")

Thank you very much! I updated the post.

DeleteJust to let you know its file="" not filename="" with filename you get a unused argument warning .. I guess they decided to change argument names also :)

DeleteThanks, corrected! They also changed the output, which is now better for parsing, with only one line as header.

DeleteHi Roman,

ReplyDeleteI conducted a positive Selection analysis on ~19, 000 genes that were obtained from whole-genome sequencing on an Illumina HiSeq 2000. In particular, I have several complete genomes of different species and I am testing whether the genes of one species (foreground) could be under positive selection. I compared the same null vs alternative model for each gene family, independently. Therefore, I was thinking to performed false discovery rate correction with the bootstrap method. However, the distribution of my p-values is neither uniform nor binomial. Specifically, the majority of my p-values are one or very close to this value. Therefore, in my histogram there is a high bar of p-values in one, and the rest of the p-values (including the ones close to 0) have low frequencies.

Is there anything I could do instead of the bootstrap method? like some sort of null distribution that reflects a more uniform distribution of p-values.

Thanks in advance for your help

Hi Daniel,

DeleteYou could still use bootstrap, even it is not bimodal. You could also try the option "smoother" instead of "boostrap". There are some parameters you can tweak in the qvalue package: http://genomics.princeton.edu/storeylab/qvalue/manual.pdf

Romain

Hello Romain,

ReplyDeleteThank you very much for your blog. I would like to ask whether using (pFDR) qvalue is better than (FWER) methods such as Bonferroni, B-Y method etc., I read the paper from Anisimova and Yang (http://mbe.oxfordjournals.org/content/24/5/1219.full) and it is mentioned that

//It is inappropriate for branch–site tests of positive selection, as the error rate (pFDR) is 1 when all null hypotheses are true, that is, when no branch on the tree is under positive selection//

Hence I am bit confused, I have 5 branches with pvalues (listed below) Bonferroni, Hochberg, BY resulted no significant value, where as qvalues indicate 1 significant in C. Could you explain this please.

A 1.0

B 0.03

C 0.01

D 1.0

E 1.0

Thanks and Regards,

Ralph

Hi Ralph,

DeleteUsually, the threshold used to assign a p-value as significant is 0.05. In your case, B and C would be significant.

But as you did five comparison, you want to correct for false-discovery rate.

With Bonferroni, the simplest but most stringent (FWER), you divide the threshold (a=0.05) by the number of comparison (0.05/5 = 0.01). So the threshold is now 0.01, and then none of your p-values are below this threshold.

I guess the problem is similar with FDR correction as Hochberg or BY.

Q-value works differently, has it roughly normalize all p-values, this is why you get one significant hit.

And it is even stronger if you use the bootstrap option. With FDR=5%, you will get 2 significant hit, B (q=0.0473) and C (q=0.0315).

In your case, I would keep the two branches B and C, and discuss them as significant results. It might be possible that if you increase the number of sequences in your dataset, you would get more power and so lower p-values (more significant).

Best regards,

Romain

Hi Romain,

ReplyDeleteFirst of all thank you for your very clear posts about positive selection analysis.

I have conducted a series of positive selection analysis on different enzymatic families and applied the branch site model provided by PAML. I have now a csv doc with the corresponding branch site on the tree and the p values.

My first question is if I should be calculating the q values based on my p values or the local fdr which can actually provide more insight for each individual site ? I can see that the output text file provide both qval and lfrd which are significantly different and therefore I am a bit confused for which one I should go for. I have used the fdrtools package provoded by R and calculated the lfdr values but your post made me reconsider this decision.

My second question is what are the pval/qval or lfdr (depending on which one I should calculate) cut off ? At the moment I have a cut off of 0.001 which seems to be very strict based on you posts. Is there any way I would make this choice more specific based on my data set (See p value distribution)?

Hi Eirini,

DeleteI would suggest to use the p-values to compute the q-value.

For the cutoff, it is quite arbitrary. It depends how false positive you could tolerate in your final results. 5% or 10% are quite common values for false-discovery rate using q-value.

If you want, you can send me the CSV by email and I can have a look at it.

Best regards,

Romain

Hi Romain,

DeleteThanks for your reply. I will email to you my csv file with more details.

Thanks again,

Eirini

Hi Romain,

ReplyDeleteI have a couple of questions regarding the calculation of the qvalues using the R package.

1. You mention at your post that you include the option for bootstrapping due to the fact that the data appear to be bimodal. My data do not appear to be bimodal. The majority of my data lie at value 1, and a few of them at around 0. Should I still use the bootstrapping option in this case ?

2. You have introduced a fdr = 0.05 to your qvalue calculation. Is that a liberally chosen cut off ? How could someone be more precise on the choice of the fdr value? Maybe prior calculation of the local fdr values and observation of the data would be a way to determine the fdr value ?

3. There are quite a few researches on genomic data which have used p-adjust methods instead of the qvalues. To me the difference between those two is not clear. From my understanding q values are p adjusted fdr values. Please correct me if I am wrong. Why do you suggest using qvalues for multiple test correction ?

Thank you in advance for your time.

Eirini

Hi Eirini,

Delete1) The bootstrapping provides a more robust way to apply q-values, especially in bimodal distribution. In general, the test of positive selection is frequently either 1 or very small. In the manual of q-values, they recommand the bootstrap option for such cases (page 11):

http://genomics.princeton.edu/storeylab/qvalue/manual.pdf

But in your case, you could try the other option of qvalue ("smoother").

2) It is a common value. But at the end, it depends how many false positive you can tolerate and how many false negative you want to retrieve back. You can lower the threshold to 1% or increase to 10%.

3) there are plenty of methods to correct for false-discovery. Qvalue is one of them, which works well on large dataset such as gene tests. Can you put a few links to some of these research studies?

Best regards,

Romain