Wednesday, 24 July 2013

False discovery rate correction for multiple testing of positive selection [Practical]

When performing a test for the detection of positive selection (e.g. with CodeML/PAML) on multiple branches one by one, we need to correct our results to control the false discovery rate. There are many methods; see the Wikipedia page: http://en.wikipedia.org/wiki/False_discovery_rate.

One I like is the QVALUE package for R:


Launch R and install QVALUE by typing:
source("http://bioconductor.org/biocLite.R")
biocLite("qvalue")
and load it:
library(qvalue)
We need to load the list of p-values into R. This is a single-column file containing only p-values. The order is really important, so we keep the ordered names of the branches and/or genes tested in a separate file.

0.00623359809654
0.148525646364
1.0
0.0682850741607
0.0322125346865
A file containing p-values from 760 gene families is provided here as an example:


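If your results start out as a two-column table (test name, then p-value), here is a minimal sketch to split it into the two files; the file names are only placeholders:

# hypothetical two-column input: test name and p-value, no header
res <- read.table("codeml_results.txt", col.names=c("test", "pvalue"))
# write the names and the p-values to separate files, keeping the same order
writeLines(as.character(res$test), "branches.list")
write(res$pvalue, file="pvalues.list", ncolumns=1)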
Once we have our file of p-values, we can load it into R:
p <- scan("pvalues.list")   # one p-value per line

First, have a look at the distribution:

hist(p, breaks=20, main="Distribution of p-values", xlab="Value")

which will display this histogram:
The distribution is bimodal, as expected: when the test is negative, the p-value will be 1, and when it is significant, the p-value is generally very low.

Now, we can perform the qvalue correction:
qobj <- qvalue(p, pi0.method="bootstrap", fdr.level=0.05)

Two comments:
  1. The “bootstrap” option is really important here, as the distribution is bimodal (see page 11 of the QVALUE manual).
  2. fdr.level=0.05 sets the threshold at which tests are called significant: among those calls, an estimated 5% may be false positives.
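Before writing the output, you can sanity-check the result; summary() and the estimated pi0 are available directly on the returned object:

summary(qobj)   # counts of significant tests at several cutoffs
qobj$pi0        # estimated proportion of true null hypotheses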

Finally, we write the output into a file:
write.qvalue(qobj, file="qvalues.list")

The output file has six columns:
  1. The p-value, in the same order as in the input file.
  2. The corresponding q-value.
  3. The local false discovery rate (lfdr).
  4. The estimated pi0, the proportion of true null hypotheses.
  5. The significance of the test at the chosen threshold (0 or 1).
  6. The FDR threshold used to call tests significant.

"pvalue" "qvalue" "lfdr" "pi0" "significant" "fdr.level"
0.00623359809654 0.00598773686818557 0.0291702400633863 0.323886639676113 1 0.05
0.148525646364 0.0731203182099692 0.391218512376612 0.323886639676113 0 0.05
1 0.323886639676113 1 0.323886639676113 0 0.05
0.0682850741607 0.0386405371024297 0.206113388985345 0.323886639676113 1 0.05
0.0322125346865 0.0212580142290782 0.111050362300586 0.323886639676113 1 0.05
Now we can say which tests are significant and which are not.
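For example, we can count the significant tests directly in R (the significant field is filled in because we passed fdr.level above):

sum(qobj$significant)   # number of tests called significant at FDR = 5%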


Please don't hesitate to contact me if you have any comments or questions.

14 comments:

  1. Hi Romain,

    I'm stuck on the concept of false discovery rate and how to prepare the p-values file to feed the Qvalue package.

    Well, my issue is that I have run the branch-site model on about 8000 genes for four lineages, let's call them species A, B, C and D, separately. I have p-values for each gene for each of those species.

    Now when you say "The order is really important, so we have in a separate file the ordered names of the branch and/or gene tested."

    How do I set up the file and order the p-values to correct for the FDR of multiple testing across the four hypotheses, based on q-values?

    I appreciate your help; in case it is not clear, I can elaborate further.

    Regards,

    Edson

    ReplyDelete
    Replies
    1. Hi Edson,

      You will have to put all the p-values you got into one single file, which will look like this:
      Gene1_SpeciesA 0.00623359809654
      Gene2_SpeciesA 0.148525646364
      Gene3_SpeciesA 1.0
      Gene1_SpeciesB 0.0682850741607
      Gene2_SpeciesB 0.0322125346865
      ...

      The order is not important in that file; what matters is having one p-value associated with each test.

      But the problem is that qvalue works only with a list of p-values in a single column. So you will have to make a second file with only the p-values, such as:
      0.00623359809654
      0.148525646364
      1.0
      0.0682850741607
      0.0322125346865

      => This is where the order is important, to know which p-value (and so which q-value) corresponds to which test. You can then just paste the columns side by side in LibreOffice (or Excel) to have each q-value next to its test.
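      If you prefer to stay in R, here is a small sketch for the join (assuming the file names used above; adapt them to yours):

      tests <- scan("branches.list", what="character")   # one test name per line
      q <- read.table("qvalues.list", header=TRUE)       # output of write.qvalue
      write.table(cbind(test=tests, q), file="qvalues_named.list", row.names=FALSE, quote=FALSE)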


      I hope it is clear enough.


      Romain

      Delete
    2. Dear Romain,

      Thanks very much for your help. Let me work on these ideas and hopefully I will get there.

      Delete
  2. I just tried this and the qwrite function could not be found. It turns out in the latest version of the qvalue package they use a write.qvalue() function. This is the example they give:

    # write q-value object
    qobj <- qvalue(p)
    write.qvalue(qobj, file="myresults.txt")

    ReplyDelete
    Replies
    1. Thank you very much! I updated the post.

      Delete
    2. Just to let you know, it's file="" not filename=""; with filename you get an "unused argument" error. I guess they decided to change argument names too :)

      Delete
    3. Thanks, corrected! They also changed the output format, which is now easier to parse, with only one header line.

      Delete
  3. Hi Romain,

    I conducted a positive selection analysis on ~19,000 genes that were obtained from whole-genome sequencing on an Illumina HiSeq 2000. In particular, I have several complete genomes of different species and I am testing whether the genes of one species (foreground) could be under positive selection. I compared the same null vs. alternative model for each gene family, independently. Therefore, I was thinking of performing the false discovery rate correction with the bootstrap method. However, the distribution of my p-values is neither uniform nor bimodal. Specifically, the majority of my p-values are one or very close to this value. Therefore, in my histogram there is a high bar of p-values at one, and the rest of the p-values (including the ones close to 0) have low frequencies.

    Is there anything I could do instead of the bootstrap method? Something like a null distribution that reflects a more uniform distribution of p-values?

    Thanks in advance for your help

    ReplyDelete
    Replies
    1. Hi Daniel,

      You could still use bootstrap, even if the distribution is not bimodal. You could also try the option "smoother" instead of "bootstrap". There are some parameters you can tweak in the qvalue package: http://genomics.princeton.edu/storeylab/qvalue/manual.pdf
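      For example, with your vector of p-values (the lambda grid below is just a starting point for tuning the pi0 estimation):

      # smoother instead of bootstrap for the pi0 estimate
      qobj <- qvalue(p, pi0.method="smoother", fdr.level=0.05)
      # or keep bootstrap but change the lambda grid used to estimate pi0
      qobj <- qvalue(p, pi0.method="bootstrap", lambda=seq(0.05, 0.95, 0.05), fdr.level=0.05)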


      Romain

      Delete
  4. Hello Romain,

    Thank you very much for your blog. I would like to ask whether the (pFDR) q-value is better than (FWER) methods such as Bonferroni, the B-Y method, etc. I read the paper by Anisimova and Yang (http://mbe.oxfordjournals.org/content/24/5/1219.full), where it is mentioned that
    //It is inappropriate for branch–site tests of positive selection, as the error rate (pFDR) is 1 when all null hypotheses are true, that is, when no branch on the tree is under positive selection//
    Hence I am a bit confused. I have 5 branches with p-values (listed below); Bonferroni, Hochberg, and BY gave no significant values, whereas q-values indicate one significant hit at C. Could you explain this, please?
    A 1.0
    B 0.03
    C 0.01
    D 1.0
    E 1.0

    Thanks and Regards,
    Ralph

    ReplyDelete
    Replies
    1. Hi Ralph,

      Usually, the threshold used to assign a p-value as significant is 0.05. In your case, B and C would be significant.

      But as you did five comparisons, you want to correct for the false discovery rate.

      With Bonferroni, the simplest but most stringent correction (FWER), you divide the threshold (α = 0.05) by the number of comparisons (0.05/5 = 0.01). So the threshold is now 0.01, and none of your p-values fall below it.
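      You can check this quickly with base R's p.adjust on your five p-values:

      p <- c(A=1.0, B=0.03, C=0.01, D=1.0, E=1.0)
      p.adjust(p, method="bonferroni")   # C becomes 0.05 and B 0.15: neither falls below 0.05
      p.adjust(p, method="BY")           # Benjamini-Yekutieli gives a similar picture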

      I guess the result is similar with FDR corrections such as Hochberg or BY.

      Q-value works differently: it estimates the proportion of true null hypotheses (pi0) and rescales the p-values accordingly, which is why you get one significant hit.

      And it is even stronger if you use the bootstrap option. With FDR = 5%, you will get 2 significant hits, B (q = 0.0473) and C (q = 0.0315).

      In your case, I would keep the two branches B and C and discuss them as significant results. If you increase the number of sequences in your dataset, you may get more power and thus lower (more significant) p-values.


      Best regards,
      Romain

      Delete
  5. Hi Romain,

    First of all thank you for your very clear posts about positive selection analysis.

    I have conducted a series of positive selection analyses on different enzymatic families and applied the branch-site model provided by PAML. I now have a CSV file with the corresponding branch sites on the tree and the p-values.

    My first question is whether I should be calculating the q-values based on my p-values, or the local FDR, which can actually provide more insight for each individual site? I can see that the output text file provides both qvalue and lfdr, which are significantly different, and therefore I am a bit confused about which one to go for. I have used the fdrtool package for R and calculated the lfdr values, but your post made me reconsider this decision.

    My second question is what cut-off to use for the p-value/q-value or lfdr (depending on which one I should calculate). At the moment I have a cut-off of 0.001, which seems very strict based on your posts. Is there any way to make this choice more specific to my dataset (see p-value distribution)?

    ReplyDelete
    Replies
    1. Hi Eirini,

      I would suggest using the p-values to compute the q-values.

      For the cutoff, it is quite arbitrary. It depends on how many false positives you can tolerate in your final results. 5% or 10% are quite common values for the false discovery rate using q-values.
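      If you want to see how much the choice matters, compare the number of hits at both cutoffs using the q-values stored in the object:

      sum(qobj$qvalues <= 0.05)   # hits at FDR = 5%
      sum(qobj$qvalues <= 0.10)   # hits at FDR = 10%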


      If you want, you can send me the CSV by email and I can have a look at it.

      Best regards,
      Romain

      Delete
    2. Hi Romain,

      Thanks for your reply. I will email to you my csv file with more details.

      Thanks again,
      Eirini

      Delete