Wednesday, 20 November 2013

How evolutionary biologists can help biologists from other fields (especially the medical side).

In a recent post, Dan Graur made various criticisms of a paper published last year in PLoS One:

and here is the link to the paper:

I like the concept of this paper, which mixes various bioinformatics methods to identify evolutionary pressures on sites and tries to correlate them with disease mutations (see also our review on the subject and this more recent one).

I have other comments (more like recommendations) that I would like to share in this blog:

- They used alignments from ClustalW, which is surprising as it is the oldest of these methods. Better methods have been published since then, such as MUSCLE, MAFFT, ProbCons, T-Coffee and Clustal Omega. The authors compared some of them, but found that ClustalW was the best to use in their study. They could also have tried a phylogeny-aware alignment method (PRANK, PAGAN) to see whether it makes a difference.

Recommendation number 1: try and/or use different alignment methods that have been proven to be robust.
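One way to see whether the choice of aligner matters is to measure how much two alignments of the same sequences agree. The sketch below is only an illustration, not what the authors did: it computes the fraction of residue pairs aligned in one alignment that are also aligned in the other (a sum-of-pairs style agreement). The function names and the toy sequences are made up for the example, and the alignments are assumed to be already loaded as a dictionary of gapped strings.

```python
from itertools import combinations


def aligned_pairs(alignment):
    """Return the set of residue pairs that share a column.

    `alignment` maps a sequence name to its gapped string ('-' is a gap).
    Each residue is identified by (name, index among ungapped residues).
    """
    names = sorted(alignment)
    length = len(alignment[names[0]])
    if any(len(alignment[n]) != length for n in names):
        raise ValueError("all aligned sequences must have the same length")
    counters = {n: 0 for n in names}
    pairs = set()
    for col in range(length):
        present = []
        for n in names:
            if alignment[n][col] != "-":
                present.append((n, counters[n]))
                counters[n] += 1
        # every pair of residues in this column is an aligned pair
        pairs.update(combinations(present, 2))
    return pairs


def sum_of_pairs_agreement(aln_a, aln_b):
    """Fraction of pairs aligned in aln_a that are also aligned in aln_b."""
    pairs_a = aligned_pairs(aln_a)
    pairs_b = aligned_pairs(aln_b)
    if not pairs_a:
        return 0.0
    return len(pairs_a & pairs_b) / len(pairs_a)


# Toy data (hypothetical): the same three sequences aligned in two ways.
aln_a = {"human": "ACG-T", "mouse": "AC-GT", "fish": "ACGGT"}
aln_b = {"human": "ACG-T", "mouse": "ACG-T", "fish": "ACGGT"}
print(round(sum_of_pairs_agreement(aln_a, aln_b), 3))
```

A value close to 1 suggests the downstream analysis is unlikely to depend much on the aligner; a low value is a signal to rerun the analysis on each alignment and compare the results.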

- There is no indication of how they built the tree (NJ? ML? Bayes?). I tried to build a tree myself and got the same result, with the Rat/Mouse clade also badly placed. These two sequences are highly divergent compared to the other sequences, so it is no surprise that even sophisticated algorithms produce strange results.

Recommendation number 2: while Minimum Evolution (FastME) and Neighbour-Joining (FastTree) can provide some good results, always try and/or use different methods to build a phylogenetic tree, especially with Maximum Likelihood (PhyML, RAxML) and Bayesian (MrBayes).

Recommendation number 3: always give all the details of the methods used: name of the program, release used, and any parameters other than the defaults.
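A small habit that helps here is to record each step of the pipeline in a structured way while running it, so the methods section can be generated rather than remembered. The sketch below is a minimal illustration; the `ToolRun` class is hypothetical, and the release numbers and parameters are placeholder values, not those of the paper discussed here.

```python
from dataclasses import dataclass, field


@dataclass
class ToolRun:
    """One step of an analysis: program, release, non-default parameters."""
    program: str
    release: str
    parameters: dict = field(default_factory=dict)

    def describe(self):
        """Return a one-line description suitable for a methods section."""
        if self.parameters:
            params = ", ".join(
                f"{key}={value}" for key, value in sorted(self.parameters.items())
            )
        else:
            params = "default parameters"
        return f"{self.program} (release {self.release}; {params})"


# Placeholder values for illustration only.
pipeline = [
    ToolRun("MAFFT", "7"),
    ToolRun("PhyML", "3.0", {"model": "LG", "bootstrap": 100}),
]
for step in pipeline:
    print(step.describe())
```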

- They could have rooted the tree with the teleost fishes:

This is not wrong, but it would be much better to present it as rooted.

Recommendation number 4: always try to present your results in the light of evolution.

- The methods they used to analyse conservation/functional divergence are sensitive to the quality of the input, but I think they can tolerate a few topological errors. For example, Pupko and Galtier found that their algorithm gives the same result with different trees:
"Several controls were performed to check the robustness of these results. Results were essentially unchanged when a different phylogenetic tree (Reyes et al. 2000) was used (not shown)."
As the algorithms are similar (i.e. DIVERGE), I would not be surprised to see a similar result with the taxonomic tree. Of course, as a reviewer, I would ask the authors to run the analysis on both alternative topologies: the taxonomic tree and the gene tree.
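Before rerunning an analysis on two topologies, it can be useful to quantify how different they actually are. The sketch below counts the clades found in one rooted tree but not in the other (a Robinson-Foulds style distance on rooted trees). It is a simplified illustration with a hand-written Newick reader: it assumes simple leaf names without quotes or special characters, and it ignores branch lengths and internal node labels. The function names and the toy trees are made up for the example.

```python
def clade_set(newick):
    """Return the non-trivial clades of a rooted Newick tree.

    Each clade is a frozenset of leaf names. Branch lengths and labels
    on internal nodes are ignored. Assumes plain, unquoted leaf names.
    """
    text = newick.strip().rstrip(";")
    found = set()
    stack = [[]]          # leaf names collected for each open parenthesis
    token = ""
    after_close = False   # True while reading the label that follows ')'

    def flush():
        nonlocal token, after_close
        name = token.split(":")[0].strip()
        if name and not after_close:
            stack[-1].append(name)
        token = ""
        after_close = False

    for ch in text:
        if ch == "(":
            stack.append([])
        elif ch == ",":
            flush()
        elif ch == ")":
            flush()
            leaves = stack.pop()
            found.add(frozenset(leaves))
            stack[-1].extend(leaves)
            after_close = True
        else:
            token += ch
    flush()
    all_leaves = frozenset(stack[0])
    # drop the clade containing every leaf and any single-leaf groups
    return {c for c in found if 1 < len(c) < len(all_leaves)}


def rooted_clade_distance(newick_a, newick_b):
    """Number of clades present in exactly one of the two trees."""
    return len(clade_set(newick_a) ^ clade_set(newick_b))


# Toy trees (hypothetical): same species, two different topologies.
tree_a = "(((human,mouse),cow),fish);"
tree_b = "(((human,cow),mouse),fish);"
print(rooted_clade_distance(tree_a, tree_b))
```

A distance of zero means the two trees have the same rooted topology; the larger the value, the more worthwhile it is to check that the conservation or divergence scores do not change between them.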

- The sheep is badly placed, but considering the short branch, this is not really a surprise. And again, I don't think this will greatly influence the final result.

- However, my main worry is the power of their analysis, and what they tried to do. For example, they used DIVERGE like this: "DIVERGE site-specific evolutionary constraint values were computed using the depth 4 (vertebrate) data set only. DIVERGE was run by splitting the vertebrate phylogeny at the deepest node separating the fish from the terrestrial vertebrates, and Type II divergence values were recorded." So they are testing whether any sites are under functional divergence between fishes (3 species) and tetrapods (15 species).

There are three problems here:
1) They only analysed the deepest evolutionary event. If any functional divergence happened later (e.g. in mammals), it could not be detected.
2) They used a clearly unbalanced dataset (3 versus 15).
3) With 3 species on one side, they clearly have no power and no accuracy. The pattern of conservation (or divergence) can just be noise. Normally, at least four species are needed in each clade to see a significant pattern.
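Problems 2 and 3 are easy to check before running anything. The sketch below is a minimal illustration: it flags clades with fewer species than a chosen minimum and reports how unbalanced the comparison is. The threshold of four is the rule of thumb given above; the function names and the species labels are placeholders.

```python
MIN_SPECIES_PER_CLADE = 4  # rule of thumb from the text above


def check_clade_sampling(clades, min_species=MIN_SPECIES_PER_CLADE):
    """Report, for each clade, its species count and whether it is enough."""
    report = {}
    for name, species in clades.items():
        n = len(set(species))
        report[name] = {"n_species": n, "enough": n >= min_species}
    return report


def imbalance_ratio(clades):
    """Size of the largest clade divided by the size of the smallest."""
    sizes = [len(set(species)) for species in clades.values()]
    if not sizes or min(sizes) == 0:
        raise ValueError("every clade needs at least one species")
    return max(sizes) / min(sizes)


# Placeholder labels reproducing the 3-versus-15 design discussed above.
clades = {
    "fishes": [f"fish_{i}" for i in range(3)],
    "tetrapods": [f"tetrapod_{i}" for i in range(15)],
}
for name, info in check_clade_sampling(clades).items():
    status = "ok" if info["enough"] else "too few species"
    print(f"{name}: {info['n_species']} species ({status})")
print("imbalance ratio:", imbalance_ratio(clades))
```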

Recommendation number 5: always try to have enough data to give the analysis sufficient power. We are in the genomic era, so a wealth of sequences is available. For example, thousands of sequences provide enough power to predict 3D structures.

Recommendation number 6, and the most important: always try to understand what you are doing, and especially check whether the methods you are using are appropriate for your data. Most evolutionary bioinformatics tools are not straightforward, and some are quite complex. For example, I spent a significant part of my PhD understanding and implementing CodeML/PAML. While I am quite confident with it, I think I will always have things to learn in this field.

=> If you are not familiar with evolutionary bioinformatics tools, ask a colleague or an expert. They will be more than happy to help.
