Sunday, 27 May 2018

Leaving academia: failure or chance? A personal story

This blog post will be a bit more personal, as I will talk about my career, the paths I took and the mistakes I made. This might be useful for other scientists in the same situation, despite every case is different.
[I am voluntarily omitting any names of my collaborators, supervisors, teachers, but I am grateful for all of them for their advices and support]

Undergraduate studies

I was born in a town close to Lausanne, Switzerland. So, naturally, I did my undergraduate studies in biology at the University of Lausanne. Nothing really particular to mention there. The first two years (“propedeutics”) were generic (botanic, zoology, molecular biology) and the next two years were more specialised (biochemistry, immunology). Anecdote: I had the chance to follow the lecture on biophysics from Jacques Dubochet, the recent Nobel prize. He is the kind of professor that made a course alive. One thing he mentioned recently is that he wouldn’t have been able to do a similar academic career today due the intense competition for everything.

During my third year, I did a internship in immunology. It was a good experience, but I destroyed quite a bunch of glass tubes and other tools in the lab. This was a clue that I was not made for any experimental labs, and that the only mice I would then torture would be from Logitech or Apple. During my fourth year, I had the great opportunity to do my master project within the Swiss Institute of Bioinformatics (SIB). I was working with sequences alignments and homology modelling of 3D protein structures. It was a blast and the moment where I knew how to orient my career.

One drawback for studying biology is that most courses are mostly based on strict learning, and not much about problems-thinking. We only have to learn what the professor taught and write it down at the exam. In the end, very little thinking involving computation or equation solving. As I have a very bad memory, I was struggling with some courses like biochemistry or zoology, while I excelled in other courses like physics or java programming (because we could have notes during exams). Another thing I lacked during my undergraduate was a modern course in statistics, and practicals in R. A shame that the current professor of statistics arrived only a couple of years later. I know his lectures are good, because I was involved as assistant for the practical during my PhD studies.

PhD study: 2005-2010

I started my PhD in 2005, still in Lausanne, with a four-year contract. It seems rather long compared to international standards (maybe not US), but I think it is a good length, as I was quite productive in my last year. I worked on computational biology, especially in molecular evolution and phylogenetics, at the Department of Ecology and Evolution. It was a good period of my life. We frequently read that we could be lost/depressed/down during the PhD, but I never had this feeling. It could be due to a good mixture of supervision/environment. There was definitively something special in the department daily life [1]. Lot of social and professional interactions between members, whether you are PhD student, postdocs or group leaders, and a rather flat structure.

A good advice I received from my supervisor was to not wait the end of the PhD to search for the next step, but start earlier. So in 2009, I applied to three fellowships to go to University College London: Swiss National Science Foundation (SNSF), Marie-Curie and EMBO Long-Term. I was rejected to all three during the first round. Some reviewers said my project was not ambitious enough, other said I haven’t had enough papers in good journals (the same who think that bioinformaticians just need 2-3 clicks to process data and publish in Nature...). Luckily I managed to obtain a short-term contract for 2010, right after my PhD. I then re-applied in 2010, and finally obtained the SNSF fellowship. While I applied for the maximum number of years, I was awarded only 1 year of fellowship, due to the University of Lausanne policy. One year is better than nothing, but still, there is not much we could in one year in my opinion.

1st Postdoc 2010-2013: University College London

In August 2010, I flew to London, to join the Department of Structural and Molecular Biology. I worked on a molecular evolution project, but with a deeper focus on 3D protein structures. I stayed three years and it was very good in term of academic freedom, as I came with my own funding. So my time was split between developing my research projects, trying new things (that didn’t work, eh this is science) and participating in the lab’s databases development.

One drawback during my first year is that I had only one year of fellowship, so I wasn’t sure what would be my life after August 2011. This provided a little sense of uncertainty during the first year, and kind of emergency to have to some results at least published in a year. No need to say that in these conditions, I wasn’t planning to do something very ambitious and very long. I also spent some time writing my application for the Advanced SNSF researcher, which I got it and was awarded for two years this time (the maximum allowed by the SNSF). In 2013, I applied to come to Switzerland, but this didn’t work. Luckily, I got another postdoctoral contract at EBI.

2nd Postdoc 2013-2017: European Bioinformatics Institute (EBI-EMBL)

In 2013, I joined EBI-EMBL, in Hinxton, Cambridgeshire, UK. I stayed there for nearly four years, until April 2017. I worked on a phosphoproteomics project. While these years were great at the scientific level and with colleagues, they were quite tough as I started to apply to other, more permanent position in academia (the doom of fixed-term contract for postdoc). By more permanent, I mean tenure-track professor, lecturer, principal investigator or group leader, depending of the institution or funding agency. I applied to almost 50 positions, with barely no success. I was invited for interview to a few of them, but competition was fierce. One of my numerous problems was that my proposed project was not interesting enough. Another problem is I didn’t put an experimental component in my proposals, which would allowed me to target high-impact journals. Newly generated data can be re-used by other and increase the paper citations. I also started to apply to industry position, lightly at the beginning, but more intense as the end of my contract arrived, along with multiple rejections from applications in academia. Luckily, I managed to get a position in industry before the end of my contract. I will write about that in a future post.

Some statistics and various comments about my applications:

I grouped my applications in three sectors:
  1. University, which speaks for itself.
  2. Academia = other non-profit institutes such as EMBL, Pasteur, Crick. They tend to have better funding and higher salaries than local universities.
  3. Industry = pharma and biotech companies.

Timeline of my applications by month and by sector (2013-2017):


My first attempt was in February 2013 for an SNSF Ambizione fellowship. It was just before starting at EMBL-EBI. The success rate was 40% when I decided to leave Switzerland in 2010, but suddenly dropped to 24% in 2013. Depending of the department policy, you can either be a group leader or a “super-super-super post-doc” (not my words) attached to a lab. I was invited for the interview at the SNSF headquarters in Berne. I apparently did a good presentation, but I failed the questions part. I was definitively not well prepared for that. I reapplied for the same scheme in 2014, but not even been invited for the interview this time, and that was the last possible time (maximum five years after PhD). Then I applied for a SNSF professorship in 2015 and 2016. This grant is really good and provides all the money you need to run a small lab (PI +PhD students + postdoc) for 4 years, extensible for 2 more years. Again, other failures, because my level of publications was good, but not excellent (aka not published in Cell , Nature, Science [CNS]). Then, since Summer 2015, I applied to other positions at various places, which I will describe below. 

Number of applications by country and by sector:



  • As a Swiss national living in UK with family, there is no big surprise that I mostly applied in both countries. 
  • Switzerland is extremely competitive for academia. Being a national doesn’t give you any privilege, only scientific record matters. Salary is huge (but living there can be very expensive too, people from abroad don’t necessarily see that), so many people from around the world apply there. For each position I applied, there was between 100 to 250 other applicants. The problem now in Switzerland is they put the tenure-track system. It is not a problem per se, but there is the risk  that after 4-6 years, you cannot stay there. Which means the need to find another job at 40-42 years old, which might not be so easy. And even worse for the SNSF professorship, as the host university doesn’t necessarily have a backup plan to keep the professor after his/her grant expires. But this non-tenure grant system is not restricted to Switzerland and seems to be more and more common around the world. 
  • UK can also be very competitive, depending on the place where you apply. I was invited for an interview at University of Leeds. It is was a very positive experience, I met a lot of interesting people, but in the end I wasn’t selected for their scheme. The main reason was that my research topic was not cutting-edge, so it might difficult to apply to grants. I also applied to some pharma/biotech companies around the London/Oxbridge triangle. Surprisingly, industry is less competitive than academia, as there are much fewer applicants per position (at least in bioinformatics). I always rapidly got an answer and most of the time a phone call. However, I had two handicaps there: 1) I haven’t had previous experience in industry (except a six-month training in Nestlé before my PhD, so looooong time ago) and 2) I didn’t have a sufficient exposure in genomic sequence analysis (NGS, RNAseq, WGS, GWAS), which was most of the time a required skill. Eventually, I did end up working in a biotech company in London, which will be the subject of a future blog post.
  • In France, I only applied to the Pasteur Institute. The first time I applied, I was not invited. I re-applied six months later, and then I was invited for an interview. I was told they spotted my application in the previous round, so they decided to give me a chance this time. But in the end, they funded people with research topics more in line with the institute (and again, I kind of screwed the question/answer part during interview).
  • US: I applied to some places in the US, but was never invited. I guess it is even harder to go there from Europe if you don’t have any network.

[Introspection] Some reasons why I failed in academia:

  • Not working on cutting-edge topics (next-generation sequencing, gwas, single-cell, you name it). While phosphoproteomics is quite fancy these days, it is still less marketable than genomics. I also haven’t put any experimental component in my grant projects, and new data is definitely needed when you want to publish in CNS journals.
  • Didn’t race for the CNS papers early enough in my career. I published my first postdoctoral work in PNAS, and some people call this journal Paper Not Accepted in Science (but no mistakes, I am still proud to have published there, it was the first journal we targeted and the perfect fit for it).
  • During my two postdocs, I should have applied to other fellowships in UK. Getting money from grants increase the success of the next applications.
  • Introverted character. In science, it is quite common to meet introverted people. I made some online test, and without surprise, I am type INT-J with a strong bias towards introversion. Basically, I don’t like to express myself in public, whether it is group meeting or to talk in conference. I am also not particularly keen of walking in large conference halls and trying to talk to people I have never met before. Which could be a problem when you are trying to build your network (so I compensate by communicating on internet).
  • Another problem is I am slightly more interested on the technical aspects (low level) of a project, rather on the general aspect, or big picture (high level). For example, when I review a paper, I will try to replicate the result as much as I can, until I am satisfied. It is useful as a highly skilled engineer, but less as a visionary group leader or manager. I recently had this discussion with a former colleague, and the conclusion is we just need to accept who we are and do what we are good for. And this is a problem in academia, most permanent position are for group leader, who needs managerial skills. Staying in academia as highly skilled professional is more difficult, as there are barely no permanent position as super-postdoc, or you are dependant on the lab budget, which can switch rapidly.

Some random advices (for any level, undergraduates, graduates, postdocs):

  • If you love sciences and have some skills, go for undergraduate studies. And if after that, you think you would do science as your main job, go for a PhD. It is the highest diploma you can get in academia, and many companies in industry are using it as a threshold when doing recruitment for researchers. It also helps to climb the career ladder.
  • But at the same time, you must be aware there was a boom of PhD in sciences, and the academic system cannot sustain the increased demand for jobs. You need to be aware of this at the earliest in your career. You can be among the top scientist of your classes, you can be clever and you can work hard, this might not be enough. At this point, the number of applications per position means there is a stochastic process, where randomness occurs and there is nothing to do against. Just try the best and not be disillusioned if you don’t succeed.
  • To succeed in academia, it is sad, but only publications in CNS journals matters (Cell, Nature, Science). Publications in other top journals (Genome Research, PNAS, Mol Sys Biol, PLoS Biology) are very good, but there will always be someone else coming with such publications in CNS journals.
  • Avoid writing book chapters. I consider it as a loss of time and energy. First, you are not paid for that (but the publisher will make some money). Then, if it is an original study or review: wrap it up to publish it in a peer-review scientific journal (with open access option). Finally, if this is something more technical like a tutorial, just publish it on your own blog, free and accessible for everyone.
  • Try to learn methods that are useful in industry, in case you want to shift at some stage of your career.
  • One year before the end of your current contract (PhD, postdocs), start looking for the next step in your career. It takes time to find a host, to apply for grant, then to re-apply in case of rejection.
  • If you want to do postdoc, I recommand to apply for fellowships (Marie-Curie, EMBO, SNSF, etc...). They can offer a greater freedom and you could more easily find a lab, as you are going with your own funding.
  • If you are hesitant about industry or academia, don’t waste time on a postdoc and start applying to industry position.
  • If you didn’t succeed to publish your postdoc project in Cell/Nature/Science, or at least one of the top journals in your field, start reconsidering your career in academia.
  • Doing a second postdoc means you didn’t succeed to go up to the next level on the academic scale during your first postdoc. At this stage, you should already be on lecturer or professor tenure-track position, or at least on an independent fellowship leading to lecturer/professor position. So it is really time to move out academia and search for another career.


Despite the story I wrote seems rather bleak, it doesn’t mean failing in academia is the end of everything. I started working in a top-notch company exactly one year ago, and I didn’t regret it. This will be the story for another blog post.


(1) Our third area of specialisation consists in having weekly parties and numerous other social events. We realised that mutualism and cooperation lie at the heart of the ecological success of many organisms on earth and therefore make use of this important observation to enhance productivity and happiness. Collaborations between the different research groups is a Department priority, with two weekly seminars and numerous journal clubs taking place. All the resources and infrastructures are shared between research groups, providing a stimulating and nice atmosphere to students and all other members of the Department.

Further readings:

Sunday, 29 April 2018

Internet and privacy: useful tools

Hi all,

In this post, I will talk about something different than sciences. It is about privacy on internet. The recent scandal of Facebook - Cambridge Analytica has revealed what was known for a long time: if a service is 100% free, the likelihood you are paying with your data is great. And data could be anything about you on internet. Think about what you searched on internet, pictures of your kids on Instagram, where you have been with your smartphone (GPS), any email you exchanged (health matter, commercial transaction, bank account). I had a yahoo email account for 20 years. It was great for a long time, but started to goes down in the last years. And more worryingly, I was part of the hacked batch (like other 3 billions) a few years ago. Anyway, some of my friends received spam emails from (and I got some too). So I started to switch some of my internet services.

I will now present some options for adding a bit more privacy while using internet (web browser, web searches, emails and instant messaging).

Web browser:

Mozilla Firefox:

I am a longtime user of Firefox. The new core engine ("Quantum") is incredibly fast.

Some useful extensions:

- HTTPS everywhere:
==> check if the website can have 'https' instead of 'http' (in case they have both)

- NoScript Firefox extension:
=> block any script. A bit tedious to set up white list, but then it is great.

- uBlock Origins:
=> also block ads and various things.

Other tools (not tested):
- uMatrix:

Web searches:


It is a free search engines, that agglomerate results from Wikipedia, Bing, Yahoo and others.
It doesn't track/keep your searches (instead of Google).
The business model is based on feature ads, but not personalised.

You can easily switch it as your main engine in Firefox, Safari or Chrome.
(i.e. Firefox: Preferences => Search => Default Search Engine).

It also has some geeky useful features:

Theme settings:

Many features to alter the webpage theme.


Country location: 

You can specific the country of searches, or agnostic.


Cheat sheet:

Try these searches:
firefox cheat sheet
python cheat sheet
pandas cheat sheet
emacs cheat sheet

Instant Answers:

For example, they provide highlight from Stack overflow if your search is IT oriented.

pandas check float
seaborn time series


Bang provides direct access to your webservices, using the keyword "!bang" in front of your search.

!pubmed protein structure evolution codeml
!uniprot TLR4_human
!m camden town
!w hawking

A list for academia bangs is here:

An other alternative to DuckDuckGo would be Qwant:
It has the advantage of been hosted in Europe, if someone doesn't want to have any data in US (personally I don't mind).



As mentioned, I used to have Yahoo! mail. After the hack, I then shifted to ProtonMail, and I am still using it (even moved to the monthly subscription option).

Some advantages:

- End-to-end encrypted email, locally stored in Switzerland.
- Open source protocol:
- Encryption algorithm proof-tested.
- Free (with limited space).
- Subscription option with reasonable price, allows more emails and more space.
- Clean web interface and app for iOS and Android.
- Team communicates on Twitter in case of new features or problems.
- Bridge app to use with Outlook, Thunderbird, Apple Mail (never tried):


- No possibility to directly search in the body of email, due to encryption.
- No IMAP (for the moment), but  bridge allows the integration with clients.
- And of course, the end-to-end encrypted encryption only works if your contacts also use encrypted emails.

An alternative would be FastMail:

Instant messaging:


While WhatsApp is the mostly used app, it belongs to Facebook, and it has been revealed they are exchanging data:

Signal is a good alternative:
- More or less same functions as WhatsApp (messages, photo, phone call, video call).
- Open source:
- Encryption proof-tested.

Another option would be Telegram:
I never tried it. And its encryption algorithm has never been proof-tested.


I presented some services I liked. I don't say we should give up all services provided by the GAFA (i.e this blog is hosted on Blogger, from Google), but we should be aware of the trade-off for using free services.

I might update this page if I find other interesting tools in the future.

That's it. ^^

Sunday, 24 September 2017

Update about this blog / Contact

Some updates about this blog

Dear all,

I started this blog six years ago. Reading the first post ( "I hope I will be able to keep a good pace during the year (I set an objective of two articles/month)." => I was indeed quite optimistic! But it seems the fate of many blogs. The direction I took was finally slower and more oriented towards technical things. But it seems to fill a gap, considering the number of views and comments per month, especially for the tutorials on bioinformatics tools.

I would like to share some relevant updates.

Moving from academia to industry

Early this year, I made a big change in my professional life, by moving from academia to industry (after 12 years in academia, since the start of my PhD study). By discussing with many PhD students and postdocs, I will likely write a post detailing this transition. This could be of great help for academic researchers and currently afraid of their professional future (which is natural, considering that only 3% of PhD students will get a faculty position).

Email address

As I recently changed of employer, my previous institutional email address was closed. You can now contact me at the following address (which is not my professional address), for questions or dataset sharing:
Remember that all emails are treated privately, in particular if you send me unpublished dataset.

Donate button

A few months ago, I set up a donate button on the right. First, I would like to thank all who already made a contribution, this is very much appreciated. I will mainly use it to buy new scientific books.

For the precision, this donation is purely optional, and I will still try to answer your questions to the best of my knowledge (and my own pace, which can be slow, sorry for procrastinating). This blog will still be free. As I appreciate when people answer to my questions, I feel beholden to do the same in return. But of course, if I see more encouragement, this might motivate me to write more posts.


Finally, I started to put some advertising on this blog. I will continue this for a few months, but I have the feeling I will remove it after a while. The blog doesn't have enough visits for such system.

Best regards,

Friday, 14 April 2017

Enable java applications in Firefox 52 and above

I was trying to use Phylowidget (, which is a great tool to edit phylogenetic trees (i.e. regrafting branches to other branches):

The problem is it doesn't seem to work anymore!

Why? Simply because Firefox 52 now blocks every plugins except Flash:

Here is a solution to allow Firefox to use others plugins than Flash:

1) Go to main menu Apple -> System Preferences, then Java (very bottom).
Then go to Security panel, and add the url to the Exception Site list.
Here is my parameters:

2) In the Firefox title bar, write:
=> select: I accept the risk!

3) To the very right of the list, do right-click and select:

Then write: plugin.load_flash_only

And set it to "false".

4) Restart Firefox and finally go again to Phylowidget (
Now it should work. And this should be similar step with any other java applet, such Jalview (

Saturday, 18 June 2016

Detecting pervasive positive selection with site-models from CodeML / PAML

Disclaimer: Please don't hesitate to contact me if there is anything which is not working on your computer, or any thing unclear, or even comments to improve it.
My email address:

Theoretical principles:
When mutations are advantageous for the fitness, they are propagated at a higher rate in the population. The selective pressure can be computed by the dN/dS ratio (ω). dS represents the synonymous rate (keeping the same amino acid) and dN the non-synonymous rate (changing the amino acid). In the absence of evolutionary pressure (genetic drift), the synonymous and non-synonymous rates are supposed to be equal, so the dN/dS ratio is equal to 1. Under purifying selection, natural selection prevents the replacement of amino acidS, so the dN will be lower than the dS, and dN/dS < 1. And under positive selection, when mutations are advantageous for the fitness, they are propagated at a higher rate in the population, so the replacement rate of amino acid is favoured by selection, and dN/dS > 1.

We can distinguish two types of positive selection: pervasive positive selection and episodic positive selection. The former implies that a site will be under continuous changes (i.e. adapting to pathogens under arm-race), while the later implies that a site will change once and then be kept in the clade (i.e. providing an advantage in a new environment). We can detect the later using the branch-site model, for which I wrote the previous tutorial:

To detect pervasive positive selection, we will use the site models from CodeML/PAML. Those models allow the clustering of aligned columns (sites) in different groups, each group having a different dN/dS value. There are many different sites models in CodeML, all assuming that the dN/dS ratio is the same across branches, but different between sites.

Here are the different models we will use in this tutorial:

M0: one unique dN/dS for all sites. This is the most basic model.

M1a: assumes two categories of sites: sites with dN/dS<1 (negative selection) and sites with dN/dS =1 (neutral evolution).

M2a: assumes three categories of sites: sites with dN/dS<1 (negative selection), sites with dN/dS=1 (neutral evolution) and sites with dN/dS>=1 (positive selection).

M3: assumes multiple categories of selection, not necessarily positive selection.

assumes 10 categories following a beta-distribution of sites, all with different dN/dS <=1.

M8: assumes 10 categories following a beta-distribution of sites, grouped, all with
different dN/dS <=1, and an additional 11th category with dN/dS >=1 (positive selection allowed).

M8a: assumes 10 categories following a beta-distribution of sites, grouped, all with dN/dS <=1, and an additional 11th
category with dN/dS =1 (no positive selection allowed).


For this practical, you will have to install some tools and downloadS some of my scripts.

- Python2:
- BioPython:
- Jalview:
- Newick utilities:
- PyMOL:
- R:
- RevTrans:
- TrimAl:

You can install many of these packages with Homebrew or Linuxbrew: 

brew install python
brew install R
brew install homebrew/science/mafft
brew install homebrew/science/newick-utils
brew install homebrew/science/paml
brew install homebrew/science/pymol
brew install homebrew/science/trimal 

The python scripts you need to install:


You can download them from my GitHub account:

And install them in any accessible directory (i.e. in the working directory or in the $PATH list).


We will focus on the major histocompatibility complex (MHC) protein, which detects peptides from pathogens. As this gene is in the front line against invaders, it is submitted to strong selective pressure to rapidly detect new antigenic peptides. Early work on positive selection was focused on the MHC, so this is a very good example for this practical.

The uniprot code is HLA class II histocompatibility antigen, DQ beta 1 chain.
The Ensembl gene id is: ENSG00000179344.

# Download data from Ensembl:

Go on Ensembl website, and search for ENSG00000179344

1) Download orthologues sequence:
Comparative Genomics => Orthologues => Download orthologues 

Then choose: Fasta, Unaligned sequences – CDS).

2) Download subtree:

Comparative Genomics => Gene tree
Click on the blue node to select the following group: “Placental mammals ~100 MYA (Boreoeutheria)”
Gene Count    32

=> Export sub-tree   Tree or Alignment
=> Format Newick, options "Full (web)" and "Final (merged) tree".

You have now two starting files:

Sequences file: Human_HLA_DQB1_orthologues.fa
Tree file: HLA_DQB1_gene_tree.nh

We are now going to process these files, in order to generate an alignment, visualise it, remove spurious sequences and columns.

# First, let’s rename these files to have shorter names

cp Human_HLA_DQB1_orthologues.fa HLA_DQB1.cds.fasta
cp HLA_DQB1_gene_tree.nh HLA_DQB1.nh

# Remove the species tag in the gene name

./ HLA_DQB1.nh > HLA_DQB1.tree

# Extract gene names with Newick Utilities 

nw_labels -I HLA_DQB1.tree > HLA_DQB1_names.txt 

# Extract CDS sequences that are in the tree, according to the extracted names

./ HLA_DQB1_names.txt HLA_DQB1.cds.fasta > HLA_DQB1_subset.cds.fasta

# Translate CDS sequences to Amino Acid sequences using python script:

./ HLA_DQB1_subset.cds.fasta > HLA_DQB1_subset.aa.fasta

# Make an alignment of these Amino Acid sequences

mafft-linsi HLA_DQB1_subset.aa.fasta > HLA_DQB1_subset.aa.mafft.fasta

# Align CDS sequences by mapping them on the AA Alignment HLA_DQB1_subset.cds.fasta HLA_DQB1_subset.aa.mafft.fasta > HLA_DQB1_subset.cds.mafft.fasta

# With Jalview, load Human_HLA_DQB1_orthologues_subset.cds.mafft.fasta

# Visualise the alignment to see if there isn’t anything wrong.
# Move human sequence (ENSP00000364080) on top (use the key arrows). This 

# is to tell CodeML to use ENSP00000364080 as reference.
# Save the alignment as FASTA file again (CTRL-S).

# Remove spurious sequences and columns with TrimAl

trimal -automated1 -in HLA_DQB1_subset.cds.mafft.fasta -resoverlap 0.75 -seqoverlap 85 -out HLA_DQB1_subset.cds.mafft.trimal.fasta -htmlout HLA_DQB1_subset.cds.mafft.trimal.html -colnumbering > HLA_DQB1_subset.cds.mafft.trimal.cols

# Convert trimed sequences from FASTA to PHYLIP HLA_DQB1_subset.cds.mafft.trimal.fasta HLA_DQB1_subset.cds.mafft.trimal.phy

# Extract gene names to id.list

grep ">" HLA_DQB1_subset.cds.mafft.trimal.fasta | cut -c 2- > id.list

# Using Newick Utilities, we load the id file to extract a pruned subtree from the starting tree (contains only the taxa from the alignment).

id_list=`cat id.list`
echo "$id_list"
nw_prune -v HLA_DQB1.tree $id_list > HLA_DQB1_subset.tree

# Now we end up with two files:

Alignment: HLA_DQB1_subset.cds.mafft.trimal.phy
Tree: HLA_DQB1_subset.tree

2) Estimation of evolutionary values

This is the core of the tutorial. We will use codeml with three different control files (.ctl). Each computation could take up to 30-60 minutes, depending of your CPU.

# Compute many different site models: M0, M1a, M2a, M3 and M7. Save the following commandS in HLA_DQB1_M0M1M2M3M7M8.ctl file.

     seqfile = HLA_DQB1_subset.cds.mafft.trimal.phy  * sequence data file name
    treefile = HLA_DQB1_subset.tree                  * tree structure file name
     outfile = HLA_DQB1_M0M1M2M3M7M8.mlc             * main result file name

       noisy = 9   * 0,1,2,3,9: how much rubbish on the screen
     verbose = 1   * 1: detailed output, 0: concise output
     runmode = 0   * 0: user tree;  1: semi-automatic;  2: automatic
                   * 3: StepwiseAddition; (4,5):PerturbationNNI; -2: pairwise

     seqtype = 1   * 1:codons; 2:AAs; 3:codons-->AAs
   CodonFreq = 2   * 0:1/61 each, 1:F1X4, 2:F3X4, 3:codon table
       clock = 0   * 0: no clock, unrooted tree, 1: clock, rooted tree
      aaDist = 0   * 0:equal, +:geometric; -:linear, {1-5:G1974,Miyata,c,p,v}
       model = 0   * models for codons:
                   * 0:one, 1:b, 2:2 or more
dN/dS ratios for branches
     NSsites = 0 1 2 3 7 8 * 0:one w; 1:NearlyNeutral; 2:PositiveSelection;       
                           * 3:discrete; 4:freqs; 5:gamma;6:2gamma;
                           * 7:beta;8:beta&w;9:beta&gamma;10:3normal

       icode = 0   * 0:standard genetic code; 1:mammalian mt; 2-10:see below
       Mgene = 0   * 0:rates, 1:separate; 2:pi, 3:kappa, 4:all
   fix_kappa = 0   * 1: kappa fixed, 0: kappa to be estimated
       kappa = 2   * initial or fixed kappa
   fix_omega = 0   * 1: omega or omega_1 fixed, 0: estimate
       omega = 1   * initial or fixed omega, for codons or codon-based AAs

       getSE = 0       * 0: don't want them, 1: want S.E.s of estimates
RateAncestor = 0       * (0,1,2): rates (alpha>0) or ancestral states (1 or 2)
  Small_Diff = .45e-6  * Default value.
   cleandata = 0       * remove sites with ambiguity data (1:yes, 0:no)?
 fix_blength = 0       * 0: ignore, -1: random, 1: initial, 2: fixed

# And execute it (this can take ~ 30-45 minutes)::

codeml HLA_DQB1_M0M1M2M3M7.ctl

# Important! Copy rst file to another name:

cp rst HLA_DQB1_M0M1M2M3M7.rst.txt

# One last thing is to compute site model M8a, which is the same as M8, except we fix the dN/dS to 1 (only negative selection and neutral evolution allowed). Save the following commandS in the file

     seqfile = HLA_DQB1_subset.cds.mafft.trimal.phy   * sequence data file name
    treefile = HLA_DQB1_subset.tree                  * tree structure file name
     outfile = HLA_DQB1_M8a.mlc                      * main result file name

       noisy = 9   * 0,1,2,3,9: how much rubbish on the screen
     verbose = 1   * 1: detailed output, 0: concise output
     runmode = 0   * 0: user tree;  1: semi-automatic;  2: automatic
                   * 3: StepwiseAddition; (4,5):PerturbationNNI; -2: pairwise

     seqtype = 1   * 1:codons; 2:AAs; 3:codons-->AAs
   CodonFreq = 2   * 0:1/61 each, 1:F1X4, 2:F3X4, 3:codon table
       clock = 0   * 0: no clock, unrooted tree, 1: clock, rooted tree
      aaDist = 0   * 0:equal, +:geometric; -:linear, {1-5:G1974,Miyata,c,p,v}
       model = 0   * models for codons:
                   * 0:one, 1:b, 2:2 or more dN/dS ratios for branches
     NSsites = 8   * 0:one w; 1:NearlyNeutral; 2:PositiveSelection; 3:discrete;
                   * 4:freqs; 5:gamma;6:2gamma;

                   * 7:beta;8:beta&w;9:beta&gamma;10:3normal
       icode = 0   * 0:standard genetic code; 1:mammalian mt; 2-10:see below
       Mgene = 0   * 0:rates, 1:separate; 2:pi, 3:kappa, 4:all
   fix_kappa = 0   * 1: kappa fixed, 0: kappa to be estimated
       kappa = 2   * initial or fixed kappa
   fix_omega = 1   * 1: omega or omega_1 fixed, 0: estimate
       omega = 1   * initial or fixed omega, for codons or codon-based AAs

       getSE = 0       * 0: don't want them, 1: want S.E.s of estimates
RateAncestor = 0       * (0,1,2): rates (alpha>0) or ancestral states (1 or 2)
  Small_Diff = .45e-6  * Default value.
   cleandata = 0       * remove sites with ambiguity data (1:yes, 0:no)?
 fix_blength = 0       * 0: ignore, -1: random, 1: initial, 2: fixed

# And execute (this can take ~ 5-10 minutes):

codeml HLA_DQB1_M8a.ctl

3) Results

3.1) Identification of positive selection

Have a look at the mlc files. If you want to retrieve the log-likelihood values:
grep "lnL" *.mlc

HLA_DQB1_M0M1M2M3M7M8.mlc:lnL(ntime: 32  np: 34):  -4838.327776      +0.000000
HLA_DQB1_M0M1M2M3M7M8.mlc:lnL(ntime: 32  np: 35):  -4711.107980      +0.000000
HLA_DQB1_M0M1M2M3M7M8.mlc:lnL(ntime: 32  np: 37):  -4692.732347      +0.000000
HLA_DQB1_M0M1M2M3M7M8.mlc:lnL(ntime: 32  np: 38):  -4692.078126      +0.000000
HLA_DQB1_M0M1M2M3M7M8.mlc:lnL(ntime: 32  np: 35):  -4718.466841      +0.000000
HLA_DQB1_M0M1M2M3M7M8.mlc:lnL(ntime: 32  np: 37):  -4690.545425      +0.000000
HLA_DQB1_M8a.mlc:         lnL(ntime: 32  np: 36):  -4706.268471      +0.000000


The order of lines being: M0, M1, M2, M3, M7, M8 and M8a. For each model, you directly get the number of parameters (np) and the log-likelihood value:


Using these models, we can construct four likelihood-ratio tests (LRT), where three of them will tell us if there is significant positive selection or not.

1) M0-M3: this one is an exception and will only tell us if there are different categories of sites under different selective pressures. This test is not used to detect positive selection, and it is nearly always significant.

2x(L1-L0) = 2x[(-4692.0781) – (-4838.3278)] = 292.4993
d.f. = 38-34 = 4
=> 4.49405E-62

2) M1a-M2a: this test was the first site model developed to detect positive selection. We contrast a model with 2 classes of sites against a model with 3 classes of sites. Degree of freedom = 2.

2x(L1-L0) = 2x[(-4692.7323) – (-4711.1080)] = 36.7513
d.f. = 37-35 = 2
=> 1.04608E-08

The test is significant, so there is positive selection. This model is very conservative, and can lack power under certain conditions.

3) M7-M8: this test also detects positive selection. We contrast a model with 10 classes of sites against a model with 11 classes of sites. Degree of freedom = 2.

2x(L1-L0) = 2x[(-4690.5454) – (-4838.3278)] = 55.842832
d.f. = 37-35 = 2
=> 7.47968E-13

The test is significant, so there is positive selection. However, this model can have problem power under certain conditions, and the following LRT is preferred.

4) M8-M8a: this is the latest test. We contrast a model with 11 classes of sites where positive is not allowed (dN/dS=1) against a model with 11 classes of sites where positive is allowed (dN/dS >=1). Degree of freedom = 1.

2x(L1-L0) = 2x[(-4690.5454) – (-4706.2685)] = 31.446092
d.f. = 37-36 = 1
=> 1.48446E-07

This is the preferred test, combining power and robustness.

3.2) Identification of sites 

As these tests are significant, we can move to the next step, which is the precise identification of sites under positive selection. If the previous step was not significant, we should not move to this stage.

In your mlc file, under the section of Model M2a and M8 (the only ones that allow positive selection), you will find a section called “Bayes Empirical Bayes (BEB) analysis (Yang, Wong & Nielsen 2005. Mol. Biol. Evol. 22:1107-1118)”. This section contains the list that have a BEB score [Pr(w>1)] higher than 50%. BEB values higher than 95% are indicated by * and sites with values higher than 99% are indicated by **. Sometimes, there is no site detected, which means there is probably a problem in your analysis or dataset if your test is signficant. Sometimes, you will find a lot of sites, which seems worrying, but it just means the average BEB (baseline) is slightly above 50%. The most interesting sites are those with a BEB>95% (* or **).

Bayes Empirical Bayes (BEB) analysis (Yang, Wong & Nielsen 2005. Mol. Biol. Evol. 22:1107-1118)
Positively selected sites (*: P>95%; **: P>99%)
(amino acids refer to 1st sequence: ENSP00000364080)

            Pr(w>1)     post mean +- SE for w
     4 R      0.961*        2.858 +- 0.698
    36 F      0.671         2.150 +- 1.019
    53 L      0.999**       2.956 +- 0.610
    84 D      1.000**       2.957 +- 0.608
    97 G      0.649         2.098 +- 1.018
   112 V      0.969*        2.888 +- 0.695
   114 F      0.999**       2.955 +- 0.611
   115 R      0.743         2.367 +- 1.093
   116 G      1.000**       2.957 +- 0.609
   121 R      0.835         2.593 +- 0.993
   247 R      0.617         2.050 +- 1.110

The column correspondS to the position in the trimmed alignment (i.e. not related to the position in the reference sequence, which is ENSP00000364080, the human). However, the amino acid correspondS exactly to the reference sequence.

One of the site with the strongest BEB value is 84, with 1.000.  Its own dN/dS value is 2.957, with a standard deviation of 0.608. As I said previously, the sites given by CodeML don’t correspond to the human sequence. You can use the following script to extract the real position in the human sequence:

./ HLA_DQB1_subset.cds.mafft.fasta HLA_DQB1_subset.cds.mafft.trimal.cols "ENSP00000364080" 84
=> 84 89 D

=> This site 84 in CodeML (Trimal) correspondS to amino acid site 89 in the human sequence and codes for an aspartic acid.

# In total, we have six sites that are strongly interesting (BEB>95%): 4, 53, 84, 112, 114 and 116. Let’s repeat the same for all these sites:

site_list="4 53 84 112 114 116" 
for site in $site_list; do ./ HLA_DQB1_subset.cds.mafft.fasta HLA_DQB1_subset.cds.mafft.trimal.cols "ENSP00000364080" $site; done

  4   8 R
 53  58 L
 84  89 D
112 117 V
114 119 F
116 121 G

# We can do the same for sites with 50%<BEB<95%:

site_list="36 97 115 121 247"

for site in $site_list; do ./ HLA_DQB1_subset.cds.mafft.fasta HLA_DQB1_subset.cds.mafft.trimal.cols "ENSP00000364080" $site; done

 36  41 F
 97 102 G
115 120 R
121 126 R
247 252 R

# We can also plot the dN/dS value per sites. At the bottom of the rst file "HLA_DQB1_M0M1M2M3M7.rst.txt", extract the last part which looks like this and save as “beb.txt”:

   1 K   0.02943 0.13133 0.20099 0.20092 0.16262 0.11574 0.07524 0.04540 0.02539 0.01292 0.00002 ( 3)  0.395 +-  0.198
   2 A   0.00278 0.02154 0.05527 0.09166 0.12189 0.14109 0.14752 0.14134 0.12323 0.09274 0.06093 ( 7)  0.737 +-  0.535
   3 L   0.00726 0.04637 0.09959 0.13896 0.15621 0.15361 0.13717 0.11292 0.08526 0.05671 0.00595 ( 5)  0.551 +-  0.270
   4 R   0.00000 0.00000 0.00000 0.00003 0.00019 0.00082 0.00248 0.00582 0.01104 0.01731 0.96230 (11)  2.860 +-  0.696
   5 I   0.44686 0.24476 0.14583 0.08007 0.04205 0.02141 0.01061 0.00510 0.00234 0.00099 0.00000 ( 1)  0.168 +-  0.149
   6 P   0.34022 0.22595 0.16343 0.10859 0.06876 0.04207 0.02495 0.01430 0.00780 0.00391 0.00001 ( 1)  0.221 +-  0.187

1st column = position in the trimmed alignment.
2nd column = amino acid from the reference sequence.
3rd to 13th column = BEB score for each class (10 neutral + 1 allowing positive selection).
14th = most likely class.
15th = estimated dN/dS value at this position.
16th = standard deviation for this dN/dS value.

You can see that position 4, the most likely class is 11th (BEB=0.96) with a dN/dS = 2.86+-0.70.

# CodeML output uses a fixed delimitation. To parse it in R, we need to remove the space between bracket and number:

cat beb.txt | perl -pe "s/\( /\(/g" > tmp.txt; mv tmp.txt beb.txt

# Here are some commandS to use in R, to produce the following plot:

if (!require("ggplot2")) {
   install.packages("ggplot2", dependencies = TRUE)

df<-read.table("beb.txt", sep = "")

df$beb <- "No"
df$beb[df$V13 > 0.50] <- "Yes"

p <- ggplot(df, aes(V1, V15))
p + geom_point(aes(colour = factor(beb)))+
    geom_hline(yintercept = 1)+
    scale_color_manual(values = c("black", "red"))+
    labs(x = "Residue position")+
    labs(y = "Selective pressure [dN/dS]")+
    theme(axis.line = element_line(colour = "black"),
    panel.grid.major = element_blank(),
    panel.grid.minor = element_blank(),
    panel.background = element_blank())
ggsave("beb.png", height=3, width=4)

We can see that sites under positive selection represent only a small fraction, and most sites are under strong purifying selection.

4) Visualisation of sites in 3D structure

We found that sites seems randomly distributed according to their residue position. It would be interesting to see if they form a pattern in the 3D structure. Go download the following pdb file 1uvq:


Or download it from the webpage:

Then load it in PyMOL:

pymol 1UVQ.pdb &

# First, define the different molecules to their associated polypeptide chains:

select HLA_DQA1, chain A
select HLA_DQB1, chain B
select peptide,  chain C

# We highlight in cartoon/sticks and colour chains and peptide in different colours:

hide everything
show cartoon, HLA_DQA1
show cartoon, HLA_DQB1
show sticks, peptide
colour white, HLA_DQA1
colour grey60, HLA_DQB1
colour green, peptide

# To spice things up, the site numbering in many PDB files doesn’t correspond to the human sequence. So we have to renumber it:

renumber HLA_DQB1, 35

# Highlight sites with BEB>95%:

select sites_BEB95, HLA_DQB1 and resi 58+89+117+119+121
show spheres, sites_BEB95
colour yellow, sites_BEB95

# Highlight sites with 50%<BEB<95%:

select sites_BEB50, HLA_DQB1 and resi 41+102+120+126+252
show sticks, sites_BEB50
colour yellow, sites_BEB50

=> Most sites under positive selection (in yellow) are exactly in the binding site (in green), facing the target peptide. This example with the MHC has been widely described in the literature [Hugues AL et al. 1988]. To my disappointement, there are not so many exemples that selective pressure, 3D structural and experimental validation:

To have a nice finish as in the figure above, rotate the structure the way you want and type: 

bg_color white
select none
ray 2000
save 1UVQ_positive_sites.png

Et voila! I hope this tutorial was helpful.

Friday, 22 April 2016

Using package manager Homebrew (Mac OSX) for science / computational biology

I am using Mac OSX on a daily use for many reasons.

One reasons is that you can use the classical tools from the Windows world (Microsoft Office, Endnote) that don't exist on Linux. It is important to me as many of my external collaborators are using Windows, so it can be problematic when sharing documents. I don't exclude that I would shift one day to Linux.

The second reason, and I think it was one of the best strategy from Apple, was the transition to Unix in 2001. This greatly improved the compatibility and access to tons of scientific softwares. The installation of unix tools from source is usually done by the classical "./configure; make; sudo make install".

Hopefully, package managers exist to avoid this task, and to keep the system tidy, especially with library dependencies. At the very beginning, when I started my PhD in 2005, I was using Fink. I then switched to MacPort. Last year, I encountered some problems with MacPort and some versions of GCC. Everything was messy and I decided, as I switched to El Capitan from 10.8, to move to Homebrew. So I remove my MacPort installation and installed HomeBrew. And Homebrew is really great!

(Shaun Jackman told me on Twitter that he is maintaining the linux fork of Homebrew, available here: )

Some reasons I like it:
  • Very easy to use.
  • No need to use super-user, as it installs in the user directory.
  • It tries to use already libraries if possible.
  • There is one way for unix-style software and one way for .dmg package (cask).

To install it, you need:
  • MacOSX 10.9 ou +
  •  Xcode (the most recent). [NB: Xcode is not needed anymore. Save a lot of space!]
  • Command Line Tools (Install with command: xcode-select --install)

You can install Homebrew from there:

You just need this command, and homebrew will auto-install itself:

/usr/bin/ruby -e "$(curl -fsSL"

Then it is very easy to use in the Terminal. For example, if your are interested to install "Gimp":

brew search gimp => search for packages that contains the word "gimp".

brew info gimp => describe gimp package, and which dependencies are needed.

brew install gimp => install gimp.

brew list => show everything installed on your computer.

There are other sections in homebrew. For exemple, a category specific to games, or a category specific to science. To include those directly on your homebrew, you can tap them:

brew tap homebrew/games
brew tap homebrew/science

Now, you can install CodeML / PAML:
brew install paml

For example, here is what I have installed from the science category:

brew leaves | grep "science"


You will recognise many useful tools. Nice, isn't it?

However, there are still some tools that cannot be installed with Homebrew:

brew search biopython => No formula found for "biopython".
brew search jalview => No formula found for "jalview".
brew search pagan => No formula found for "pagan".
brew search njplot => No formula found for "njplot".
brew search netphorest => No formula found for "netphorest".
brew search vmd => not the one I search for. :/

(I will update this list once I installed them)

There is also the CASKROOM, which contains other packages, such as .dmg packages.

Same, you can merge with homebrew:
brew tap caskroom/cask

For example, to install Firefox:

brew cask install --force figtree
=> it will install Firefox in your and create a shortcut in /Users/yourname/Applications

If you want to install in /Applications:
brew cask install --appdir="/Applications" --force figtree

=> it will force the creation of the shortcut to /Applications. But I prefer to install in my own Applications folder, it is cleaner.

Here is my short list:

brew cask list | sort


It is easy to maintain and update all your package (except the ones from CASK, I don't know why)

brew doctor => check that everything is ok. It can display lot of warnings, especially if you previously installed programs by hand.

brew update => update your package list definition.

brew upgrade =>  upgrade all your outdated packages

brew upgrade $FORMULA => upgrade only the specific formula (i.e. mrbayes).

brew cleanup
brew cask cleanup

You can run the maintenance in one line:
brew update; brew upgrade; brew cleanup; brew cask cleanup

For the packages from CASK, here is how I update them (with the --force option):

brew cask install --force --srgb --with-cocoa emacs

brew cask install --force java
brew cask install --force arduino
brew cask install --force avogadro
brew cask install --force bittorrent
brew cask install --force cathode
brew cask install --force figtree
brew cask install --force filezilla
brew cask install --force gimp
brew cask install --force grandperspective
brew cask install --force google-earth
brew cask install --force handbrake
brew cask install --force kompozer
brew cask install --force kid3
brew cask install --force mplayer-osx-extended
brew cask install --force rstudio
brew cask install --force scribus
brew cask install --force silverlight
brew cask install --force slack
brew cask install --force teamviewer
brew cask install --force thunderbird
brew cask install --force unetbootin
brew cask install --force unrarx
brew cask install --force vlc
brew cask install --force vox

brew cask install --force --appdir="/Applications" firefox
brew cask install --force --appdir="/Applications" libreoffice
brew cask install --force --appdir="/Applications" skype

You can also do it in one line like this:
brew cask list | xargs brew cask install --force

That's it!

I strongly recommend to use Homebrew, especially if you start from a fresh system. Homebrew community is very active and new packages are put every weeks.


PS: other interesting blog posts: