T05 - Protein Evolution Analysis: on the Use of Phylogenetic Trees


Brandon Invergo - EMBL-EBI, Cambridge, UK (invergo [at] ebi.ac.uk)

David Ochoa - EMBL-EBI, Cambridge, UK (ochoa [at] ebi.ac.uk)

Romain Studer   - EMBL-EBI, Cambridge, UK (rstuder [at] ebi.ac.uk)




Homologous proteins, which share a common ancestor, can be classified into families. Homologs can be orthologs, which were separated by a speciation event, or paralogs, which were separated by a duplication event. Within a protein family, all members are related by a phylogenetic tree, which consists of a root (the last common ancestor of the protein family), nodes (speciation or duplication events), branches (whose lengths correspond to the number of substitutions) and tips (modern sequences). The tree is helpful for inferring the evolutionary history of the protein family. For example, we can reconstruct the ancestral sequences at each node of the tree. These ancestral sequences can be used for homology modelling, to reveal the ancestral 3D structures, or synthesised in vitro. We can also compare trees to reveal similar evolutionary histories between protein families (co-evolution). Manipulating tree topologies is complex and requires tools for operations such as reading, pruning, collapsing and rerooting. These operations can be done with programs that have graphical user interfaces (GUIs). However, in the era of large-scale data, in which hundreds or thousands of trees may be manipulated, it is impractical to use such programs. To this end, new software and libraries have been developed to deal with such large data sets in an automated manner.



This tutorial will present recent concepts regarding the evolution and adaptation of protein sequences. It will be divided into three sections, each presenting methods that use phylogenetic trees to infer protein function: 1) using scripts to manipulate trees, 2) using ancestral sequence reconstruction to infer the history of a protein family and 3) detecting co-evolution between protein families. Each section will begin with an introduction to the concepts underlying the analysis methods, followed by a discussion of the power and limitations of the methods and tools that participants will learn to use during the practical for that section.



The first part will focus on tools to detect adaptation in protein sequences. It will start with a brief introduction to multiple alignment and phylogenetic trees, followed by a more detailed presentation of the tools available in CodeML / PAML to estimate selective pressures and detect adaptation in protein sequences. The second part will focus on the reconstruction of ancestral sequences and ancestral structures by homology modelling. The third part will focus on the identification of co-evolution between protein families. The organisers will provide protein datasets, or participants can bring their own sequences, so that they can use the different programs and libraries in a practical way.


1) Performing phylogenetic analyses with Biopython

The efficient analysis of large phylogenetic data sets necessitates robust scripting tools. Examples include Newick utilities (Junier et al., Bioinformatics 2010), the R package “ape”: Analyses of Phylogenetics and Evolution (Paradis et al., Bioinformatics 2004), ETE [a Python Environment for phylogenetic Tree Exploration] (Huerta-Cepas et al., BMC Bioinformatics 2010) and Bio.Phylo for Biopython (Talevich et al., BMC Bioinformatics 2012). This part of the tutorial will focus on using Bio.Phylo to explore, manipulate and analyse phylogenetic trees. Biopython is a library for the Python programming language that implements a variety of commonly needed methods for bioinformatics analysis, such as handling sequences and sequence alignments. The Bio.Phylo module of Biopython consists of methods specific to phylogenetic analyses.

The tutorial will begin with an overview of reading and writing phylogenetic tree files as well as a review of the methods for visualising and producing publication-quality trees. Next, methods will be presented for programmatically traversing, exploring and modifying a tree. Finally, a brief overview will be given of the available interfaces to external programs for generating phylogenetic trees, such as PhyML, which facilitate the production of pipelines.
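As a rough illustration of the operations mentioned above, the following sketch uses Bio.Phylo to read a tree, list its tips, measure a distance along the branches, reroot and prune. It assumes Biopython is installed. The Newick string, the taxon names and the helper functions load_tree and tip_names are invented for this example and are not taken from the tutorial material.

```python
from io import StringIO

from Bio import Phylo

# Toy Newick tree; the taxon names and branch lengths are made up.
NEWICK = "((A:0.1,B:0.2):0.3,(C:0.4,D:0.5):0.6);"


def load_tree(newick=NEWICK):
    """Parse a Newick string into a Bio.Phylo tree object."""
    return Phylo.read(StringIO(newick), "newick")


def tip_names(tree):
    """Return the names of all tips (terminal clades)."""
    return [tip.name for tip in tree.get_terminals()]


tree = load_tree()
print(tip_names(tree))

# Quick text rendering of the tree, handy when working in a terminal.
Phylo.draw_ascii(tree)

# Sum of branch lengths on the path between two tips.
print(round(tree.distance("A", "C"), 3))

# Reroot on a chosen outgroup, then remove one tip.
tree.root_with_outgroup("D")
tree.prune("B")
print(sorted(tip_names(tree)))
```

The same calls can be placed in a loop over many tree files, which is the main reason to prefer a scripting library over a GUI when hundreds of trees are involved.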

Lastly, special attention will be given to the Bio.Phylo interface to the PAML software package (>5,000 citations, Ziheng Yang, UCL), which includes the widely used programs CodeML and BaseML. These programs are typically used to estimate the rates of fixation of non-synonymous (dN) and synonymous (dS) substitutions. Their ratio (dN/dS) is commonly used as an estimate of the evolutionary forces acting on a protein: a dN/dS ratio less than one indicates purifying selection, a ratio of one indicates neutral evolution and a ratio greater than one points to positive (adaptive) selection. Furthermore, CodeML can perform several statistical tests of positive selection at specific codons and on specific phylogenetic branches. While these programs are notoriously difficult to include reliably in an analysis pipeline, the Bio.Phylo.PAML sub-module simplifies the dynamic generation of control files and the parsing of results files. This part of the tutorial will begin with a basic theoretical overview of the methods implemented by the PAML programs, focusing on CodeML. Next, an introduction to the programs' basic usage will be presented. Finally, the functionality of the Bio.Phylo.PAML sub-module will be explained.
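The interpretation of the dN/dS ratio described above can be written as a small Python sketch. The function names, the example values and the tolerance band around one are illustrative assumptions and are not part of PAML or Biopython. In a real analysis the selective regime would be judged with the statistical tests provided by CodeML rather than with a fixed cut-off.

```python
def omega(dn, ds):
    """Return the dN/dS ratio, or None when dS is zero (ratio undefined)."""
    if ds == 0:
        return None
    return dn / ds


def selection_regime(w, tolerance=0.05):
    """Label a dN/dS value.

    `tolerance` is an arbitrary illustrative band around 1, since estimated
    ratios are never exactly equal to one.
    """
    if w is None:
        return "undefined"
    if w < 1 - tolerance:
        return "purifying selection"
    if w > 1 + tolerance:
        return "positive selection"
    return "neutral evolution"


# Made-up (dN, dS) estimates for three hypothetical genes.
for gene, (dn, ds) in {"geneA": (0.02, 0.4),
                       "geneB": (0.3, 0.3),
                       "geneC": (0.5, 0.25)}.items():
    print(gene, selection_regime(omega(dn, ds)))
```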

The participants will have the opportunity to build a basic phylogenetic analysis pipeline using Biopython, PhyML and PAML: starting with a set of gene sequence alignments, trees will be generated, modified and analysed in an automated manner.


2) Ancestral sequence reconstruction and homology modelling

Ancestral sequence reconstruction allows the identification of the ancestral character at a particular point in evolution. The inference of ancestral structure allows a better understanding of protein evolution and protein function. Recent studies using ancestral sequence reconstruction have focused on ancient proteins (Groussin et al., Biol Lett. 2013), nuclear receptors (Harms et al., PNAS 2013) or the RuBisCO enzyme (Studer et al., PNAS 2014). A recent review explores this area: “Harms MJ, Thornton JW. Evolutionary biochemistry: revealing the historical and physical causes of protein properties. Nat Rev Genet. 2013 Aug;14(8):559-71”.

Different methods exist; CodeML performs this analysis under maximum likelihood. CodeML provides many informative details in its output, such as the probability that a particular amino acid was present at a given point in evolution. The participants will learn how to read the CodeML output and how to convert it into ancestral sequences, along with the potential problems they could encounter. The reconstructed sequences will then be used as targets for homology modelling, and the structures will be visualised with PyMOL (DeLano, 2002).
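The step from per-site probabilities to an ancestral sequence can be sketched as follows. This is a minimal illustration, assuming the posterior probabilities for one node have already been extracted into a list of dictionaries. It does not parse real CodeML output, and the function name, the data layout and the 0.8 threshold are hypothetical choices made for this example.

```python
def ancestral_sequence(site_posteriors, min_prob=0.8):
    """Pick the most probable residue at each site.

    site_posteriors: list of dicts mapping residue -> posterior probability.
    Returns (sequence, ambiguous_sites), where ambiguous_sites holds the
    1-based positions whose best posterior is below min_prob.
    """
    residues = []
    ambiguous = []
    for position, probs in enumerate(site_posteriors, start=1):
        best = max(probs, key=probs.get)
        residues.append(best)
        if probs[best] < min_prob:
            ambiguous.append(position)
    return "".join(residues), ambiguous


# Made-up posteriors for a three-residue ancestor at a single node.
example_node = [
    {"M": 0.99, "L": 0.01},
    {"K": 0.55, "R": 0.45},
    {"V": 0.90, "I": 0.10},
]

sequence, uncertain_sites = ancestral_sequence(example_node)
print(sequence, uncertain_sites)
```

Keeping track of poorly supported sites is useful because such positions are one of the potential problems when the reconstructed sequence is later used for homology modelling.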


3) Studying molecular co-evolution

The detection of molecular co-evolution remains one of the most challenging applications of phylogenetic trees beyond phylogeny reconstruction (Juan et al., Nat Rev Genet 2013). The hypothesis that functionally related molecules share similar phylogenetic trees has been extensively studied at the protein-protein and protein-DNA levels (Kuo, Genome Res 2010), generating a plethora of different methodologies. The most popular approach, Mirrortree, measures the similarity of a pair of phylogenetic trees by calculating the Pearson correlation between the cophenetic distances of the corresponding ortholog sequences (Pazos and Valencia, Protein Eng. 2001). This naïve approach gave birth to an area of research that has improved our understanding of molecular co-evolution over the last decade.
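The core idea of the Mirrortree approach, a Pearson correlation between the inter-species distances of two protein families, can be sketched in plain Python. The species names, distance values and function names below are invented for illustration, and the actual Mirrortree Server may differ in how distances are derived and processed.

```python
from itertools import combinations
from math import sqrt


def make_matrix(species, values):
    """Build a symmetric distance lookup keyed by unordered species pairs."""
    return {frozenset(pair): value
            for pair, value in zip(combinations(species, 2), values)}


def pairwise_distances(matrix, species):
    """Flatten a distance lookup into a vector with a fixed pair order."""
    return [matrix[frozenset(pair)] for pair in combinations(species, 2)]


def pearson(x, y):
    """Pearson correlation coefficient between two equal-length vectors."""
    n = len(x)
    mean_x, mean_y = sum(x) / n, sum(y) / n
    cov = sum((a - mean_x) * (b - mean_y) for a, b in zip(x, y))
    var_x = sum((a - mean_x) ** 2 for a in x)
    var_y = sum((b - mean_y) ** 2 for b in y)
    if var_x == 0 or var_y == 0:
        raise ValueError("correlation is undefined for constant distances")
    return cov / sqrt(var_x * var_y)


def mirrortree_score(dist_a, dist_b, species):
    """Correlate the distances of two families over the same species set."""
    return pearson(pairwise_distances(dist_a, species),
                   pairwise_distances(dist_b, species))


# Made-up distances between orthologs of two families in four species.
SPECIES = ["sp1", "sp2", "sp3", "sp4"]
family_a = make_matrix(SPECIES, [0.10, 0.40, 0.50, 0.40, 0.50, 0.20])
family_b = make_matrix(SPECIES, [0.15, 0.70, 0.95, 0.85, 0.90, 0.30])

print(round(mirrortree_score(family_a, family_b, SPECIES), 3))
```

Both families must be reduced to the same set of species before the correlation is computed, which is why the choice of orthologs and taxonomic sampling matters for this type of analysis.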

In this part of the workshop, the participants will be guided through the study of protein co-evolution using the Mirrortree Server (http://csbg.cnb.csic.es/mtserver/) (Ochoa et al., Bioinformatics 2010). The attendees will familiarise themselves with the generation of appropriate phylogenetic trees given the particular characteristics of this type of analysis, as well as the difficulties of correct taxonomic sampling (Herman, Ochoa, et al., BMC Bioinformatics 2011). The attendees will be introduced to two options: supplying their own high-quality phylogenetic trees for highly detailed analysis, or using the automatic pipeline to quickly generate phylogenetic trees. By using the web interface, the participants will be able to compare a pair of phylogenetic trees, improve their visualisation, select sub-regions for further analysis or export their selection in a completely interactive way. All these features will allow the users to perform robust co-evolution analysis on different pairs of phylogenetic trees, as well as to detect evolutionary events such as horizontal gene transfer.

The last part of the tutorial will explore the limitations of studying individual pairs of proteins and will dig into the advantages of using context-based approaches for detecting specific signals of co-evolution (Juan et al., PNAS 2008). Finally, the attendees will explore some of the protein interaction networks produced by these genome-wide approaches, in order to understand the final applicability of this type of study.

Level: Introductory

Schedule (Sunday, September 7th):

Morning session: Performing phylogenetic analyses with Biopython.

 9:00 Talk (45min): Performing phylogenetic analyses with Biopython (B. Invergo, EBI)

 9:45 Practical (30min): Performing phylogenetic analyses with Biopython (B. Invergo, EBI)

10:15 Coffee break

10:45 Practical (1h30): Performing phylogenetic analyses with Biopython (B. Invergo, EBI)

12:15 Lunch


Afternoon session: Ancestral sequence reconstruction and molecular co-evolution.

13:15 Talk (30min): Ancestral sequence reconstruction (R. Studer, EBI)

13:45 Practical (1h30): Ancestral sequence reconstruction (R. Studer, EBI)

15:15 Coffee break

15:45 Talk (30min): Studying molecular co-evolution (D. Ochoa, EBI)

16:15 Practical (45min): Studying molecular co-evolution (D. Ochoa, EBI)

17:00 End


Intended audience:

- Evolutionary biologists, biochemists, computational biologists, structural biologists.


Possible prerequisites:

- Unix command line.

- We strongly encourage participants to learn the basics of the Python programming language.


CV of the organisers:

Brandon Invergo is a post-doctoral fellow at the European Bioinformatics Institute (EMBL-EBI) and the Sanger Institute in the laboratories of Drs. Pedro Beltrao (EBI), Oliver Billker (Sanger), Julian Rayner (Sanger) and Jyoti Choudhary (Sanger). He is studying the structure and evolution of post-translational modification networks in Plasmodium species, the parasites that cause malaria. Brandon obtained his PhD in the laboratory of Prof. Jaume Bertranpetit at the Institute of Evolutionary Biology (Pompeu Fabra University - CSIC, Barcelona, Spain). His thesis focused on the molecular evolution of the proteins comprising the mammalian visual phototransduction system and the influence of the structure and dynamics of the system on the action of natural selection. He specialised in the analysis of protein evolution using CodeML and in systems biology techniques such as network analysis and dynamic system modelling. During the course of his research, he developed a Python library for automating phylogenetic analysis with CodeML, which he later expanded and contributed to the Biopython project to become Bio.Phylo.PAML. He has previously taught tutorial courses as an assistant lecturer on introductory programming for biologists and the use of R for statistical analysis (MSc level), and he has been an invited lecturer for an introductory course on systems biology (BSc level).


David Ochoa is a post-doctoral fellow in the laboratory of Pedro Beltrao at the European Bioinformatics Institute. His current research focuses on the functional and evolutionary aspects of the human phosphoproteome. He obtained his PhD in the laboratory of Florencio Pazos (CNB, Madrid), where his research focused on improving mirrortree-based approaches for detecting protein interactions. Besides developing the Mirrortree Server, he participated in two other co-evolution studies in collaboration with Alfonso Valencia’s group (CNIO, Madrid). In the first, he analysed the effect of incorporating predicted solvent accessibility into the co-evolution-based prediction of protein interactions. In the second, he studied the optimal taxonomic sampling required to predict different types of interactions. Moreover, during the last 3 years, he has participated as assistant professor in the Master of Bioinformatics and Computational Biology organised by the Universidad Complutense de Madrid and, more recently, by the Instituto de Salud Carlos III.


Romain Studer is a senior post-doc research scientist in the laboratory of Dr. Pedro Beltrao, EMBL-EBI. He obtained his PhD in the laboratory of Prof. Marc Robinson-Rechavi (Lausanne, Switzerland) and worked as a post-doc with Prof. Christine Orengo (University College London). His PhD work in evolutionary bioinformatics focused on protein evolution in the context of whole-genome duplication. He specialised in various methods to identify adaptation at the genetic level, such as CodeML/PAML. Using CodeML and other software, he studied the prevalence of positive selection in vertebrates and the characteristics of shifts in evolutionary rate in animal proteins. Based on the CodeML pipeline he built for his own research, he participated in the development of the web resource Selectome, a database of positive selection (http://selectome.unil.ch/). He participated in two studies revealing the relationship between structure and evolution, one in a nuclear receptor in insects and one in an MHC family in birds. At University College London, using ancestral sequence reconstruction and homology modelling, he studied the evolution and adaptation of RuBisCO in plants. Romain was also a teaching assistant for the bioinformatics practicals (BSc/MSc levels) at the University of Lausanne and was involved in the EMBnet phylogeny course. He also taught a previous tutorial at ECCB’12, entitled “Protein Evolution: From Sequence to Structure to Function”. He now continues to work on protein evolution with respect to 3D structures and post-translational modifications.
