We aspire to better understand gene evolution and function, using statistical and computational methods. The key questions underlying our research are:

We tackle these problems by developing statistical and computational methods and applying them to large-scale genomic data. This process combines biological aspects in the early stages (e.g. problem statement, identifying relevant empirical observations, determining dependable benchmarks and controls), statistical, algorithmic, and computational aspects in the middle (e.g. model formulation, programming, scaling up), and biological aspects again at the end in the interpretation of the results.

Representative ongoing projects:

Orthology inference and applications

Orthologs are genes in different organisms that descended from the same ancestral gene in their last common ancestor. Accurate and comprehensive identification of these "same genes in different species" is a prerequisite for numerous biological studies, medical research and pharmaceutical applications.

Correspondence between reconciled tree and orthology
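
To make this correspondence concrete, here is a minimal sketch, assuming a gene tree whose internal nodes have already been labelled as speciation or duplication events. Under that assumption, two genes are orthologs when their last common ancestor is a speciation node, and paralogs when it is a duplication node. The `Node` class, the gene names and the `classify_pairs` function are hypothetical and chosen for illustration only; this is not the OMA algorithm.

```python
from dataclasses import dataclass, field
from itertools import combinations, product


@dataclass
class Node:
    """Gene tree node: internal nodes carry an event, leaves carry a gene name."""
    event: str = ""   # "speciation" or "duplication" (internal nodes only)
    gene: str = ""    # hypothetical gene label, e.g. "human_A" (leaves only)
    children: list = field(default_factory=list)


def leaves(node):
    """Return the gene names found below a node."""
    if not node.children:
        return [node.gene]
    return [g for child in node.children for g in leaves(child)]


def classify_pairs(node, orthologs=None, paralogs=None):
    """Classify every gene pair by the event at its last common ancestor."""
    if orthologs is None:
        orthologs, paralogs = set(), set()
    if node.children:
        target = orthologs if node.event == "speciation" else paralogs
        # Pairs drawn from two different child subtrees have this node as LCA.
        for left, right in combinations(node.children, 2):
            for a, b in product(leaves(left), leaves(right)):
                target.add(frozenset((a, b)))
        for child in node.children:
            classify_pairs(child, orthologs, paralogs)
    return orthologs, paralogs


# Toy tree: an ancient duplication yields copies A and B, and each copy
# is later split by the human/mouse speciation.
root = Node(event="duplication", children=[
    Node(event="speciation",
         children=[Node(gene="human_A"), Node(gene="mouse_A")]),
    Node(event="speciation",
         children=[Node(gene="human_B"), Node(gene="mouse_B")]),
])

orthologs, paralogs = classify_pairs(root)
for pair in sorted(sorted(p) for p in orthologs):
    print("orthologs:", pair)
for pair in sorted(sorted(p) for p in paralogs):
    print("paralogs: ", pair)
```

In this toy tree, human_A and mouse_A are orthologs, whereas human_A and mouse_B are paralogs even though they come from different species, because they trace back to the duplication at the root.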

Our main activities in this space are around OMA (Orthologous MAtrix), an effort to identify orthologous genes among publicly available genomes. With currently over 1700 genomes analysed, OMA is among the largest databases of orthologs. Its web interface is consulted hundreds of times a day, and is linked from several leading sequence databases, including UniProtKB, WormBase, and HGNC.

We are also interested in applications of orthology. For instance, to better characterise the relationship between gene sequence evolution and gene function, we tested the validity of the "ortholog conjecture", the notion that orthologs tend to be functionally more conserved than paralogs. The story behind this work is told in a guest post on Jonathan Eisen's blog.

Publications: relevant papers
Funding: ETH Zurich Research Grant (2011-2013), SIB Infrastructure Funds (2012-2016), Bayer CropScience (2013-2016)
Key collaborators: Gaston Gonnet (ETH Zurich), Max Telford (UCL), Marc Robinson-Rechavi (U of Lausanne), Maria Martin (EMBL-EBI), Henning Redestig (Bayer CropScience)
Further info: special issue on Orthology in Briefings in Bioinformatics.

Reconstructing the tree of life: large-scale tree concordance analysis

Part of tree of life

Since Darwin, reconstructing the tree of life has been a major pursuit of biology. High-throughput genome sequencing is providing us with an abundance of molecular data, but we still struggle to resolve deep phylogenies. Under current methods, adding more characters does not always improve phylogenetic resolution; indeed, typical tree reconstruction efforts involve only a tiny fraction of all genes. In this project, we develop a phylogenetic tree building method that is sufficiently efficient to take into account most genes of each species, and that can handle a mixture of evolutionary histories. Using the majority of genes from a thousand genomes, we seek to infer the number of different trees that best capture the evolutionary history of species, to reconstruct these histories, and to visualise them in an insightful way.

Publications: relevant papers
Funding: SNSF Fellowship for Advanced Researchers (2011-2013), CoMPLEX/EPSRC studentship (2014-2017), SNSF Professorship (2015-2019)
Key collaborators: Nick Goldman (EMBL-EBI), Manuel Gil (Uni Zurich), Max Telford (UCL)

Critical assessment and verification in molecular evolution

Filtering alignment

Sequence alignment and phylogenetic tree reconstruction methods are among the most important contributions of bioinformatics to the life sciences. Both methods infer past events from current data, be it common ancestry among characters for alignments, or evolutionary relations among sequences for tree builders. Because these past events are inherently unknown, validating and comparing the methods (and their underlying models) is notoriously difficult. Real data validation is often limited to anecdotal evidence. In better cases, it consists of some goodness-of-fit measure (e.g. AIC). Even then, these measures are based on (implicit) assumptions, which themselves would need to be tested.

Recently, we have introduced real data tests for orthology inference and for sequence alignment. Our tests solve the validation problem indirectly, by assessing the compatibility of a method’s results with general, well-accepted principles or models. Methods that produce more compatible results are to be preferred. For instance, the "minimum duplication test" ranks alignment methods by assuming only that genes evolve along trees, and that the principle of parsimony applies to gene duplication events.
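
The idea behind the minimum duplication test can be illustrated with a rough sketch, under stated assumptions rather than as the published procedure. Suppose each alignment method yields a gene tree, and that a trusted species tree is available. The sketch counts the gene duplications each gene tree implies and prefers the method with the fewest. Duplications are counted here with a simple last-common-ancestor reconciliation, which is a choice made for this example; the trees, the method names and the `species_gene` naming convention for leaves are all hypothetical.

```python
def species_clades(species_tree):
    """List the set of species below every node of a nested-tuple species tree."""
    clades = []

    def walk(node):
        if isinstance(node, str):
            clade = frozenset([node])
        else:
            clade = frozenset().union(*(walk(child) for child in node))
        clades.append(clade)
        return clade

    walk(species_tree)
    return clades


def lca(clades, species):
    """Smallest species-tree clade containing all the given species."""
    return min((c for c in clades if species <= c), key=len)


def count_duplications(gene_tree, clades):
    """Count gene-tree nodes that map to the same species-tree node as a child."""
    count = 0

    def walk(node):
        nonlocal count
        if isinstance(node, str):
            # Assumed leaf naming: "<species>_<gene id>", e.g. "human_1".
            return lca(clades, frozenset([node.split("_")[0]]))
        child_maps = [walk(child) for child in node]
        here = lca(clades, frozenset().union(*child_maps))
        if here in child_maps:
            count += 1  # only a duplication can explain this node
        return here

    walk(gene_tree)
    return count


# Hypothetical species tree and gene trees derived from two alignment methods.
species_tree = (("human", "mouse"), "chicken")
clades = species_clades(species_tree)

gene_trees = {
    "method_A": (("human_1", "mouse_1"), "chicken_1"),
    "method_B": (("human_1", "chicken_1"), "mouse_1"),
}

# Fewer implied duplications means a more parsimonious result.
ranking = sorted(
    ((name, count_duplications(tree, clades)) for name, tree in gene_trees.items()),
    key=lambda item: item[1],
)
for name, n_dup in ranking:
    print(f"{name}: {n_dup} implied duplication(s)")
```

A real comparison would aggregate such counts over many gene families and test whether the difference between methods is significant; this sketch only shows the counting and ranking step on a single toy family.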

Publications: relevant papers
Key collaborators: Manuel Gil (Uni Zurich), Ge Tan (MRC Clinical Science Centre), Kazutaka Katoh (CBRC, Japan)

Quality of computationally inferred gene function annotations

Gene Ontology (GO) annotations are a powerful way of capturing the functional information assigned to gene products. In the Gene Ontology Annotation database, the largest repository of functional annotations, over 98% of all function annotations are inferred in silico, without curator oversight. Yet these "electronic GO annotations" are generally perceived as unreliable and routinely excluded from analyses. At the same time, we crucially depend on those automated annotations, as most newly sequenced genomes are from non-model organisms. The key questions we pursue are:

We have written the story behind some of this work in a blog post.

Publications: relevant papers
Funding: BBSRC TRDF (2014-2015), BBSRC industrial Case studentship (2015-2019)
Key collaborators: Laurent Gatto (Cambridge University), Henning Redestig (Bayer CropScience)
Last modified on November 19th, 2016.