We aspire to better understand gene evolution and function, using statistical and computational methods. The key questions underlying our research are:
- How can we best extrapolate our current knowledge of molecular biology, concentrated in just a handful of model organisms, to the rest of life?
- Conversely, how can we exploit the wealth and diversity of life to better understand human biology and disease?
- Can we meaningfully summarise the evolutionary history of species into a small number of tree topologies that capture both the vertical inheritance and most important events of non-vertical inheritance?
We tackle these problems by developing statistical and computational methods and applying them to large-scale genomic data. This process combines biological aspects in the early stages (e.g. problem statement, identifying relevant empirical observations, determining dependable benchmarks and controls), statistical, algorithmic, and computational aspects in the middle (e.g. model formulation, programming, scaling up), and biological aspects again at the end in the interpretation of the results.
Representative ongoing projects:
- Orthology inference and applications
- Reconstructing the tree of life: large-scale tree concordance analysis
- Critical assessment and verification in molecular evolution
- Quality of computationally inferred gene function annotations
Orthology inference and applications
Orthologs are genes in different organisms that descended from the same ancestral gene in their last common ancestor. Accurate and comprehensive identification of these "same genes in different species" is a prerequisite for numerous biological studies, medical research and pharmaceutical applications.
Our main activities in this space are around OMA (Orthologous MAtrix), an effort to identify orthologous genes among publicly available genomes. With currently over 1700 genomes analysed, OMA is among the largest databases of orthologs. Its web interface is consulted hundreds of times a day, and is linked from several leading sequence databases, including UniProtKB, WormBase, and HGNC.
We are also interested in applications of orthology. For instance, to better characterise the relationship between gene sequence evolution and gene function, we tested the validity of the "ortholog conjecture", the notion that orthologs tend to be functionally more conserved than paralogs. The story behind this work is told in a guest post on Jonathan Eisen's blog.
Funding: ETH Zurich Research Grant (2011-2013), SIB Infrastructure Funds (2012-2016), Bayer CropScience (2013-2016)
Key collaborators: Gaston Gonnet (ETH Zurich), Max Telford (UCL), Marc Robinson-Rechavi (U of Lausanne), Maria Martin (EMBL-EBI), Henning Redestig (Bayer CropScience)
Further info: special issue on Orthology in Briefings in Bioinformatics.
Reconstructing the tree of life: large-scale tree concordance analysis
Since Darwin, reconstructing the tree of life has been a major pursuit of biology. High-throughput genome sequencing is providing us with an abundance of molecular data, but we still struggle to resolve the deep phylogenies. Under current methods, adding more characters does not always improve phylogenetic resolution; indeed, typical tree reconstruction efforts involve only a tiny fraction of all genes. In this project, we develop a phylogenetic tree building method that is sufficiently efficient to take into account most genes of each species, and that can handle a mixture of evolutionary histories. Using the majority of genes from a thousand genomes, we seek to infer the number of different trees that best capture the evolutionary history of species, to reconstruct these histories, and to visualise them in an insightful way.
Funding: SNSF Fellowship for Advanced Researchers (2011-2013), CoMPLEX/EPSRC studentship (2014-2017), SNSF Professorship (2015-2019)
Key collaborators: Nick Goldman (EMBL-EBI), Manuel Gil (Uni Zurich), Max Telford (UCL)
Critical assessment and verification in molecular evolution
Sequence alignment and phylogenetic tree reconstruction methods are among the most important contributions of bioinformatics to the life sciences. Both methods infer past events from current data, be it common ancestry among characters for alignments, or evolutionary relations among sequences for tree builders. Because these past events are inherently unknown, validating and comparing the methods (and their underlying models) is notoriously difficult. Validation on real data is often limited to anecdotal evidence. In better cases, it consists of a goodness-of-fit measure (e.g. AIC). Even then, these measures rest on (implicit) assumptions, which themselves would need to be tested.
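As a reminder of what such a goodness-of-fit measure looks like, here is a minimal sketch of the Akaike information criterion, AIC = 2k - 2 ln L, where k is the number of free parameters and ln L the maximised log-likelihood. The function name and the numbers are invented purely for illustration; they are not results from our work.

```python
def aic(log_likelihood: float, n_params: int) -> float:
    """Akaike information criterion: 2k - 2 ln L (lower is better)."""
    return 2 * n_params - 2 * log_likelihood


# Two hypothetical models fitted to the same data (made-up values).
simple_model = aic(log_likelihood=-1000.0, n_params=5)
richer_model = aic(log_likelihood=-995.0, n_params=9)

# The richer model fits better, and here the gain outweighs the
# penalty for its four extra parameters.
preferred = "richer" if richer_model < simple_model else "simple"
```

Note that the score only ranks the candidate models against each other: it says nothing about whether any of them is adequate, which is the limitation raised above.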
Recently, we have introduced real data tests for orthology inference and for sequence alignment. Our tests solve the validation problem indirectly—by assessing the compatibility of a method’s results with general, well-accepted principles or models. Methods that produce more compatible results are to be preferred. For instance, the "minimum duplication test" ranks alignment methods by assuming only that genes evolve along trees, and that the principle of parsimony applies to gene duplication events.
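The parsimony idea behind such a test can be sketched as follows. This is a minimal illustration, not our actual implementation: it assumes binary trees written as nested tuples, a known species tree, and standard last-common-ancestor (LCA) reconciliation to count the duplications implied by a gene tree. All names (`count_duplications`, `SPECIES_TREE`, the gene labels) are hypothetical.

```python
def index_species_tree(species_tree):
    """Record the parent and depth of every node of the species tree."""
    parent, depth = {}, {}

    def visit(node, par, d):
        parent[node], depth[node] = par, d
        if isinstance(node, tuple):
            for child in node:
                visit(child, node, d + 1)

    visit(species_tree, None, 0)
    return parent, depth


def lca(a, b, parent, depth):
    """Last common ancestor of two species-tree nodes."""
    while depth[a] > depth[b]:
        a = parent[a]
    while depth[b] > depth[a]:
        b = parent[b]
    while a != b:
        a, b = parent[a], parent[b]
    return a


def count_duplications(gene_tree, species_tree, species_of):
    """Number of duplication nodes under LCA reconciliation."""
    parent, depth = index_species_tree(species_tree)
    duplications = 0

    def reconcile(node):
        nonlocal duplications
        if isinstance(node, str):          # leaf: a gene
            return species_of[node]
        left, right = (reconcile(child) for child in node)
        here = lca(left, right, parent, depth)
        # A node mapping to the same species-tree node as one of its
        # children cannot be a speciation, so it is a duplication.
        if here == left or here == right:
            duplications += 1
        return here

    reconcile(gene_tree)
    return duplications


SPECIES_TREE = (("human", "mouse"), "fly")
SPECIES_OF = {"h1": "human", "h2": "human",
              "m1": "mouse", "m2": "mouse", "f1": "fly"}

# Gene trees as they might be inferred from two competing alignments.
tree_a = ((("h1", "m1"), ("h2", "m2")), "f1")
tree_b = ((("h1", "f1"), "m1"), ("h2", "m2"))

dups_a = count_duplications(tree_a, SPECIES_TREE, SPECIES_OF)
dups_b = count_duplications(tree_b, SPECIES_TREE, SPECIES_OF)
```

In this toy case `tree_a` implies fewer duplications than `tree_b`, so the alignment that produced it would rank higher under the parsimony assumption.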
Key collaborators: Manuel Gil (Uni Zurich), Ge Tan (MRC Clinical Science Centre), Kazutaka Katoh (CBRC, Japan)
Quality of computationally inferred gene function annotations
Gene Ontology (GO) annotations are a powerful way of capturing the functional information assigned to gene products. In the Gene Ontology Annotation database, the largest repository of functional annotations, over 98% of all function annotations are inferred in silico, without curator oversight. Yet these "electronic GO annotations" are generally perceived as unreliable and routinely excluded from analyses. At the same time, we crucially depend on those automated annotations, as most newly sequenced genomes come from non-model organisms. The key questions we pursue are:
- How can we systematically and quantitatively assess the reliability of electronic GO annotations?
- Which current inference strategy yields the best predictions?
- In particular, how do evolution-based strategies compare with profile-based strategies?