Student Projects 2016-2017

On this page, I briefly describe potential student projects. Depending on circumstances, these projects may constitute a bachelor's or master's thesis, a summer internship, or an honours project. Note that this list is not exhaustive; I am happy to discuss further project opportunities in person.

What you bring:

What we offer:

Interviews of previous lab interns are available here.

Projects take place at UNIL or UCL. If you are interested, or have any questions, please send an email to or

Available projects

Detecting whole-genome duplications through comparative genomics biology coding

Whole-genome duplications (WGDs) are events that increase the ploidy (the number of genome copies) of an organism. Several WGDs have already been detected within eukaryotes, for example in plants (monocots/dicots), vertebrates (teleost fish) and yeast, and they contribute substantially to morphological and physiological adaptation [1,2]. The resulting duplicated genes, called "ohnologs", can bring a selective advantage (e.g. through neo-functionalization). Detecting WGDs is a non-trivial task, but it can help uncover underlying evolutionary histories and explain how organisms were shaped into their extant forms.

This project aims to develop and implement a method to detect WGDs within a set of organisms. The idea is to reconstruct the content of ancestral genomes (the ancestral genes and their extant descendants) for a set of organisms and to look for abnormal changes in the number of ancestral genes across taxonomic ranges. To reconstruct ancestral genes, the project will rely on Hierarchical Orthologous Groups (HOGs) [3], an orthology framework for inferring gene families (sets of genes descended from a single ancestral gene at a specific taxonomic range) in a comparative genomics setting.

The computational project will consist of three parts:

  1. WGD detection: The first part of the project aims to use HOGs to build a map between taxonomic ranges and the composition of the corresponding ancestral genomes for a given set of genomes. Once the composition of the ancestral genomes has been reconstructed, you will be in charge of designing a method to detect uneven patterns in the evolution of the gene content of ancestral genomes along the species tree.
  2. Benchmark: The second part of the project is to assess the quality of your results using a benchmark (that you will have to implement). You will verify your predictions against known cases of WGD taken from the literature and check whether they match the expected results. For this purpose, you can use, for example, the ancient WGD of S. cerevisiae described by K. Wolfe [1].
  3. Case studies: The last part of the project will consist of running your pipeline on large genomic datasets in order to detect unknown whole-genome duplications, and of adding new ideas to the pipeline, such as the use of synteny.
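As a rough illustration of the detection step, the sketch below assumes that for each branch of the species tree we already know the number of ancestral genes at the parent level and how many of them were duplicated along the branch. These inputs, the function names, the toy numbers and the outlier threshold are all hypothetical; the actual method remains to be designed as part of the project.

```python
from statistics import median


def duplication_fractions(ancestral_genes, duplicated_genes):
    """Fraction of ancestral genes that were duplicated on each branch."""
    return {
        branch: duplicated_genes[branch] / ancestral_genes[branch]
        for branch in ancestral_genes
        if ancestral_genes[branch] > 0
    }


def flag_candidate_wgd(fractions, threshold=3.5):
    """Return branches whose duplication fraction is a robust upper outlier.

    Uses a modified z-score based on the median and the median absolute
    deviation (MAD); 3.5 is a common rule-of-thumb cutoff, not a tuned value.
    """
    values = list(fractions.values())
    med = median(values)
    mad = median(abs(v - med) for v in values)
    if mad == 0:
        return []  # no spread in the data, nothing can be called an outlier
    return sorted(
        branch
        for branch, value in fractions.items()
        if 0.6745 * (value - med) / mad > threshold
    )


# Toy numbers, invented for illustration only.
ancestral = {"A": 5000, "B": 5200, "C": 5100, "D": 4900, "E": 5000}
duplicated = {"A": 150, "B": 208, "C": 153, "D": 1470, "E": 200}

candidates = flag_candidate_wgd(duplication_fractions(ancestral, duplicated))
print(candidates)
```

In this toy input, branch "D" duplicates about 30% of its ancestral genes while the other branches stay at 3-4%, so it is the only branch flagged. A real analysis would also have to account for gene loss after duplication and for differences in branch length.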

The project will be tailored to the abilities and interests of the student.

Whole-genome duplication is a common occurrence in many lineages, including vertebrates, fungi, and plants. Furthermore, it is also known to happen commonly in cancer evolution. Successful advances will be incorporated into the OMA database, which is used by thousands of researchers worldwide, and might be published in a scientific paper.

The student will gain better knowledge of programming (Python), comparative genomics analysis, benchmarking, and pipeline design.

[1] Wolfe, K.H., 2015. Origin of the Yeast Whole-Genome Duplication. PLoS biology, 13(8), p.e1002221.
[2] Marcet-Houben, M. & Gabaldón, T., 2015. Beyond the Whole-Genome Duplication: Phylogenetic Evidence for an Ancient Interspecies Hybridization in the Baker's Yeast Lineage. PLoS biology, 13(8), p.e1002220.
[3] Altenhoff, A.M. et al., 2013. Inferring hierarchical orthologous groups from orthologous gene pairs. PloS one, 8(1), p.e53786.
[4] Byrne, K.P. & Wolfe, K.H., 2005. The Yeast Gene Order Browser: combining curated homology and syntenic context reveals gene fate in polyploid species. Genome research, 15(10), pp.1456-1461.

Statistical distribution of distances between phylogenetic trees coding maths

Many comparisons between evolutionary scenarios rely on measuring how similar two phylogenetic trees are: for instance, whether distinct genes tell the same evolutionary history and, if not, by how much they disagree. There are several distinct measures, and the choice of distance reflects the cause of the disagreement. For instance, we can compare two trees in terms of the number of clades common to both, or in terms of the difference in speciation times.

The objective is to characterize a given tree distance in statistical terms and describe its relation to tree size, applying it to a real-case scenario. For instance, for a given observed distance between two trees, we can calculate its statistical significance by comparing it to its null distribution. We can then apply this statistical test to an empirical collection of trees from a phylogenomic analysis, in order to quantify the phylogenetic signal and detect outliers.

Plan: (1) Choose an appropriate distance and implement it, or use an existing implementation. (2) Study the statistical properties of the chosen distance using a generator of random trees; here, measures such as the mode and maximum values, as well as the whole histogram, are recorded as a function of the number of leaves. (3) Describe the behaviour of the distance in the presence of missing data. (4) Consider the feasibility of designing a quasi-random tree generator for the given distance. (5) Given a large collection of trees from a genomic study, calculate the distance between all tree pairs and compare these distances to the expected values.
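Steps (1) and (2) of the plan can be sketched as follows, taking the rooted Robinson-Foulds distance (the number of clades found in only one of the two trees) as an example of a clade-based distance. The choice of distance, the nested-tuple tree representation, the function names and the random-joining tree generator are all assumptions made for illustration. In particular, this generator does not sample topologies uniformly, and choosing the null model is part of the project.

```python
import random


def clades(tree):
    """Set of non-trivial clades (frozensets of leaf names) of a rooted tree.

    A tree is either a leaf name (str) or a tuple of subtrees.
    """
    found = set()

    def leaves(node):
        if isinstance(node, str):
            return frozenset([node])
        members = frozenset().union(*(leaves(child) for child in node))
        found.add(members)
        return members

    all_leaves = leaves(tree)
    found.discard(all_leaves)  # the root clade is shared by all trees
    return found


def rf_distance(tree_a, tree_b):
    """Rooted Robinson-Foulds distance: clades present in exactly one tree."""
    return len(clades(tree_a) ^ clades(tree_b))


def random_tree(leaf_names, rng):
    """Random rooted binary tree, built by repeatedly joining two subtrees."""
    nodes = list(leaf_names)
    while len(nodes) > 1:
        left = nodes.pop(rng.randrange(len(nodes)))
        right = nodes.pop(rng.randrange(len(nodes)))
        nodes.append((left, right))
    return nodes[0]


def null_distances(n_leaves, n_samples, seed=0):
    """Distances between pairs of independent random trees on the same leaves."""
    rng = random.Random(seed)
    names = [f"t{i}" for i in range(n_leaves)]
    return [
        rf_distance(random_tree(names, rng), random_tree(names, rng))
        for _ in range(n_samples)
    ]


def p_value(observed, null):
    """Empirical probability that random trees are at least as similar."""
    return (1 + sum(d <= observed for d in null)) / (1 + len(null))


null = null_distances(n_leaves=10, n_samples=1000)
print("maximum possible distance:", 2 * (10 - 2))
print("p-value of an observed distance of 4:", p_value(4, null))
```

Recording `null_distances` for increasing `n_leaves` gives the histograms, modes and maxima mentioned in step (2).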

Distances are used not only to quantify the disagreement between evolutionary hypotheses, but also to compare the performance of different tree inference models, and to help find tree estimates that best describe all sampled trees in phylogenomic studies [1]. The student will learn statistical and computer science concepts, and will implement them in software of practical utility in phylogenetic inference.

[1] Huggins PM, Li W, Haws D, Friedrich T, Liu J, and Yoshida R. Bayes Estimators for Phylogenetic Reconstruction. Syst Biol (2011) 60 (4): 528-540. doi: 10.1093/sysbio/syr021

Visualisation and comparison of multiple phylogenetic trees coding visualisation

Recent years have witnessed the development of numerous software packages for visualising phylogenetic trees. But the problem is complicated when more than one tree topology has to be considered at the same time. One solution consists in combining the trees into networks (Huson & Bryant 2006). Another is to draw all trees superimposed (Bouckaert 2010). The goal of this project is to consider these and other solutions to visualise forests in an insightful manner, ideally one that fosters interactive exploration of the data.

Some of our ideas and a starting point can be seen in the tool developed in the lab (Robinson et al. 2016).

Bouckaert, RR. DensiTree: making sense of sets of phylogenetic trees. Bioinformatics 2010; 26:1372-1373
Huson, DH, Bryant, D. Application of phylogenetic networks in evolutionary studies. Mol Biol Evol 2006; 23:254-267
Robinson, O, Dylus, D, Dessimoz, C. Phylo.io: interactive viewing and comparison of large phylogenetic trees on the web. Mol Biol Evol 2016; 33:2163-2166
List of tree visualisation software on Wikipedia

OMA domains biology coding

OMA is a well-established database identifying orthologs among complete genomes. Until now, our basic evolutionary unit has been entire genes. This works well to investigate evolutionary forces that act on entire genes (gene duplication, speciation, lateral transfers, etc.), but cannot describe an important source of functional innovation: gene fusions and fissions. For that, we need to study evolution at a finer granularity than genes (down to protein domains, or at the extreme even single base pairs). We have developed a pipeline that infers the domain architecture of all genes in OMA based on HMM profiles from the Pfam database, and that computes orthology between domains. The goal of this project is to devise and evaluate methods/algorithms to infer the evolutionary history of domain-level events across multiple species.
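To make the notion of a domain-level event concrete, here is a deliberately naive sketch that looks for candidate fusions (or, read in the other direction, fissions) between two species. It assumes each gene is described by an ordered list of domain labels; the dictionaries, gene names and domain labels are hypothetical placeholders and are not the output format of the lab's pipeline.

```python
def candidate_fusions(composite_species, split_species):
    """List (composite_gene, left_gene, right_gene) triples.

    A triple is reported when the domain architecture of a gene in
    `composite_species` equals the architecture of `left_gene` followed by
    that of `right_gene`, two distinct genes of `split_species`.
    """
    hits = []
    for gene, architecture in composite_species.items():
        for cut in range(1, len(architecture)):
            left = tuple(architecture[:cut])
            right = tuple(architecture[cut:])
            left_genes = [g for g, a in split_species.items() if tuple(a) == left]
            right_genes = [g for g, a in split_species.items() if tuple(a) == right]
            for left_gene in left_genes:
                for right_gene in right_genes:
                    if left_gene != right_gene:
                        hits.append((gene, left_gene, right_gene))
    return sorted(hits)


# Placeholder architectures, invented for illustration only.
species_x = {"x1": ["domA", "domB", "domC"], "x2": ["domD"]}
species_y = {"y1": ["domA"], "y2": ["domB", "domC"], "y3": ["domC"]}

print(candidate_fusions(species_x, species_y))
```

Matching architectures exactly ignores domain orthology and the species tree, which is why it can only suggest candidates. Placing such events on the tree, and telling a fusion from a fission, is the actual question of the project.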

Altenhoff, AM, Schneider, A, Gonnet, GH, et al. OMA 2011: orthology inference among 1000 complete genomes. Nucl Acids Res 2011; 39:D289-94
Sjölander, K, Datta, RS, Shen, Y, et al. Ortholog identification in the presence of domain architecture rearrangement. Brief Bioinform 2011; 12:413-422

Identifying well conserved but poorly characterised genes biology

Most experimental work in biology is concentrated on familiar genes in a handful of model organisms. As a result, many potentially interesting genes are neglected.

The project aims at identifying well conserved (and thus potentially important) but poorly characterised genes in clades of interest.

We will exploit the OMA resource, which analyses evolutionary relationships among the genes of 1800 species, to identify such genes. The project is entirely computational. It will involve work in a command-line environment and elementary programming/scripting in Python. Prior knowledge is not necessary, but a high degree of motivation and willingness to learn is. The project will be tailored to the abilities and interests of the student and is available either as First-step or as Master project.
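The kind of filter involved can be sketched in a few lines of Python. The sketch assumes that, for each gene family, we know which species have a member and how many experimentally supported annotations the family has. The `GeneFamily` record, its field names and the thresholds are hypothetical; how conservation and characterisation are best measured from OMA data is part of the project.

```python
from dataclasses import dataclass


@dataclass
class GeneFamily:
    family_id: str
    species_with_member: set
    n_experimental_annotations: int


def conserved_but_uncharacterised(families, clade_species,
                                  min_coverage=0.9, max_annotations=0):
    """Families present in most species of a clade but with few annotations.

    Returns (family_id, coverage) pairs, best-covered families first.
    """
    clade = set(clade_species)
    if not clade:
        raise ValueError("the clade must contain at least one species")
    hits = []
    for family in families:
        coverage = len(family.species_with_member & clade) / len(clade)
        if (coverage >= min_coverage
                and family.n_experimental_annotations <= max_annotations):
            hits.append((family.family_id, coverage))
    return sorted(hits, key=lambda hit: (-hit[1], hit[0]))


# Toy records, invented for illustration only.
clade = ["sp1", "sp2", "sp3", "sp4"]
families = [
    GeneFamily("F1", {"sp1", "sp2", "sp3", "sp4"}, 0),   # conserved, unstudied
    GeneFamily("F2", {"sp1"}, 0),                        # not conserved
    GeneFamily("F3", {"sp1", "sp2", "sp3", "sp4"}, 12),  # well characterised
    GeneFamily("F4", {"sp1", "sp2", "sp3"}, 0),          # conserved in 3 of 4
]

print(conserved_but_uncharacterised(families, clade))
print(conserved_but_uncharacterised(families, clade, min_coverage=0.75))
```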

This will help experimentalists to better prioritise their efforts, and to characterise new genes. In particular, the host lab collaborates with scientists interested in improving the efficiency of crop species and in elucidating the biological features that are shared across all animals and thus define them.

Altenhoff,A.M., Škunca,N., Glover,N., Train,C.-M., Sueki,A., Piližota,I., Gori,K., Tomiczek,B., Müller,S., Redestig,H., et al. (2015) The OMA orthology database in 2015: function predictions, better plant support, synteny view and other improvements. Nucleic Acids Res., 43, D240-9.

Emergence of tissue-specific isoforms biology coding

The OMA database developed in our lab identifies evolutionary relationships among genes all across the Tree of Life. Currently, OMA only considers one transcript per gene. In the future, we would like to be able to relate isoforms (alternative transcripts) between genes, so as to pinpoint the emergence of new isoforms in evolution. In particular, identifying the emergence of tissue-specific isoforms would be of very high interest to the broader community.

In this project, the student will perform a proof-of-concept analysis, by focusing on a subset of vertebrate genomes in OMA and publicly-available RNA-seq data. First, the student will reanalyse the genomic data taking into account all transcript data. Second, they will map tissue-by-tissue RNA-seq data onto these transcripts in an attempt to gauge the conservation or divergence of expression levels across orthologous transcripts.
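One simple way to gauge conservation of expression between two orthologous transcripts is to correlate their expression profiles across shared tissues. The sketch below uses a Pearson correlation on log-transformed values. The measure, the dictionary-based input format and the toy expression values are assumptions made for illustration, not a method prescribed by the project.

```python
from math import log2, sqrt


def pearson(x, y):
    """Pearson correlation of two equally long sequences, or None if undefined."""
    n = len(x)
    if n < 2:
        return None
    mean_x, mean_y = sum(x) / n, sum(y) / n
    cov = sum((a - mean_x) * (b - mean_y) for a, b in zip(x, y))
    var_x = sum((a - mean_x) ** 2 for a in x)
    var_y = sum((b - mean_y) ** 2 for b in y)
    if var_x == 0 or var_y == 0:
        return None  # a flat profile has no defined correlation
    return cov / sqrt(var_x * var_y)


def expression_conservation(profile_a, profile_b):
    """Correlate log2(expression + 1) over the tissues measured for both transcripts.

    Each profile maps a tissue name to an expression level.
    """
    tissues = sorted(set(profile_a) & set(profile_b))
    log_a = [log2(profile_a[t] + 1) for t in tissues]
    log_b = [log2(profile_b[t] + 1) for t in tissues]
    return pearson(log_a, log_b)


# Toy expression levels, invented for illustration only.
human_tx = {"brain": 50.0, "liver": 2.0, "heart": 10.0}
mouse_tx = {"brain": 42.0, "liver": 3.0, "heart": 8.0, "testis": 1.0}
zebrafish_tx = {"brain": 1.0, "liver": 40.0, "heart": 5.0}

print(expression_conservation(human_tx, mouse_tx))      # similar tissue pattern
print(expression_conservation(human_tx, zebrafish_tx))  # divergent tissue pattern
```

A transcript expressed almost exclusively in one tissue, whose ortholog is broadly expressed, would show up as a low or negative correlation and could be a candidate for newly gained tissue specificity.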

Clark et al., Discovery of tissue-specific exons using comprehensive human exon microarrays, Genome Biology 2007, 8:R64.

Robustness of orthology inference with respect to database completeness coding

There are many methods to infer orthologs (genes that descend from the same ancestral gene in the last common ancestor of the species in question). Usually, these methods expect complete genomes/proteomes (i.e. one amino-acid sequence for each gene in each genome considered) as input. However, in many circumstances (e.g. RAD-sequencing, metagenomics, transcriptome sequencing, low-quality genomes), the set of input sequences cannot be assumed to be complete. In this project, the student will evaluate how well orthology inference methods fare when they are fed incomplete genomes as input. Because orthology is a prerequisite for many different kinds of analyses, the answer to this relatively simple question will be of high relevance to practitioners in the field.
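The evaluation could follow a simple subsampling protocol: remove a random fraction of genes from each genome, rerun the inference method, and compare the predicted ortholog pairs with a reference set. The sketch below covers only the bookkeeping around the inference step, which is left out. The function names, the pair-based representation and the toy identifiers are assumptions made for illustration.

```python
import random


def subsample_genome(genes, fraction, rng):
    """Keep a random subset of the genes to mimic an incomplete genome."""
    n_kept = round(len(genes) * fraction)
    return set(rng.sample(sorted(genes), n_kept))


def recoverable_pairs(reference_pairs, retained_a, retained_b):
    """Reference ortholog pairs whose two genes both survived subsampling."""
    return {
        (a, b) for a, b in reference_pairs
        if a in retained_a and b in retained_b
    }


def precision_recall(predicted_pairs, reference_pairs):
    """Precision and recall of predicted ortholog pairs against a reference."""
    predicted, reference = set(predicted_pairs), set(reference_pairs)
    true_positives = len(predicted & reference)
    precision = true_positives / len(predicted) if predicted else 1.0
    recall = true_positives / len(reference) if reference else 1.0
    return precision, recall


# Toy identifiers, invented for illustration only.
rng = random.Random(42)
genome_a = {f"a{i}" for i in range(10)}
genome_b = {f"b{i}" for i in range(10)}
reference = {(f"a{i}", f"b{i}") for i in range(10)}

kept_a = subsample_genome(genome_a, 0.7, rng)
kept_b = subsample_genome(genome_b, 0.7, rng)
still_recoverable = recoverable_pairs(reference, kept_a, kept_b)

# Here the pairs predicted by an orthology method on (kept_a, kept_b)
# would be scored against `still_recoverable`.
print(len(kept_a), len(kept_b), len(still_recoverable))
```

Scoring against the recoverable pairs only separates two effects: orthologs that are simply absent from the input, and errors the method makes because the missing genes mislead it.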

Question: Can OMA calculate orthologs for single query sequence? on BioStars.
Dalquen DA, Altenhoff AM, Gonnet GH, Dessimoz C. 2013. The impact of gene duplication, insertion, deletion, lateral gene transfer and sequencing error on orthology inference: a simulation study. PLoS One 8:e56925.

DLIGHT -- Detecting lateral gene transfer coding biology

A few years ago, we developed an algorithm to identify lateral gene transfers. Despite promising preliminary results, our attention was soon diverted to other, more pressing problems. Meanwhile, DLIGHT ('Distance Likelihood-based Inference of Genes Horizontally Transferred') has mostly languished in neglect. But this could change, because of the confluence of 3 circumstances: (1) preliminary work from recent (unpublished) student projects suggests that DLIGHT is highly competitive and thus worth being pursued; (2) we have evidence that lateral gene transfer is highly disruptive to orthology prediction (an area in which we have some interest), so identifying lateral transfer is important; (3) you, dear prospective intern, could help us finish the comparative study of DLIGHT with other state-of-the-art tools, and help us deploy DLIGHT on our computer cluster to identify laterally transferred genes among the hundreds of bacterial genomes in the OMA database.

Dessimoz, C, Margadant, D, Gonnet, GH. DLIGHT - Lateral Gene Transfer Detection Using Pairwise Evolutionary Distances in a Statistical Framework. RECOMB 2008, Lect Notes Comput Sc, Springer 2008; 4955:315-330
Dalquen, DA, Anisimova, M, Gonnet, GH, et al. ALF--A Simulation Framework for Genome Evolution. Mol Biol Evol 2011

Additional project opportunities

Last modified on November 22nd, 2016.