Orthology practical with OMA

Author: Natasha Glover
Last updated Aug 2018

You are interested in studying gene families. The exercises below each use a different way to access OMA: the browser (http://omabrowser.org), OMA standalone, and programmatic access via the API.

Part 1a: OMA Browser

You ran a network analysis and found that the human gene with UniProt ID OR2L5_HUMAN is involved in an interesting pathway. Search for this gene on the OMA homepage.

  1. Based on the Gene Ontology annotations, what function is this protein probably involved in?
  2. Go to the orthologs table. How many 1:1 pairwise orthologs are there?
  3. How conserved is the domain architecture of these orthologs?
  4. Now have a look at the Hierarchical Orthologous Groups associated with this gene.
    1. What is the root level of this HOG, i.e. at which ancestral taxonomic level did this gene originate?
    2. How many genes are in this gene family?
    3. Which extant genomes have the most copies of this gene?
    4. Is there anything unusual about any species’ GC content?
    5. How many genes in this family (i.e. root HOG) are human genes?
    6. In which lineages did the duplications likely take place?
    7. How many genes in this family have a small gene length?
  5. Does this gene share any localised conserved synteny with any other species? If so, which ones?

Part 1b: OMA Browser

You recently read about carbohydrate-active enzymes (CAZymes) that are potentially involved in degradation of polysaccharides in A. bisporus when grown on compost. You want to know if this gene is conserved in Penicillium. Here is the protein sequence:

MLFKLASTVFLAQFFALTSAQTISGPFDCLPAGNSYTLCQNLWGRTSGVGSQSSTLVGSSGDSVSW STNWNWQNNQNSVKSYANIIADNAMGKQLSAVTSAPTSWSWSYETKSDPIRANVAYDLWLGASPVG APASRNSSYEIMVWLSRQGGIQPIGGPTASGIQLAGNTWTLWSGPNSNWQVLSFVSDTGDIPNFNA DFKEFFDYLVQNSGVSSQQYVQAIQAGEPFTGSANLVTHSYSVALN

Search for this protein on the OMA website.

Tip: search for the most similar sequence to this gene by selecting “Protein Sequence” from the drop-down menu and pasting part or all of the sequence.
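A sequence copied from a document often carries spaces or line breaks. The sketch below shows one way to tidy it before pasting into a search box. The function name clean_sequence and the set of accepted letters are assumptions made for illustration, not something the OMA browser requires.

```python
import re

# Assumption: the 20 standard amino acids plus X for an unknown residue.
VALID_RESIDUES = set("ACDEFGHIKLMNPQRSTVWYX")


def clean_sequence(raw: str) -> str:
    """Strip all whitespace from a pasted protein sequence and upper-case it."""
    seq = re.sub(r"\s+", "", raw).upper()
    unexpected = sorted(set(seq) - VALID_RESIDUES)
    if unexpected:
        raise ValueError(f"unexpected characters: {unexpected}")
    return seq


# The first residues of the query sequence, deliberately broken up by whitespace.
fragment = "MLFKLASTVF LAQFFALTSA\nQTISGPFDCL"
print(clean_sequence(fragment))
```

The check on unexpected characters is only a guard against pasting a header line or stray digits along with the residues.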

  1. Consider now the orthologs predicted by OMA. In which phylogeny are they present?
  2. Is this gene found in any Penicillium species?
  3. How many in-paralogs of this gene are there in Agaricus bisporus at the Agaricomycetes level?

Part 2: OMA standalone

You want to know how conserved the mushroom genome is relative to the Ascomycota clade. Perform the following tasks and answer the following questions.

Export the following genomes from omabrowser.org: Schizosaccharomyces pombe (strain 972 / ATCC 24843), Saccharomyces cerevisiae (strain ATCC 204508 / S288c), Agaricus bisporus.

Tips:

Examine the contents of the archive. We have exported the precomputed all-by-all for our 3 fungal genomes in order to save time when running OMA standalone. The genomes are stored in the DB/ folder.

  1. In what format are the genome files? 
  2. Which species has the biggest genome and how many predicted proteins are in the file?
  3. In which folder are the precomputed all-by-all alignments stored?

Now we want to add our own, newly sequenced genome. (For demonstration purposes, this genome is reduced to cut down on computation time.) Add the following dummy fungal genome to your dataset: mygenome.fa
Tip: download and copy the genome to the DB folder
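Before launching a run, it can be worth checking that the new file parses as expected. The following is a minimal sketch that counts records, assuming mygenome.fa is a FASTA file in which every record starts with a ">" header line. The helper name count_fasta_records and the example records are made up for illustration.

```python
def count_fasta_records(lines):
    """Count FASTA records by counting header lines that start with '>'."""
    return sum(1 for line in lines if line.startswith(">"))


# Made-up example records; a real file would be read from disk instead.
example = [">gene1", "MKT", "LLV", ">gene2", "MSA"]
print(count_fasta_records(example))

# Usage on the copied genome (path assumed from the tip above):
# with open("DB/mygenome.fa") as handle:
#     print(count_fasta_records(handle))
```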

Now run OMA standalone on the 4 fungal genomes.

Tip: although it may be more convenient to install OMA standalone on your system, you don’t have to: simply navigate inside the OMA.2.2.0 folder, then run the following: bin/oma

The Output folder has now been created; take a look at its contents. OMA has estimated a rough species tree from the orthologous groups, using a distance-based method. (If you know the phylogeny of the species, you can provide a predefined species tree in the parameters file.)

  1. Examine the tree. To which species is your new genome closest?
    Tip: View tree using the online tree viewer PhyloIO (http://phylo.io)
  2. Not trusting this lowly distance tree reconstructed by OMA standalone, you set out to perform a proper tree inference: retrieve all orthologous groups containing genes from your genome, align the sequences, concatenate the alignments, and run a maximum likelihood tree method to infer a species tree.
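The concatenation step of such a workflow can be sketched as follows. This is a minimal illustration, not the procedure used by OMA: it assumes each orthologous group has already been aligned by an external tool and is held as a dictionary mapping a species code to its aligned sequence. The function name concatenate_alignments and the toy alignments are hypothetical. Species missing from a group are padded with gaps so that all rows of the concatenated alignment keep the same length.

```python
def concatenate_alignments(alignments, species):
    """Join per-group alignments into one concatenated alignment per species.

    alignments: list of dicts mapping species code -> aligned sequence.
    species:    list of species codes to include in the result.
    """
    parts = {sp: [] for sp in species}
    for aln in alignments:
        lengths = {len(seq) for seq in aln.values()}
        if len(lengths) != 1:
            raise ValueError("sequences in one alignment must have equal length")
        width = lengths.pop()
        for sp in species:
            # Pad with gaps when a species has no gene in this group.
            parts[sp].append(aln.get(sp, "-" * width))
    return {sp: "".join(chunks) for sp, chunks in parts.items()}


# Toy alignments for two groups (made-up residues).
groups = [
    {"YEAST": "MK-T", "SCHPO": "MKAT", "AGABI": "MRAT"},
    {"YEAST": "LL", "AGABI": "LI"},
]
for code, row in concatenate_alignments(groups, ["YEAST", "SCHPO", "AGABI"]).items():
    print(code, row)
```

The concatenated rows can then be written out in whatever format the chosen maximum likelihood program expects.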

Examine the pairwise orthologs:

  1. Which pair of genomes has the most orthologous pairs of genes? How many?
  2. How many 1:1 orthologous genes does YEAST have with SCHPO?
    Tip: grep -v "#" [file] | cut -f 5 | sort | uniq -c
  3. Compared to AGABI, how many YEAST orthologs have undergone duplication since the AGABI-YEAST speciation?
    Tip: number of many:1 or many:many pairs

Part 3: using the OMA R package

If you want to retrieve information from the OMA browser in R, you can also use the OmaDB package.

Install the OmaDB Bioconductor package and load it.
  1. Identify the protein sequences from part 2 using the OmaDB API. Tip: use the function mapSequence().
  2. Retrieve the sequence of its orthologs using the OmaDB API.
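For readers who prefer to query the API outside R, the two steps above can be sketched in Python using only the standard library. The base URL and the endpoint paths (sequence/ with a query parameter, and protein/<id>/orthologs/) are assumptions about the OMA REST API and should be checked against its documentation before use. The helper names are hypothetical. The example only builds the request URLs; fetch_json is defined for completeness but is not called here.

```python
import json
from urllib.parse import quote, urlencode
from urllib.request import urlopen

# Assumed base URL of the OMA REST API; verify against the API documentation.
API_ROOT = "https://omabrowser.org/api"


def sequence_search_url(sequence):
    """Build a URL that asks the API to identify a protein sequence (assumed endpoint)."""
    return f"{API_ROOT}/sequence/?{urlencode({'query': sequence})}"


def orthologs_url(entry_id):
    """Build a URL listing the orthologs of one protein entry (assumed endpoint)."""
    return f"{API_ROOT}/protein/{quote(entry_id)}/orthologs/"


def fetch_json(url):
    """Download a URL and decode the JSON response (requires network access)."""
    with urlopen(url) as response:
        return json.load(response)


print(sequence_search_url("MLFKLASTVF"))
print(orthologs_url("OR2L5_HUMAN"))
```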