Rbbt

Ruby bioinformatics toolkit


Translating IDs

This post shows how to use the Rbbt workflow Translation to translate between identifier formats. You will need a working Ruby installation and recent versions of the gems rbbt-util, rbbt-rest and rbbt-sources (e.g., gem install rbbt-util).

Prepare infrastructure

Install the Translation workflow by running rbbt workflow install Translation, then bootstrap the installation with rbbt workflow cmd Translation bootstrap. This prepares the system for identifier translation of Homo sapiens (Hsa) for the may2009 (hg18) and jun2011 (hg19) builds and for the most recent build. The default organism is Hsa, which stands for the most recent build of H. sapiens.

To avoid building all the resources from scratch, run the following command before the bootstrap: rbbt file_server add Organism http://se.bioinfo.cnio.es. This will download precompiled resources from the server. Indices and caches will still need to be prepared.

Alternatively, set up a remote Translation workflow by running rbbt workflow remote add Translation http://se.bioinfo.cnio.es/Translation

Use

You can now translate a list of gene ids as follows:

rbbt workflow task Translation translate --format "Ensembl Gene ID" --genes "TP53|MDM2"

or, using the following command, which retains the correspondence between ids:

rbbt workflow task Translation tsv_translate --format "Ensembl Gene ID" --genes "TP53|MDM2"

You may change the format to any of the formats in the corresponding identifier file:

rbbt tsv info ~/.rbbt/share/organisms/Hsa/identifiers

For simplicity you may also use:

rbbt workflow task Translation formats 

The most common are:

  • Ensembl Gene ID
  • Associated Gene Name
  • UniProt SwissProt Accession

Note that case is always important.

You may use the organism codes 'Hsa', 'Hsa/may2009' and Organism.default_code('Hsa'). Other organisms are supported: 'Mmu' and 'Sce'. Any Ensembl archive date can be specified, but the infrastructure for that build will need to be prepared as well.
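As an illustration of switching organism codes, the following sketch translates the same genes against two builds and writes one file per build. It reuses the short options -f, -o and -g that appear in the alias examples in the Tips section. The output file naming is a hypothetical convention, and the rbbt call is skipped when the command is not installed.

```shell
# Organism codes taken from the text; both builds are prepared by the bootstrap
for organism in "Hsa" "Hsa/may2009"; do
  # Hypothetical output name: replace "/" so the code is a valid file name
  out="ensembl_$(printf '%s' "$organism" | tr '/' '_').txt"
  if command -v rbbt >/dev/null 2>&1; then
    # Same task as in the Use section, with the organism set through -o
    rbbt workflow task Translation translate -f "Ensembl Gene ID" -o "$organism" -g "TP53|MDM2" > "$out"
  else
    echo "rbbt not found; skipping $organism" >&2
  fi
done
```

Comparing the resulting files is one way to spot identifiers that changed between builds.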

Genes can be read from STDIN by passing the character - as the value:

cat gene_names.txt | rbbt workflow task Translation tsv_translate --format "Ensembl Gene ID" --genes -
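Gene lists exported from other tools often contain blank lines or repeated entries. A minimal sketch of cleaning such a list with standard Unix tools before piping it in is shown here. The file contents are made up for illustration, and the rbbt call only runs if the command is available.

```shell
# Hypothetical input: one gene symbol per line, with a blank line and a duplicate
printf 'TP53\nMDM2\n\nTP53\n' > gene_names.txt

# Drop blank lines, then sort and remove duplicates
clean_genes=$(grep -v '^[[:space:]]*$' gene_names.txt | sort -u)
printf '%s\n' "$clean_genes"

# Pipe the cleaned list to the workflow only if rbbt is installed
if command -v rbbt >/dev/null 2>&1; then
  printf '%s\n' "$clean_genes" | rbbt workflow task Translation tsv_translate --format "Ensembl Gene ID" --genes -
fi
```

Sorting changes the input order, which is harmless with tsv_translate because it keeps each input id next to its translation.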

Tips

If you find yourself translating entities often, you might want to set up some aliases. For instance:

rbbt alias gene_ensembl workflow task Translation translate -f "Ensembl Gene ID" -o Hsa -g -

so that you can now type:

cat genes.txt | rbbt gene_ensembl

Likewise:

rbbt alias gene_name workflow task Translation translate -f "Associated Gene Name" -o Hsa -g -

The Genomics workflow has a task called names that takes a TSV file and substitutes all identifiers with human-readable names. The identifiers it recognizes include genes, pathways, protein domains and several other entities.