Asciinema
If you would like to see this same information in terminal-cast form, please visit the following casts:
Rbbt tutorial 1.- HTS workflow basic usage
Rbbt tutorial 2.- Sample workflow
Rbbt tutorial 3.- downstream analysis
Setup
This guide assumes that you have access to a working installation of Rbbt prepared for genomics analysis. If not, check out the alternative ways to set up an Rbbt installation. If you decide to use the singularity image instead of a native installation, consider downloading this version, which has most of the software tools you will need pre-installed.
Rbbt basics
The rbbt command is the gateway to all Rbbt functionality. Chain sub-commands together to reach the particular functionality you are interested in. Type rbbt to see the list of sub-commands, which appears at the end of the help page. The sub-commands in white (e.g. migrate) are final commands and have their own documentation; the ones in blue (e.g. workflow) have further sub-commands. Type rbbt workflow to see the workflow sub-commands; we will be using one of them to execute workflow tasks: rbbt workflow task.
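For example, a first exploration of the command hierarchy could look like this (the exact listing depends on your installation):

rbbt             # top-level help; the list of sub-commands is at the end
rbbt workflow    # sub-commands of 'workflow', among them 'task'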
There should be a number of workflows already available for you; type rbbt workflow list to see them. One of them should be the HTS workflow (High Throughput Sequencing). This workflow allows you to perform analyses for DNA and RNA-Seq, like alignment and variant calling. To see the tasks available for the HTS workflow, type rbbt workflow task HTS. To learn how to use any of these tasks, such as the mutect2 task, type rbbt workflow task HTS mutect2 --help. You can see from the output of the last command that the mutect2 task is actually a workflow in itself that builds upon several other tasks. In particular it has a task performing the actual calling with GATK’s Mutect2, and others that filter out artifacts and low-quality variants. Different steps of the workflow consume different inputs, which may take default values if omitted. Note that option flags have shorthand forms, for instance -h is equivalent to --help; we will make use of these shorthand forms arbitrarily throughout these tutorials to shorten commands [1].
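Putting this together, the path from the list of workflows down to the documentation of a single task looks like this:

rbbt workflow list                       # workflows available in this installation
rbbt workflow task HTS                   # tasks of the HTS workflow
rbbt workflow task HTS mutect2 --help    # inputs and dependencies of mutect2 (or -h)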
A tiny synthetic example
Gather the file bundle and unpack it in a work directory. This is a synthetic sample generated using the NEAT genReads tool. You should see several directories containing the FASTQ files, aligned BAMs, and golden BAMs. Golden BAMs have reads aligned to exactly the place they were simulated from, while the aligned BAMs are the result of running the standard alignment pipelines with BWA following the GATK best practices.
Test the HTS workflow on synthetic data
We can use the synthetic sample to start learning about the HTS workflow. Let’s begin by doing variant calling with Mutect2. Type rbbt workflow task HTS mutect2 -h to examine the input options, in particular: tumor, normal, and reference. To try it out, point tumor and normal to the corresponding aligned BAM files from the synthetic sample, not the golden ones [2], and specify hg38 as the reference. Before you actually run it, let us first examine the work that Rbbt intends to do: add the flag --provenance to see the dependency tree. Each line in the provenance tree is indented to reflect the dependency structure, and has four components: status of the step, workflow, task name, and step path. Initially they should all be in the status notfound. Now we can remove the --provenance flag and actually run the workflow.
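As a sketch, with the BAM paths standing in for wherever you unpacked the bundle:

# Inspect the planned work without running anything
rbbt workflow task HTS mutect2 --tumor BAM/tumor.bam --normal BAM/normal.bam --reference hg38 --provenance

# Remove the flag to actually run the workflow
rbbt workflow task HTS mutect2 --tumor BAM/tumor.bam --normal BAM/normal.bam --reference hg38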
Let it run for a while, like a couple of minutes, and interrupt it with <ctrl>-C. Now display the provenance again and take a look at what was completed and what remains. Run the workflow again and watch it resume from the last completed step. This time, however, try adding the flag --log 0 to increase the log verbosity and show DEBUG info, which includes the output from the different command-line tools. To make 0 the default log level, type rbbt log 0; the default is to show INFO level logs, which corresponds to level 4.
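For instance, with the same placeholder paths as before:

# After interrupting with <ctrl>-c, check what was completed
rbbt workflow task HTS mutect2 --tumor BAM/tumor.bam --normal BAM/normal.bam --reference hg38 --provenance

# Resume from the last completed step, showing DEBUG output this time
rbbt workflow task HTS mutect2 --tumor BAM/tumor.bam --normal BAM/normal.bam --reference hg38 --log 0

# Make DEBUG (level 0) the default log level from now on
rbbt log 0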
Examining the result and the meta-data
Once the workflow is complete the resulting VCF will appear on STDOUT; the logs go to STDERR, so they will not interfere with the output and you can safely pipe it. The -pf flag, which is the short form of --printpath, will show the path to the result file instead of its content. Let’s use it to examine the meta-data (note that we could also get the path from the top of the provenance tree). Notice how alongside the result file there is a .info file [3]. This file contains meta-data that describe the provenance of the result: the inputs used, time spent, versions of tools used, etc. Try it out by typing rbbt workflow info <path>, which shows just basic information, including the input options specific to that job. The flag --all shows extended information, and the flag --recursive traverses the dependencies to collect all the inputs used across them. To examine the provenance tree from a job result file, instead of at the moment of running the workflow, you can do rbbt workflow prov <path>.
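A sketch of that sequence, assuming the flags --all and --recursive can be combined in a single call:

# Print the path of the result instead of its contents
path=$(rbbt workflow task HTS mutect2 --tumor BAM/tumor.bam --normal BAM/normal.bam --reference hg38 -pf)

# Basic meta-data: input options, status, timings
rbbt workflow info "$path"

# Extended meta-data, collecting the inputs used across all dependencies
rbbt workflow info "$path" --all --recursive

# Provenance tree reconstructed from the result file
rbbt workflow prov "$path"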
One of the benefits of synthetic samples is that we have a truth set to validate that our pipelines are working correctly. Use the task compare_vcf to compare your variant calls with the truth set used by NEAT: rbbt workflow task HTS compare_vcf --vcf_1 <your file path> --vcf_2 <path to truth/somatic.vcf>. It should show a large number of common variants; it’s OK that they are reported as “common but different”, this only means that the additional information of the variants, like variant allele frequency, differs, and that information is in fact entirely missing from the truth VCF file.
One can also see the status of running jobs, lock files, indices being created, and other things using the command rbbt system status HTS --all, where the workflow HTS can be substituted by any other workflow, or by the word all to report on all workflows; the flag --all indicates that all jobs must be shown, instead of only failed ones. To clean up, do rbbt system clean HTS; running jobs will not be erased or interfered with, except on an HPC environment, where jobs might be executing on a different node and might seem aborted to the login node.
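For example:

# Show all HTS jobs, not only the failed ones
rbbt system status HTS --all

# The same report across every workflow
rbbt system status all --all

# Clean up HTS job files; running jobs are left alone
rbbt system clean HTS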
Input options, hashes, redoing work, and configuration keys
I hope by now you have noticed one thing: Rbbt does not place the result files of a workflow execution in isolation from other workflow executions; they all end up under ~/.rbbt/var/jobs/. The idea is that workflow executions are treated as deterministic given their inputs, so the same workflow issued with the same inputs can reuse the same results. A few precautions are in place to avoid mishaps: the path associated with a step is determined by digesting the values of the inputs, so different inputs receive a different hash [4].
To force part of the work to be redone you must first clean it up. You can use the flag --clean to clean just the task you are calling, leaving the dependencies as they were, or the flag --recursive_clean to clean all the dependencies and redo the entire job. Alternatively, you can use the flag --clean_task to clean specific dependencies and force the process to be updated from there on.
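A sketch of the three variants, using a shell variable for the input options from the previous runs:

OPTS="--tumor BAM/tumor.bam --normal BAM/normal.bam --reference hg38"

# Redo only the top-level task, reusing its dependencies
rbbt workflow task HTS mutect2 $OPTS --clean

# Redo the entire job, dependencies included
rbbt workflow task HTS mutect2 $OPTS --recursive_clean

# Redo from a specific dependency onwards
rbbt workflow task HTS mutect2 $OPTS --clean_task mutect2_pre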
If a dependency is updated, the downstream steps become out of date, and this will show in the provenance tree. To avoid unnecessary computation, this does not by itself trigger redoing the downstream steps unless the flag --update is used when running the workflow. Try running the mutect2_pre task, which is the one where Mutect2 actually does the variant calling, and then the mutect2 task with and without the --update flag. Then try running the mutect2 task with --clean_task mutect2_pre. But before you do so, read the next paragraph!
Rbbt makes a distinction between input options, which dictate what the result is, and performance options, which dictate how the result is computed. Input options get digested into the path hash and lead to different provenance trees; performance options don’t alter the dependency tree and are specified through something called config_keys. For example, Mutect2 has no internal support for concurrency, but concurrency can be implemented by dividing the work across different genomic intervals, an approach often called sharding. To activate sharding we can specify the following flag: --config_keys 'shard true mutect2,cpus 10 mutect2'. This is interpreted as follows: the configuration key shard should be set to true when queried with the key mutect2, and the key cpus should be set to 10 when queried also by the key mutect2. If you turn the log level to 0 and scrutinize the output you will see the logs for when these keys are retrieved. Try running this command: rbbt workflow task HTS mutect2 -n <path to normal BAM> -t <path to tumor BAM> -r hg38 --clean_task mutect2_pre -pf --config_keys 'shard true mutect2,cpus 10 mutect2'. It’s a very small sample, but it should show some difference. Check the meta-data for the mutect2_pre step to verify that the keys show up. Configuration keys can be placed in particular locations so that one does not have to write them all the time, but we will see that later.
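One way to check, assuming rbbt workflow info accepts the step paths shown in the provenance tree:

# Find the path of the mutect2_pre step in the provenance of the result
rbbt workflow prov "$path"

# Inspect its meta-data; the shard and cpus keys should show up
rbbt workflow info <path to the mutect2_pre step> --all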
Using the Sample workflow
When running the mutect2 task you didn’t need to specify where the reference data and their indices were; Rbbt found them for you. A similar thing can be done for the input data: once they are properly organized, Rbbt can find them for you. This is done with the Sample workflow, which puts together a portfolio of different queries that can be asked about samples (i.e. patients), figuring out the right workflow to answer them.
Preparing the synthetic data for the Sample workflow
To incorporate a new sample into the Sample workflow, the data must be placed somewhere Rbbt can find it. One such place is the home directory of the user, under ~/.rbbt/share/data/studies/. For example, to prepare the synthetic data we create the folder ~/.rbbt/share/data/studies/ARGO-NEAT, inside which we need the following directory tree:
~/.rbbt/share/data/studies/ARGO-NEAT/
├── WGS
│   ├── ARGO-NEAT
│   │   ├── tumor_read1.fq.gz
│   │   └── tumor_read2.fq.gz
│   └── ARGO-NEAT_normal
│       ├── normal_read1.fq.gz
│       └── normal_read2.fq.gz
└── options
    └── reference
This creates the ARGO-NEAT study, which contains the sample, also named ARGO-NEAT, along with its matched normal, ARGO-NEAT_normal. Here we copy or link the FASTQ files. The options/reference file contains the text hg38 to make sure this is the reference used to analyze this sample, unless otherwise specified.
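A minimal sketch of the commands that build this layout, with the source FASTQ paths as placeholders for wherever you unpacked the bundle:

STUDY=~/.rbbt/share/data/studies/ARGO-NEAT

mkdir -p $STUDY/WGS/ARGO-NEAT $STUDY/WGS/ARGO-NEAT_normal $STUDY/options

# Copy (or link, see below) the FASTQ files into place
cp FASTQ/tumor/tumor_read1.fq.gz FASTQ/tumor/tumor_read2.fq.gz $STUDY/WGS/ARGO-NEAT/
cp FASTQ/normal/normal_read1.fq.gz FASTQ/normal/normal_read2.fq.gz $STUDY/WGS/ARGO-NEAT_normal/

# Pin the reference used for this study
echo hg38 > $STUDY/options/reference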
It’s a matter of preference, but I find that in practice it is often practical to keep all the data as it was provided originally in some other directory, data in this case, and link to it as Rbbt likes it. This way one can move this directory around easily and never lose track of the original contents:
~/.rbbt/share/data/studies/ARGO-NEAT/
├── WGS
│   ├── ARGO-NEAT
│   │   ├── tumor_read1.fq.gz -> ../../data/FASTQ/tumor/tumor_read1.fq.gz
│   │   └── tumor_read2.fq.gz -> ../../data/FASTQ/tumor/tumor_read2.fq.gz
│   └── ARGO-NEAT_normal
│       ├── normal_read1.fq.gz -> ../../data/FASTQ/normal/normal_read1.fq.gz
│       └── normal_read2.fq.gz -> ../../data/FASTQ/normal/normal_read2.fq.gz
├── data
│   ├── BAM
│   │   ├── normal.bai
│   │   ├── normal.bam
│   │   ├── normal.bam.md5
│   │   ├── tumor.bai
│   │   ├── tumor.bam
│   │   └── tumor.bam.md5
│   ├── FASTQ
│   │   ├── normal
│   │   │   ├── normal_read1.fq.gz
│   │   │   └── normal_read2.fq.gz
│   │   └── tumor
│   │       ├── tumor_read1.fq.gz
│   │       └── tumor_read2.fq.gz
│   ├── golden_BAM
│   │   ├── normal_golden.bam
│   │   ├── normal_golden.bam.bai
│   │   ├── tumor_golden.bam
│   │   └── tumor_golden.bam.bai
│   ├── known_sites
│   │   ├── 1000G_phase1.indels.vcf.gz
│   │   ├── 1000G_phase1.indels.vcf.gz.tbi
│   │   ├── 1000G_phase1.snps.high_confidence.vcf.gz
│   │   ├── 1000G_phase1.snps.high_confidence.vcf.gz.tbi
│   │   ├── Miller_1000G_indels.vcf.gz
│   │   ├── Miller_1000G_indels.vcf.gz.tbi
│   │   ├── af-only-gnomad.vcf.gz
│   │   ├── af-only-gnomad.vcf.gz.tbi
│   │   ├── dbsnp_146.vcf.gz
│   │   ├── dbsnp_146.vcf.gz.tbi
│   │   ├── panel_of_normals.vcf.gz
│   │   ├── panel_of_normals.vcf.gz.tbi
│   │   ├── small_exac_common_3.vcf.gz
│   │   └── small_exac_common_3.vcf.gz.tbi
│   ├── reference
│   │   ├── hg38.dict
│   │   ├── hg38.fa
│   │   ├── hg38.fa.amb
│   │   ├── hg38.fa.ann
│   │   ├── hg38.fa.bwt
│   │   ├── hg38.fa.byNS.interval_list
│   │   ├── hg38.fa.fai
│   │   ├── hg38.fa.gz
│   │   ├── hg38.fa.gz.amb
│   │   ├── hg38.fa.gz.ann
│   │   ├── hg38.fa.gz.bwt
│   │   ├── hg38.fa.gz.byNS.interval_list
│   │   ├── hg38.fa.gz.fai
│   │   ├── hg38.fa.gz.gzi
│   │   ├── hg38.fa.gz.pac
│   │   ├── hg38.fa.gz.sa
│   │   ├── hg38.fa.gzi
│   │   ├── hg38.fa.pac
│   │   └── hg38.fa.sa
│   └── truth
│       ├── germline.vcf
│       ├── somatic.vcf
│       └── tumor.vcf
└── options
    └── reference
Mutect2 using the Sample workflow
With the synthetic data prepared for the Sample workflow, running mutect2 is as simple as typing rbbt workflow task Sample --workflows HTS mutect2 --jobname ARGO-NEAT. This will not only run the variant calling, it will also produce the aligned BAM files from the FASTQ files.
There are two things to note. The first is that the name of the sample is given as the --jobname of the job, which, as we will see, has the additional benefit of helping manage the results. The second is the presence of the flag --workflows HTS, which indicates that the Sample workflow must incorporate functionalities from the HTS workflow, which is where the alignment and variant calling functionalities are defined. We will illustrate later how this allows us to easily extend our functionalities.
The provenance of the workflow we plan to use includes some new files; you can consult it using the --provenance flag. The --help flag also shows this provenance in a more succinct form that omits certain repetitions:
Sample#mutect2
  HTS#mutect2
    HTS#mutect2_clean
      HTS#mutect2_filters
        HTS#mutect2_pre
          HTS#BAM_orientation_model
          HTS#contamination
            HTS#BAM_pileup_sumaries
          Sample#BAM
            HTS#BAM
              HTS#BAM_rescore
                HTS#BAM_sorted
                  HTS#BAM_duplicates
                    HTS#BAM_bwa
                      HTS#uBAM
                        HTS#mark_adapters
          Sample#BAM_normal
We can recognize a portion of our previous workflow, the HTS part that does the Mutect2 calling. In addition we observe a few new things. First, on top of HTS#mutect2 is Sample#mutect2; this is a trivial task used only for housekeeping purposes: it simply asks the previous one to run and links to its results. As we will see, this makes the results more organized, hiding away all the intermediate steps under the HTS workflow and leaving the Sample workflow with only the interesting results. The next thing we see is a new workflow branch under Sample#BAM. People initiated in the GATK best practices for aligning short reads will recognize these steps. A similar branch would hang under Sample#BAM_normal, but it was abridged in this report.
Try comparing the dependency graph we just examined with what you get from using the --provenance flag. Notice the paths of the files. Under ~/.rbbt/var/jobs/Sample we will only see four files:
/home/mvazque2/.rbbt/var/jobs/Sample/mutect2/ARGO-NEAT.vcf
/home/mvazque2/.rbbt/var/jobs/Sample/BAM/ARGO-NEAT.bam
/home/mvazque2/.rbbt/var/jobs/Sample/BAM/ARGO-NEAT_normal.bam
/home/mvazque2/.rbbt/var/jobs/Sample/BAM_normal/ARGO-NEAT.bam
The last one is the same file as BAM/ARGO-NEAT_normal.bam above it. Notice that, since we didn’t override any of the default values, no hashes appear in the job names. You can always know where to find your BAM files, VCFs, or any other results for any sample; just like Rbbt does!
The command we presented in this section didn’t have any config_keys. Let us revisit this issue, since there are now a few more keys that control the execution. To avoid having to type them all the time, place the following text into the file ~/.rbbt/etc/config_profile/HTS:
spark false gatk
shard true gatk
cpus 18 shard
cpus 14 haplotype
cpus 18 mutect2
cpus 15 samtools_index
cpus 15 bwa
cpus 20 sort
samtools_sort true bam_sort
threads 15 samtools_sort_threads
max_mem 2G samtools_sort_max_mem
You may adjust the values to your particular environment, or you may have different profiles with different values. The first line disables the SPARK version of all GATK functions; turning it to true will use the SPARK version for any tool that has one available. In our tests we found that these can have worse performance than the sharded versions. If, for instance, you turn on SPARK for the MarkDuplicates step by specifying spark true MarkDuplicates, Rbbt will skip the sorting step, since the SPARK version already outputs sorted reads. Otherwise SPARK and non-SPARK versions are mostly equivalent. All sharding will be done over 18 different processes, except for haplotype calling, which is done over 14. Sorting of the BAM file is done using samtools instead of GATK for performance reasons.
We are now ready to issue our workflow: rbbt workflow task Sample -W HTS mutect2 -jn ARGO-NEAT -ck HTS. Note how, when a config key specification consists of only a single token, HTS in this case, it loads the file with that name found under the etc/config_profile directory.
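A sketch of the whole setup, writing the profile shown above and then using it:

mkdir -p ~/.rbbt/etc/config_profile
cat > ~/.rbbt/etc/config_profile/HTS <<EOF
spark false gatk
shard true gatk
cpus 18 shard
cpus 14 haplotype
cpus 18 mutect2
cpus 15 samtools_index
cpus 15 bwa
cpus 20 sort
samtools_sort true bam_sort
threads 15 samtools_sort_threads
max_mem 2G samtools_sort_max_mem
EOF

# Run the Sample workflow using the profile
rbbt workflow task Sample -W HTS mutect2 -jn ARGO-NEAT -ck HTS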
[1] Shorthand forms of flags, in particular for task input options, are often computed from the long form of the inputs of tasks, and any collisions that might happen are resolved automatically in order of appearance. This means that the same input might have different shorthand forms depending on the workflow task being executed. Please consult the help for any workflow task before using the shorthand forms.

[2] The golden files were generated using a minified version of the reference, to ensure that the size of the sample is very small. This unfortunately makes them incompatible with a regular hg38 reference. To analyze the golden BAM files you would need to point the reference and the known_sites to their minified versions provided in the file bundle.

[3] For performance reasons the .info file is in a serialized form particular to Ruby, although Rbbt can be instructed to write them as YAML by setting the environment variable RBBT_INFO_SERIALIZER=YAML; this is not recommended.

[4] Rbbt can be instructed not to digest the inputs but to list them as-is in the path by setting the environment variable RBBT_INPUT_JOBNAME=true, but this is also not recommended.