TSV
Tab-separated value (TSV) files are one of the most common formats in biology, so it is very convenient to have easy ways to manipulate and work with them. A proper TSV file for Rbbt represents how entities, listed in the first column, are associated with different values, specified in the remaining columns. Every column is identified by a header at the top of the file. The most convenient way to access this information programmatically would be to do something like this:
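As a minimal sketch, assuming a plain nested Ruby Hash stands in for a loaded TSV (the sample data and the variable name tsv are illustrative, not Rbbt's API):

```ruby
# A plain nested Hash standing in for a loaded TSV: entity => { field => value }.
# The data and variable names here are illustrative assumptions only.
tsv = {
  "A" => { "ValueB" => "B", "ValueC" => "C" },
  "a" => { "ValueB" => "b", "ValueC" => "c" },
}

key   = "A"       # the entity, taken from the first column
field = "ValueC"  # the column to query

value = tsv[key][field]
puts value
```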
Where key is the entity we are interested in and field is the column we wish to query. The most similar structure in programming languages is the Hash.
Rbbt uses hashes to load TSV files, but can transparently replace them with a Tokyo Cabinet database for fast access.
The TSV module, one of the most important components of Rbbt, strives to make this possible for any type of data, regardless of its size, formatting details, or provenance. With that in place, we do not need to set up databases or decide on database schemas; we just query the data directly, and the TSV module takes care of everything so that access to the data is very fast. The cost is disk space and a severe lag the first time the infrastructure is built.
Classification of TSV files
Typical TSV files can be classified into four types: :single, :list, :flat, and :double.
The following code reads a TSV file as a :single TSV. Here each entry of the resulting hash contains a single value.
- Result
A: B
a: b
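The :single shape can be illustrated without Rbbt using only the Ruby standard library. This is a sketch under stated assumptions, not Rbbt's TSV API: the two-column input text and the variable names are made up for the example.

```ruby
# Illustrative input (an assumption): one key column and one value column.
text = "A\tB\na\tb\n"

# Build key => value from the first two columns of each row.
single = text.each_line(chomp: true).to_h do |line|
  key, value = line.split("\t")
  [key, value]
end

# Print each entry as "key: value".
single.each { |key, value| puts "#{key}: #{value}" }
```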
In this next case the keys are not associated with a single value, but with a list of values. These TSVs are of type :list.
- Result
A ValueB: B
A ValueC: C
a ValueB: b
a ValueC: c
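A comparable :list shape can be sketched in plain Ruby. The sketch assumes a header on the first line, with a leading # that is stripped if present; the input text, the Id column name, and the variable names are illustrative and do not come from Rbbt.

```ruby
# Illustrative input (an assumption): a header line followed by one row per entity.
text = "#Id\tValueB\tValueC\nA\tB\tC\na\tb\tc\n"

header, *rows = text.lines(chomp: true)
_key_field, *fields = header.delete_prefix("#").split("\t")

# Build key => [value for first field, value for second field, ...].
list = rows.to_h do |row|
  key, *values = row.split("\t")
  [key, values]
end

# Pair each value with its field name when printing.
list.each do |key, values|
  fields.zip(values).each { |field, value| puts "#{key} #{field}: #{value}" }
end
```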
In the following example, the TSV file lists multiple values for a single field; this is type :flat.
- Result
A values: B, BB, BBB
a values: b, bb, bbb
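The :flat shape maps each key directly to one list of values. The plain-Ruby sketch below assumes that multiple values inside a cell are separated by a | character; that separator, the input text, and the variable names are assumptions made for illustration rather than details taken from this page.

```ruby
# Illustrative input (an assumption): one field whose cell holds several values.
text = "#Id\tvalues\nA\tB|BB|BBB\na\tb|bb|bbb\n"

header, *rows = text.lines(chomp: true)
_key_field, field = header.delete_prefix("#").split("\t")

# Build key => [value, value, ...] by splitting the single cell.
flat = rows.to_h do |row|
  key, cell = row.split("\t")
  [key, cell.split("|")]
end

flat.each { |key, values| puts "#{key} #{field}: #{values.join(', ')}" }
```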
Finally, :double TSVs have values that are lists of lists.
- Result
A ValueB: B, BB, BBB
A ValueC: C, CC, CCC
a ValueB: b, bb, bbb
a ValueC: c, cc, ccc
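The :double shape combines the two previous ideas: one list per field, and several values per list. As before, this is a plain-Ruby sketch rather than Rbbt's own code, and it assumes a # header line and a | separator inside cells; the input text and variable names are hypothetical.

```ruby
# Illustrative input (an assumption): two fields, each cell holding several values.
text = "#Id\tValueB\tValueC\nA\tB|BB|BBB\tC|CC|CCC\na\tb|bb|bbb\tc|cc|ccc\n"

header, *rows = text.lines(chomp: true)
_key_field, *fields = header.delete_prefix("#").split("\t")

# Build key => [[values of first field], [values of second field], ...].
double = rows.to_h do |row|
  key, *cells = row.split("\t")
  [key, cells.map { |cell| cell.split("|") }]
end

double.each do |key, lists|
  fields.zip(lists).each do |field, values|
    puts "#{key} #{field}: #{values.join(', ')}"
  end
end
```

Indexing twice, as in double["A"][1], then returns the full list of values for the second field of entity A.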