Rbbt

Ruby bioinformatics toolkit

View project onGitHub

Database sharding in Rbbt

I’ve implemented this functionality to help manage the dbNSFP database, with over 190M rows. In brief it works like this example (from the tests).

require 'rbbt-util'
require 'rbbt/persist/tsv'

TmpFile.with_file do |dir|
  sharder = Persist::Sharder.new dir, true, :float_array, 'HDB' do |key|
    key[-1]
  end

  size = 1_000_000
  sharder.write_and_read do
    size.times do |v| 
      sharder[v.to_s] = [v, v*2]
    end
  end
end

The sharder takes the normal parameters of Persist.open_database, that is, the path, whether to open it for writing, the serializer, and the type of DB. In addition it takes the sharding function, which is applied to each key. The result of the sharding function index one of the databases in the sharder path, which is a directory. New shards are created on-demand. The Sharder implements the same interface that other Persist Adapters. I have not tried to date to have it comply with the TSV interface, but it should in theory work out of the box.