Database sharding in Rbbt
I’ve implemented this functionality to help manage the dbNSFP database, with over 190M rows. In brief it works like this example (from the tests).
require 'rbbt-util'
require 'rbbt/persist/tsv'
TmpFile.with_file do |dir|
sharder = Persist::Sharder.new dir, true, :float_array, 'HDB' do |key|
key[-1]
end
size = 1_000_000
sharder.write_and_read do
size.times do |v|
sharder[v.to_s] = [v, v*2]
end
end
end
The sharder takes the normal parameters of Persist.open_database
, that is,
the path, whether to open it for writing, the serializer, and the type of DB.
In addition it takes the sharding function, which is applied to each key. The
result of the sharding function index one of the databases in the sharder path,
which is a directory. New shards are created on-demand. The Sharder implements
the same interface that other Persist
Adapters. I have not tried to date to
have it comply with the TSV interface, but it should in theory work out of the
box.