Database sharding in Rbbt
I’ve implemented this functionality to help manage the dbNSFP database, which has over 190M rows. In brief, it works like this example (from the tests).
The sharder takes the normal parameters of Persist.open_database, that is,
the path, whether to open it for writing, the serializer, and the type of DB.
In addition, it takes the sharding function, which is applied to each key. The
result of the sharding function indexes one of the databases in the sharder path,
which is a directory. New shards are created on demand. The Sharder implements
the same interface as other Persist adapters. I have not yet tried to make it
comply with the TSV interface, but in theory it should work out of the box.
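
To make the idea concrete, here is a minimal sketch of the routing logic described above, written in plain Ruby. It is not the Rbbt Sharder: ToySharder and its methods are hypothetical names invented for this illustration, plain Hashes stand in for the on-disk databases, and the path is only stored, never written to. It only shows how a sharding function applied to each key can pick a shard, with shards created the first time they are needed.

```ruby
# Toy illustration only: ToySharder is a hypothetical class, not Rbbt's Sharder.
class ToySharder
  attr_reader :path, :shards

  # path: directory that would hold one database per shard (not touched here)
  # shard_function: block applied to each key; its result names the shard
  def initialize(path, &shard_function)
    @path = path
    @shard_function = shard_function
    @shards = {} # shard name => Hash standing in for a real database
  end

  # Name of the shard that a given key belongs to
  def shard_name(key)
    @shard_function.call(key).to_s
  end

  # Writing creates the shard on demand
  def []=(key, value)
    shard = (@shards[shard_name(key)] ||= {})
    shard[key] = value
  end

  # Reading never creates a shard; a missing shard or key yields nil
  def [](key)
    shard = @shards[shard_name(key)]
    shard && shard[key]
  end

  # All keys across every shard
  def keys
    @shards.values.flat_map(&:keys)
  end
end

# Shard on the last character of the key (an arbitrary choice for the example)
sharder = ToySharder.new("/tmp/toy_shards") { |key| key[-1] }
sharder["rs101"] = 0.1
sharder["rs202"] = 0.2
sharder["rs311"] = 0.3
```

Here the keys ending in "1" land in one shard and the key ending in "2" in another. A good sharding function spreads keys evenly, so that no single database grows much larger than the rest.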