I’m working on a post about efficient random sampling with C, but today I came up with a pretty straightforward solution using SQLite3. Since the dataset I had in mind may be too big to fit into a reasonable amount of memory, and SQLite3 has a documented C API, that other post may never come to fruition. We’ll see…
Anyway, imagine you have some data in a SQLite3 database.
standage@localhost: ~$ sqlite3 temp.db SQLite version 3.6.12 Enter ".help" for instructions Enter SQL statements terminated with a ";" sqlite> CREATE TABLE ids (id int primary key); sqlite> .import data.txt ids sqlite> SELECT * FROM ids; 1 2 3 4 5 6 7 8 9 10
Create an additional column to hold the value indicating whether each value has been sampled.
sqlite> ALTER TABLE ids ADD COLUMN subset int not null default 0; sqlite> SELECT * FROM ids; 1|0 2|0 3|0 4|0 5|0 6|0 7|0 8|0 9|0 10|0
Now here’s the magic. This example samples 5 values at random. If you want to sample more or less , then just change that number and voilà!
sqlite> UPDATE ids SET subset = 1 WHERE id IN (SELECT id FROM ids ORDER BY RANDOM() LIMIT 5); sqlite> SELECT * FROM ids; 1|0 2|0 3|1 4|1 5|0 6|1 7|0 8|1 9|1 10|0
Actually, there’s no magic at all, it’s quite simple.