Random sampling with SQLite3

I’m working on a post about efficient random sampling with C, but today I came up with a pretty straightforward solution using SQLite3. Since the dataset I had in mind may be too big to fit into a reasonable amount of memory, and SQLite3 has a documented C API, that other post may never come to fruition. We’ll see…

Anyway, imagine you have some data in a SQLite3 database.

standage@localhost: ~$ sqlite3 temp.db
SQLite version 3.6.12
Enter ".help" for instructions
Enter SQL statements terminated with a ";"
sqlite> CREATE TABLE ids (id int primary key);
sqlite> .import data.txt ids
sqlite> SELECT * FROM ids;
1
2
3
4
5
6
7
8
9
10

Create an additional column to hold the value indicating whether each value has been sampled.

sqlite> ALTER TABLE ids ADD COLUMN subset int not null default 0;
sqlite> SELECT * FROM ids;
1|0
2|0
3|0
4|0
5|0
6|0
7|0
8|0
9|0
10|0

Now here’s the magic. This example samples 5 values at random. If you want to sample more or less , then just change that number and voilà!

sqlite> UPDATE ids SET subset = 1 WHERE id IN (SELECT id FROM ids ORDER BY RANDOM() LIMIT 5);
sqlite> SELECT * FROM ids;
1|0
2|0
3|1
4|1
5|0
6|1
7|0
8|1
9|1
10|0

Actually, there’s no magic at all, it’s quite simple.

Advertisements

One comment

Leave a Reply

Fill in your details below or click an icon to log in:

WordPress.com Logo

You are commenting using your WordPress.com account. Log Out / Change )

Twitter picture

You are commenting using your Twitter account. Log Out / Change )

Facebook photo

You are commenting using your Facebook account. Log Out / Change )

Google+ photo

You are commenting using your Google+ account. Log Out / Change )

Connecting to %s