Selecting a random subset of data or generating random numbers is a fairly common bioinformatics programming task. However, verifying correct behavior of software with a random component can be challenging for the uninitiated. Presumably the scientist writing the code would run the software on a handful of small examples and manually check to ensure the output is correct. But it’s not realistic to do this every time the software runs. How can one verify the behavior of such a program in an automated fashion? Alternatively, what if you find a case in which the program produces incorrect output? How do you reproduce that specific case for troubleshooting and testing?
The behavior of a programming language’s random number generator and related features can be made predictable by using a random seed. Initializing a random number generator with a seed ensures that the same “random” numbers are produced each time. For example, if you have a program that samples 500 paired-end reads from random locations of a 100 kb sequence of genomic DNA, running the program multiple times with the same random seed should produce the exact same 500 reads. You don’t want this kind of behavior in a production end-user environment, but for development and testing it turns out to be very useful.
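To make the read-sampling example concrete, here is a minimal sketch in Python. The function `sample_read_pairs`, its parameters, the read and fragment lengths, and the toy genome are all hypothetical and chosen purely for illustration; a real read simulator would also reverse-complement the second read and model sequencing errors.

```python
import random


def sample_read_pairs(genome, n_pairs, read_len=100, frag_len=300, seed=None):
    """Sample paired-end reads from random locations (hypothetical helper)."""
    # A private generator keeps this function from touching global random state
    rng = random.Random(seed)
    pairs = []
    for _ in range(n_pairs):
        # randint is inclusive on both ends, so the fragment always fits
        start = rng.randint(0, len(genome) - frag_len)
        fragment = genome[start:start + frag_len]
        # Simplification: read 2 is just the far end of the fragment,
        # without reverse complementing
        pairs.append((fragment[:read_len], fragment[-read_len:]))
    return pairs


# Toy 100 kb "genome", itself built from a seeded generator
genome_rng = random.Random(0)
genome = ''.join(genome_rng.choice('ACGT') for _ in range(100000))

first = sample_read_pairs(genome, 500, seed=42)
second = sample_read_pairs(genome, 500, seed=42)

# Same seed, same 500 read pairs: this is what makes automated checks possible
print(first == second)
```

Because the seeded result never changes from run to run, a comparison like `first == second` (or a comparison against stored expected output) can go straight into an automated test.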
When it comes to writing research software that has a random component, I would make the following recommendations.
- Always include an option to set the random seed. Whether your software is a simple R function, a command-line Python script, or a robust C++ library, the interface should let the user set the random seed. This gives them (and you) the ability to reproduce specific cases and to troubleshoot or verify the software’s behavior.
- Always report the random seed. Whether or not the user provides a random seed, reporting the seed used is crucial for reproducing the software’s behavior. When the code does not explicitly set the seed, programming languages will typically seed the generator internally, for example from the current system time or an operating-system source of randomness, and it’s not always possible to determine the exact value used. Therefore, when the end user does not specify a seed, a good approach is to generate a random number, report that number to the user, and then re-initialize the random number generator using that value as the seed. Subsequent invocations of the program can then reproduce the behavior by using the same seed.
Here is an example: a Python script that does an admittedly trivial task involving random numbers, but demonstrates how to get predictable behavior with random seeds.
    #!/usr/bin/env python3
    import argparse
    import random
    import sys

    # Define command-line interface
    parser = argparse.ArgumentParser(description='Generate random integers')
    parser.add_argument('-m', '--max', type=int, default=100,
                        help='maximum value; default is 100')
    parser.add_argument('-s', '--seed', type=int, help='random seed')
    parser.add_argument('n', type=int, help='# of random integers to generate')
    args = parser.parse_args()

    # Pick a random seed if the user did not provide one
    # (compare against None so that an explicit seed of 0 is honored)
    if args.seed is None:
        args.seed = random.randint(0, sys.maxsize)

    # Set and report the random seed
    random.seed(args.seed)
    print('Random seed:', args.seed, file=sys.stderr)

    # Do the trivial task
    print([random.randint(1, args.max) for i in range(args.n)])
And here is a demonstration of how the script works.