Category: C

Debugging in C

When I break out the GNU debugger (gdb command), it is unwillingly. I am dragged, kicking and screaming inside, to the realization that I have a substantial memory management issue in my C program that no amount of print statements will help me wrap my mind around. And only rarely does gdb actually help me isolate my issue. More often than not, I sink lots of time into that dark abyss, only to return to my print statements and banging my head against the wall until I recognize the problem and necessary solution.

Call me old fashioned, but the vast majority of my debugging is done with print statements. I’d like to think they’re thoughtfully placed to maximize diagnostic power, but honestly it’s usually just a messy process that sometimes generates more confusion before it yields enlightenment. Notwithstanding, this is still the quickest and most effective approach I’ve found to troubleshooting.

I recently came across a small utility that assists with debugging in C. It’s a single header file that drops easily into any C project, with a non-restrictive MIT license that permits any use (including commercial) as long as attribution is made to the original author.

What do I like about this little utility? What does it give me over my trusty fprintf statements? Not a whole lot honestly, but what it does provide is nice.

  • As mentioned before, it’s dead simple to integrate. Just drop it into your includes directory and use it immediately.
  • It uses some C11 magic to grab and print the variable name for you. More than half of the time I spend writing debugging print statements is spent formatting strings like ID=%s length=%d, so having this handled automagically is a huge potential time saver.
  • All debug statements can be disabled with a simple #define statement.
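
For a sense of how a macro like this can work, here is a minimal sketch of the general technique. It is not the utility's actual code: the names DEBUG_FMT, DEBUG_PRINT, and DISABLE_DEBUG are hypothetical, and the sketch assumes the preprocessor's # operator supplies the variable name while C11's _Generic chooses the format string.

```c
#include <stdio.h>

// Hypothetical names throughout; the utility's real macros may look different.
// _Generic (C11) picks a printf format based on the argument's static type.
// Types not listed fall through to "%p", which is only appropriate for pointers.
#define DEBUG_FMT(x) _Generic((x), \
    int: "%d",                     \
    long: "%ld",                   \
    unsigned long: "%lu",          \
    float: "%f",                   \
    double: "%f",                  \
    char *: "%s",                  \
    const char *: "%s",            \
    default: "%p")

#ifdef DISABLE_DEBUG
// With DISABLE_DEBUG defined, every debug statement compiles to nothing.
#define DEBUG_PRINT(x) ((void)0)
#else
// #x turns the macro argument into a string literal, which supplies the name.
#define DEBUG_PRINT(x)                                        \
  do {                                                        \
    fprintf(stderr, "%s:%d: %s = ", __FILE__, __LINE__, #x);  \
    fprintf(stderr, DEBUG_FMT(x), (x));                       \
    fputc('\n', stderr);                                      \
  } while(0)
#endif
```

Given `int length = 42;`, the statement `DEBUG_PRINT(length);` writes the file name and line number followed by `length = 42` to stderr, with no format string to type out. Compiling with -DDISABLE_DEBUG silences every such statement at once.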

Most of the programming I’ve been doing recently has been in Python, so I haven’t integrated this into any active projects yet. But I think I’ll give this a shot for the reasons mentioned above.

Using kseq.h with stdin

I recently published a tutorial on command-line interface design on figshare. In one section, I stated that a program’s CLI could be improved by allowing the user to provide input via stdin rather than an input file (and print output to stdout rather than an output file), enabling the user to stitch the program together with other programs and shell scripts if they so desire.

As I was considering how I might implement this design in a C program I wrote recently, I came across the following issue. When writing C programs, I typically use Heng Li’s kseq.h library for parsing sequence data. The basic usage is as follows.

gzFile fp = gzopen(seqfilename, "r");
kseq_t *seq = kseq_init(fp);
while(kseq_read(seq) >= 0)
{
  // process and/or store the sequences
}
kseq_destroy(seq);
gzclose(fp);

However, this posed a challenge if I wanted to use kseq.h to parse sequences from stdin, since the stdin file pointer is not a compatible argument for gzopen. I decided I would look more into the zlib library and see if I could find an open function that accepted a file pointer instead of a file name.

I wasn’t able to find one that accepted file pointers, but I did find one that accepted file descriptors. After a brief aside to remind myself of the subtle differences between file pointers and file descriptors, I came up with a solution along these lines.

FILE *instream = NULL;
if(inputfile == NULL)
  instream = stdin;
else
  instream = fopen(inputfile, "r"); // don't forget error checking

gzFile fp = gzdopen(fileno(instream), "r");
kseq_t *seq = kseq_init(fp);
while(kseq_read(seq) >= 0)
{
  // process and/or store the sequences
}
kseq_destroy(seq);
gzclose(fp);

The instream object is a file pointer that can refer to stdin or another open file handle. The file descriptor for this file pointer is obtained using the fileno function, and is then provided to the gzdopen function.

Using this approach, you can use kseq.h to parse your sequence data either from a file or from the standard input, improving the usability and flexibility of your tool!
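
The error checking glossed over in the snippet above might look something like the following. This is only a sketch: the helper name open_input_stream is hypothetical, and it covers just the stdio half of the problem, leaving the zlib calls exactly as shown earlier.

```c
#include <errno.h>
#include <stdio.h>
#include <string.h>

// Hypothetical helper: returns stdin when no file name is given, otherwise
// tries to open the named file. Returns NULL (after reporting why) on failure.
FILE *open_input_stream(const char *inputfile)
{
  if(inputfile == NULL)
    return stdin;

  FILE *instream = fopen(inputfile, "r");
  if(instream == NULL)
  {
    // Assumes a POSIX-like system, where fopen sets errno on failure.
    fprintf(stderr, "error: unable to open '%s': %s\n", inputfile, strerror(errno));
  }
  return instream;
}
```

The caller checks the result for NULL before handing fileno(instream) to gzdopen. One thing to keep in mind: according to the zlib manual, gzclose also closes the file descriptor that was given to gzdopen, so instream should not be read from afterward.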

Variable argument lists with ANSI C

As a rule, all functions in C must be explicitly defined: that is, the code must indicate the type of the value returned by the function (if any), and the number and types of any variables passed to the function. There are, however, exceptions to this rule, the most notable of which (in my opinion) is the printf function.

printf("Howdy!\n");
printf("Program '%s' has %d arguments\n", argv[0], argc - 1);
printf("Locus %s[%lu, %lu] has %lu genes\n", seqid, start, end, numgenes);

Here, the printf function is being called 3 different times, with 3 completely different parameter profiles, and yet it compiles and runs correctly.

Today I was looking at how I can have similar functionality in functions that I define. This led me to the stdarg.h header (part of the C standard library). Using stdarg.h, you can define functions with variable-length parameter lists using the ... token. Then you can use the va_list data type to refer to the variable-length argument list, set it up and tear it down with the va_start and va_end macros, and either step through the arguments with va_arg or hand the whole list to a function such as vfprintf. Below is a simplified example of one of the ways I have used this functionality in my current project.

#include <stdarg.h>
#include <stdio.h>

void print_error_message(const char *format, ...)
{
  fputs("[Error] message: ", stderr);
  va_list ap;
  va_start(ap, format);
  vfprintf(stderr, format, ap);
  va_end(ap);
  fputs("\n", stderr);
}

int main(int argc, char **argv)
{
  print_error_message("you gave the program '%s' %d arguments", argv[0], argc - 1);
  return 1;
}

I can compile and run the example program like so.

standage@iMint ~ $ gcc -Wall -o test test.c 
standage@iMint ~ $ ./test 
[Error] message: you gave the program './test' 0 arguments
standage@iMint ~ $ ./test 1 2 3
[Error] message: you gave the program './test' 3 arguments
standage@iMint ~ $
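
The example above never touches the individual arguments; it hands the whole list to vfprintf. When a function needs to walk the arguments itself, the va_arg macro retrieves them one at a time. Here is a minimal sketch with a hypothetical function, sum_ints, which assumes the caller passes the number of integers as the first argument, since a variadic function has no built-in way of knowing how many arguments it received.

```c
#include <stdarg.h>

// Hypothetical example: adds up `count` int arguments.
int sum_ints(int count, ...)
{
  int total = 0;
  va_list ap;
  va_start(ap, count);
  for(int i = 0; i < count; i++)
    total += va_arg(ap, int); // each call returns the next argument, read as an int
  va_end(ap);
  return total;
}
```

With this definition, sum_ints(3, 10, 20, 12) returns 42. Passing fewer values than count promises, or values of a different type, is undefined behavior, so the compiler won't catch the mistake for you.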

If you wanted to write the message to a string (for later use) instead of printing it directly to a file handle, something like this would work.

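void print_error_message_to_string(char *message, size_t size, const char *format, ...)
{
  va_list ap;
  va_start(ap, format);
  // The caller passes the buffer size: sizeof(message) would only give the size of a pointer
  vsnprintf(message, size, format, ap);
  va_end(ap);
}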
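
When the final length of the message isn't known in advance, one option is to let vsnprintf measure it first. The sketch below uses a hypothetical function, format_message, and relies on two C99 features: vsnprintf called with a NULL buffer and a size of 0 returns the number of characters it would have written, and va_copy makes a second copy of the argument list, which is needed because a va_list can't be reused once it has been passed to vsnprintf.

```c
#include <stdarg.h>
#include <stdio.h>
#include <stdlib.h>

// Hypothetical helper: returns a newly allocated, formatted string that the
// caller must free, or NULL on failure.
char *format_message(const char *format, ...)
{
  va_list ap, ap_copy;
  va_start(ap, format);
  va_copy(ap_copy, ap);

  // First pass: measure the required length without writing anything.
  int length = vsnprintf(NULL, 0, format, ap);
  va_end(ap);

  char *message = NULL;
  if(length >= 0)
    message = malloc((size_t)length + 1);

  // Second pass: write the message, including the terminating null byte.
  if(message != NULL)
    vsnprintf(message, (size_t)length + 1, format, ap_copy);
  va_end(ap_copy);
  return message;
}
```

The trade-off is that the format string gets processed twice, in exchange for never truncating the message or guessing at a buffer size.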

Parsing Fasta files with C: kseq.h

The C programming language is quickly becoming my favorite. For many tasks, an interpreted language like Perl is still my first choice, since I can usually write the script very fast and superfast performance isn’t always a huge priority. However, now that I am becoming increasingly comfortable and quick coding in C, I’m finding more and more tasks where it’s worth spending a little bit more time coding to get the performance benefit I want.

I appreciate the understanding that comes from implementing everything from scratch, but I also appreciate the common programmer warning not to reinvent the wheel. For example, I often implement data structures from scratch, but I am not quite ready to invest the time required to implement a Fasta parser that is capable of handling all the many different cases that might come up. This wasn’t a big deal for me though, since most of the code I’ve been writing lately is designed to analyze gene annotations in GFF3 format, not genome sequences in Fasta format.

Well, this changed recently. I was working on some code as part of a collaboration with another department. The code needed to be fast (read: implemented in C) and it involved sequence analysis (read: needs a Fasta parser). I was forced to finally get serious about exploring the different options that exist out in the open source space. There are a few, but I settled on the kseq.h library. The library is essentially a self-contained C header file, so it’s extremely portable. Also, it can handle both Fasta and Fastq input, either in plain text or gzip-compressed. I’ve been pleased so far with this library, and although I’ve only used it for one project, I plan on using it in the future and would recommend it to anyone who’s looking.

Here is some sample source code from the kseq.h home page.

#include <zlib.h>
#include <stdio.h>
#include "kseq.h"
// STEP 1: declare the type of file handler and the read() function
KSEQ_INIT(gzFile, gzread)

int main(int argc, char *argv[])
{
	gzFile fp;
	kseq_t *seq;
	int l;
	if (argc == 1) {
		fprintf(stderr, "Usage: %s <in.seq>\n", argv[0]);
		return 1;
	}
	fp = gzopen(argv[1], "r"); // STEP 2: open the file handler
	seq = kseq_init(fp); // STEP 3: initialize seq
	while ((l = kseq_read(seq)) >= 0) { // STEP 4: read sequence
		printf("name: %s\n", seq->name.s);
		if (seq->comment.l) printf("comment: %s\n", seq->comment.s);
		printf("seq: %s\n", seq->seq.s);
		if (seq->qual.l) printf("qual: %s\n", seq->qual.s);
	}
	printf("return value: %d\n", l);
	kseq_destroy(seq); // STEP 5: destroy seq
	gzclose(fp); // STEP 6: close the file handler
	return 0;
}

Function pointers in C

I cut my programming teeth on Java, and I have a deep appreciation for principles of object-oriented design. One of my projects as a graduate student, however, has given me the opportunity to do a lot of coding in C. I now have a deep appreciation for the performance that C offers over Java, C++, Perl, etc, and the understanding that comes from implementing data structures from scratch. One thing that initially intimidated me about C is that, unlike Java and C++, there is no built-in syntax for object-orientation. However, after seeing some good examples, I realized that object-orientation is a principle and not a language construct, and I’ve had no problems implementing these principles in my C code.
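
As a minimal illustration of that idea (the Counter type and its functions are hypothetical, invented for this example), a struct can hold an object's state while a family of functions prefixed with the type name plays the role of constructor, methods, and destructor.

```c
#include <stdlib.h>

// Hypothetical "class": the struct holds the state...
typedef struct
{
  unsigned long count;
  unsigned long step;
} Counter;

// ...a constructor allocates and initializes an instance...
Counter *counter_new(unsigned long step)
{
  Counter *counter = malloc(sizeof(Counter));
  if(counter == NULL)
    return NULL;
  counter->count = 0;
  counter->step = step;
  return counter;
}

// ...methods take the instance as their first argument (an explicit "this")...
unsigned long counter_increment(Counter *counter)
{
  counter->count += counter->step;
  return counter->count;
}

// ...and a destructor releases the memory.
void counter_delete(Counter *counter)
{
  free(counter);
}
```

Nothing in the language enforces the convention, but code that only touches a Counter through these functions gets much of the encapsulation that a class would provide.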

The biggest lessons I’ve had to learn with C have to do with memory management. Java hides this all from the programmer, but with C, the programmer is responsible for freeing every piece of memory they allocate. One concept I’ve found very useful in this regard is the C language function pointer construct. A function pointer is a variable that holds the address of a function, and like any variable its value can be changed: it can refer to different functions at different stages of program execution. This is extremely useful in a variety of cases, but I primarily use it when writing destructors for my data structures. A generalized data structure like a linked list or a hash map is agnostic to the type of data it contains. That means when it’s time to free the memory occupied by the data structure, the data structure itself doesn’t know how to free the memory occupied by each element it contains. Hard-coding a particular free function would severely limit the usefulness of the data structure.

Enter function pointers. With function pointers, my data structures do not need to know which function to call to free all of the associated memory. They simply make a call through the function pointer, which the programmer has pointed at the correct free function to call at that moment.

Syntax

The syntax of a C function pointer is pretty simple.

return_value_type (*function_pointer)(arg1_type, arg2_type, ..., argn_type);

For example, if you wanted to define a function pointer that accepts 3 integers as arguments and returns a float, you would define it like so.

float (*my_fp)(int, int, int);

Now at this point, you can use the my_fp variable to call any function that takes 3 ints and returns a float. For example, if you have two compatible functions defined like this…

#include <math.h> // declares log(), used by transform below

float mean(int a, int b, int c)
{
  return (a + b + c) / 3.0;
}

float transform(int a, int b, int c)
{
  int sum = a + b + c;
  return (a + c - log(b)) / log(sum);
}

…then you can assign the function pointer to call either of them and invoke those functions through the function pointer, like so.

my_fp = &mean;
float first = my_fp(1, 3, 9);

my_fp = &transform;
float second = my_fp(1, 3, 9);

printf("first: %.4f, second: %.4f\n", first, second);
// the output would be "first: 4.3333, second: 3.4704"
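
The C standard library itself relies on this mechanism. The qsort function, for instance, knows nothing about the elements it sorts and instead takes a pointer to a comparison function supplied by the caller. A small sketch follows; compare_ints and sort_ints are hypothetical names chosen for this example.

```c
#include <stdlib.h>

// Comparison function with the signature qsort expects: it returns a negative,
// zero, or positive value depending on how the two elements should be ordered.
int compare_ints(const void *a, const void *b)
{
  int x = *(const int *)a;
  int y = *(const int *)b;
  return (x > y) - (x < y); // avoids the overflow that "x - y" could cause
}

// Hypothetical wrapper: sorts an array of ints in ascending order.
void sort_ints(int *values, size_t count)
{
  // A function name used as an argument converts to a pointer to that function.
  qsort(values, count, sizeof(int), compare_ints);
}
```

Sorting in descending order, or sorting structs by one of their fields, requires no change to qsort at all, only a different comparison function.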

In action

The example above is pretty silly and trivial, so I also wanted to show how I am actually using function pointers in my code. Below is the destructor function for a hash map class I recently wrote. The first argument to the destructor function is a pointer to the hash map data structure itself, and the second argument is a function pointer to be used for freeing memory occupied by all of the objects contained in the hash map.

void hashmap_delete(Hashmap *hash, void (*valuefreefunc)(void *))
{
  HashmapItem *item, *temp;
  int i;

  if(hash == NULL)
    return;

  // Iterate through the hash table
  for(i = 0; i < hash->size; i++)
  {
    item = hash->table[i];
    while(item != NULL)
    {
      // Each item in the hash table is a key/value pair. The keys are all char arrays,
      // so we can simply call "free" to release that memory. However, the hash map
      // doesn't know what data type the values are, so it calls the function pointer.
      // It is therefore the programmer's job to make sure that the function pointer is
      // pointing to the appropriate free function for hash maps containing objects of a
      // particular data type.
      temp = item;
      item = item->next;
      free(temp->key);
      if(valuefreefunc != NULL)
        valuefreefunc(temp->value);
      free(temp);
      free(temp);
    }
  }

  free(hash->table);
  free(hash);
}
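
Since the Hashmap and HashmapItem types aren't shown here, the same idea can be tried out with a much smaller container. The following is a self-contained sketch, not the hash map code itself: the ListNode type and its functions are hypothetical and exist only to show a caller choosing the free function.

```c
#include <stdlib.h>

// Hypothetical minimal container: a singly linked list of untyped values.
typedef struct ListNode
{
  void *value;
  struct ListNode *next;
} ListNode;

// Pushes a value onto the front of the list; returns the new head, or NULL
// if allocation fails (in which case the original list is left untouched).
ListNode *list_push(ListNode *head, void *value)
{
  ListNode *node = malloc(sizeof(ListNode));
  if(node == NULL)
    return NULL;
  node->value = value;
  node->next = head;
  return node;
}

// Frees every node; values are freed only through the caller's function pointer.
// Returns the number of nodes released.
size_t list_delete(ListNode *head, void (*valuefreefunc)(void *))
{
  size_t freed = 0;
  while(head != NULL)
  {
    ListNode *temp = head;
    head = head->next;
    if(valuefreefunc != NULL)
      valuefreefunc(temp->value);
    free(temp);
    freed++;
  }
  return freed;
}
```

For a list of malloc'ed values, list_delete(list, free) releases everything. For a list whose values are owned elsewhere, list_delete(list, NULL) frees only the nodes and leaves the values alone.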