Using kseq.h with stdin

I recently published a tutorial on command-line interface design on figshare. In one section, I stated that a program’s CLI could be improved by allowing the user to provide input via stdin rather than an input file (and print output to stdout rather than an output file), enabling the user to stitch the program together with other programs and shell scripts if they so desire.

As I was considering how I might implement this design in a C program I wrote recently, I came across the following issue. When writing C programs, I typically use Heng Li’s kseq.h library for parsing sequence data. The basic usage is as follows.

gzFile fp = gzopen(seqfilename, "r");
kseq_t *seq = kseq_init(fp);
while(kseq_read(seq) >= 0)
{
  // process and/or store the sequences
}
kseq_destroy(seq);
gzclose(fp);

However, this posed a challenge if I wanted to use kseq.h to parse sequences from stdin, since the stdin file pointer is not a compatible argument for gzopen. I decided I would look more into the zlib library and see if I could find an open function that accepted a file pointer instead of a file name.

I wasn’t able to find one that accepted file pointers, but I did find one that accepted file descriptors. After a brief aside to remind myself of the subtle differences between file pointers and file descriptors, I came up with a solution along these lines.

FILE *instream = NULL;
if(inputfile == NULL)
  instream = stdin;
else
  instream = fopen(inputfile, "r"); // don't forget error checking

gzFile fp = gzdopen(fileno(instream), "r");
kseq_t *seq = kseq_init(fp);
while(kseq_read(seq) >= 0)
{
  // process and/or store the sequences
}
kseq_destroy(seq);
gzclose(fp);

The instream object is a file pointer that can refer to stdin or to another open file. The file descriptor for this file pointer is obtained using the fileno function, and is then provided to the gzdopen function.
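To make the pointer/descriptor relationship concrete, here is a minimal sketch, assuming a POSIX system. The helper name stream_descriptor is hypothetical and only for illustration; fileno itself is the POSIX call used above.

```c
#define _POSIX_C_SOURCE 200809L // ask for POSIX declarations such as fileno
#include <stdio.h>

// Hypothetical helper: return the file descriptor behind a stdio stream,
// or -1 if no stream was given.
int stream_descriptor(FILE *stream)
{
  if(stream == NULL)
    return -1;
  return fileno(stream);
}
```

On a POSIX system, stream_descriptor(stdin) normally returns 0, and that integer is what gzdopen receives in place of a file name.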

Using this approach, you can use kseq.h to parse your sequence data either from a file or from the standard input, improving the usability and flexibility of your tool!
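The choice between a file and stdin can also be wrapped in a small helper. This is only a sketch: open_input is a hypothetical name, and treating "-" as a synonym for stdin is a common command-line convention rather than anything kseq.h or zlib requires.

```c
#include <stdio.h>
#include <string.h>

// Hypothetical helper: choose the input stream from a command-line argument.
// Returns stdin when path is NULL or "-"; otherwise returns the result of
// fopen, which is NULL on failure, so the caller must still check it.
FILE *open_input(const char *path)
{
  if(path == NULL || strcmp(path, "-") == 0)
    return stdin;
  return fopen(path, "r");
}
```

The returned stream can then be handed to gzdopen(fileno(instream), "r") exactly as in the snippet above.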

5 comments

    • Daniel Standage
      #include "kseq.h"
      #include <stdio.h>
      #include <string.h>
      #include <zlib.h>
      
      KSEQ_INIT(gzFile, gzread)
      
      int main(int argc, char **argv)
      {
        FILE *instream;
        gzFile fp;
        kseq_t *seq;
      
        // read from stdin if no file name is given, or if it is "-"
        if(argc < 2 || strcmp(argv[1], "-") == 0)
          instream = stdin;
        else
        {
          instream = fopen(argv[1], "r");
          if(instream == NULL)
          {
            fprintf(stderr, "error opening file '%s'\n", argv[1]);
            return 1;
          }
        }
        
        fp = gzdopen(fileno(instream), "r");
        seq = kseq_init(fp);
        while(kseq_read(seq) >= 0)
        {
          // do what you want to do here.
        }
        kseq_destroy(seq);
        gzclose(fp);
        return 0;
      }
      
  1. Yifang

    Thanks!
    How can I decide whether the console argument is “inputfile” or “instream”?
    Following your first post, I was trying to use the variable “inputfile”. I was thinking:

     if  inputfile provided  go with inputfile pointer;
    if "instream" is provided 
    use "instream"; 

    but I’m not sure how to put it in, which gave me a hard time.
    By the way, is there a preview function for comments before submission?
    Thanks a lot!

    • Daniel Standage

      I don’t think there is a preview, but you should be able to edit your comment if it doesn’t look right.

      Your other questions involve very basic C programming concepts. That’s not something this post or this blog is well equipped to help you with. But to be brief, inputfile and instream are just variable names. I could change them to anything I want (bob or poo or foobar) and the program would work the same. In this case, I just used argv[1] because that is how I’m passing the name of the Fasta file to the program.

      If this isn’t clear, you really need to do some reading up on basic programming concepts, especially command line arguments, before my blog is going to make much sense. The tutorial I link to at the beginning of this post is a good place to start.

  2. Germán Meléndrez Carballo

    Thanks for the post, you have saved me some hours of searching on the web.

    I checked the tutorial you mention at the beginning of the post, specifically the C source code. I see you use the “isatty” function. Maybe it would be a good idea to expand that section of the software to accept input from stdin when it comes from a pipeline (FIFO) or from the “<” operator (REG). maxschlepzig’s answer shows different command options together with “isatty” and the file type returned by the “stat” function: http://stackoverflow.com/questions/1312922/detect-if-stdin-is-a-terminal-or-pipe

    Best
    ~g.
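The check suggested in this comment might look roughly like the following sketch, assuming a POSIX system. The input_kind enum and the classify_fd function are hypothetical names used only for illustration; isatty, fstat, S_ISFIFO and S_ISREG are the standard POSIX facilities.

```c
#define _POSIX_C_SOURCE 200809L // ask for POSIX declarations
#include <sys/stat.h>
#include <unistd.h>

// Hypothetical classification of where a descriptor's data comes from.
typedef enum
{
  INPUT_TTY,     // interactive terminal
  INPUT_FIFO,    // pipe, e.g. "cat seqs.fa | prog"
  INPUT_REGULAR, // regular file, e.g. "prog < seqs.fa"
  INPUT_OTHER,   // anything else (device, socket, ...)
  INPUT_ERROR    // fstat failed, e.g. invalid descriptor
} input_kind;

// Classify a descriptor using isatty first, then the file type from fstat.
input_kind classify_fd(int fd)
{
  struct stat st;

  if(isatty(fd))
    return INPUT_TTY;
  if(fstat(fd, &st) != 0)
    return INPUT_ERROR;
  if(S_ISFIFO(st.st_mode))
    return INPUT_FIFO;
  if(S_ISREG(st.st_mode))
    return INPUT_REGULAR;
  return INPUT_OTHER;
}
```

A program could call classify_fd(STDIN_FILENO) at startup and, for instance, print a usage message when the result is INPUT_TTY and no input file was named.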
