Parsing Fasta files with C: kseq.h

The C programming language is quickly becoming my favorite. For many tasks, an interpreted language like Perl is still my first choice, since I can usually write the script very fast and superfast performance isn’t always a huge priority. However, now that I am becoming increasingly comfortable and quick coding in C, I’m finding more and more tasks where it’s worth spending a little bit more time coding to get the performance benefit I want.

I appreciate the understanding that comes from implementing everything from scratch, but I also appreciate the common programmer warning not to reinvent the wheel. For example, I often implement data structures from scratch, but I am not quite ready to invest the time required to implement a Fasta parser that is capable of handling all the many different cases that might come up. This wasn’t a big deal for me though, since most of the code I’ve been writing lately is designed to analyze gene annotations in GFF3 format, not genome sequences in Fasta format.

Well, this changed recently. I was working on some code as part of a collaboration with another department. The code needed to be fast (read: implemented in C) and it involved sequence analysis (read: has Fasta parser). I was forced to finally get serious about exploring the different options that exist out in the open source space. There are a few, but I settled with the kseq.h library. The library is essentially a self-contained C header file, so it’s extremely portable. Also, it can handle both Fasta and Fastq input, either in plain text or gzip-compressed. I’ve been pleased so far with this library, and although I’ve only used it for one project, I plan on using it in the future and would recommend it to anyone who’s looking.

Here is some sample source code from the kseq.h home page.

#include <zlib.h>
#include <stdio.h>
#include "kseq.h"
// STEP 1: declare the type of file handler and the read() function
KSEQ_INIT(gzFile, gzread)

int main(int argc, char *argv[])
{
	gzFile fp;
	kseq_t *seq;
	int l;
	if (argc == 1) {
		fprintf(stderr, "Usage: %s <in.seq>\n", argv[0]);
		return 1;
	}
	fp = gzopen(argv[1], "r"); // STEP 2: open the file handler
	seq = kseq_init(fp); // STEP 3: initialize seq
	while ((l = kseq_read(seq)) >= 0) { // STEP 4: read sequence
		printf("name: %s\n", seq->name.s);
		if (seq->comment.l) printf("comment: %s\n", seq->comment.s);
		printf("seq: %s\n", seq->seq.s);
		if (seq->qual.l) printf("qual: %s\n", seq->qual.s);
	}
	printf("return value: %d\n", l);
	kseq_destroy(seq); // STEP 5: destroy seq
	gzclose(fp); // STEP 6: close the file handler
	return 0;
}
Advertisements

10 comments

  1. Yifang

    Amazing libbrary, but it way beyond my comprehension to digest it. I’d like to ask you for some help on using this library. Is that OK with you? Thanks in advance!

      • yifangt

        Thanks Daniel!
        I went to the author/s site, seems not much information available. I am trying to play with this library and tried some code as following to try to reverse the sequence and the quality string of sample fastq file.

        #include 
        #include 
        #include 
        #include "kseq.h"
        
        // STEP 1: declare the type of file handler and the read() function
        KSEQ_INIT(gzFile, gzread)
        
        char *rev_comp(char *seq);
        char *strrev(char *str);
        
        int main(int argc, char *argv[])
        {
            gzFile fp;
            kseq_t *seq;
            int l;
        
        	char *entry_rc;
            char *qual_rv;
        
            if (argc != 2) {
                fprintf(stderr, "Usage: %s  ", argv[0]);
                return 1;
            }
        
            fp = gzopen(argv[1], "r"); 						// STEP 2: open the file handler
            seq = kseq_init(fp); 							// STEP 3: initialize seq
        
          
            while ((l = kseq_read(seq)) &gt;= 0) { 			// STEP 4: read sequence
              if (strlen(seq-&gt;seq.s) ==  strlen(seq-&gt;qual.s))   // Somehow local sequencer can produce unequal string of the two parts, 2bp difference.
        {
        //	entry_rc = malloc(sizeof(char)*strlen(seq-&gt;seq.s));			//Line 39:  This line is not needed! But Why????
        //	qual_rv =  malloc(sizeof(char)*strlen(seq-&gt;qual.s));		//Line 40: Not sure this is the right way either.
        	entry_rc = rev_comp(seq-&gt;seq.s);         //Line 41: Seems fine to have the DNA rev_complemented! 
        	qual_rv = strrev(seq-&gt;qual.s);			//Line 42: This line did not work! 
        	printf("&gt;%s\n%s\n%s\n%s\n%s\n", seq-&gt;name.s, seq-&gt;seq.s, seq-&gt;qual.s, entry_rc, qual_rv );
            }
         }
        
            kseq_destroy(seq); 								// STEP 5: destroy seq
            gzclose(fp); 									// STEP 6: close the file handler
        
            return 0;
        }
        
        /*************************** Functions ***************************************************/
        /*1. reverse and complement the sequence*/
        
        char *rev_comp(char *seq) {
            if( seq == NULL || !(*seq) ) 
        			return NULL;
        
            int i, j = strlen(seq)-1;
            char *seq_rc;
        
            seq_rc = malloc(sizeof(char) * (j+1));
        
            for(i=0; j&gt;=0; i++, j--) {
                *(seq_rc+i) = *(seq+j);
        		if (*(seq_rc+i) == 'A') *(seq_rc+i) ='T';
        		else if (*(seq_rc+i) == 'C') *(seq_rc+i) ='G';
        			else if (*(seq_rc+i) == 'G') *(seq_rc+i) ='C';
        				else if (*(seq_rc+i) == 'T') *(seq_rc+i) ='A';
        					else
        						 *(seq_rc+i) ='N';
            }
        
            return seq_rc;
        }
        
        
        /*2. reverse the quality string */
        
        char *strrev( char *s)
          {
          char  c;
          char* s0 = s - 1;
          char* s1 = s;
        
          /* Find the end of the string */
          while (*s1) ++s1;
        
          /* Reverse it */
          while (s1-- &gt; ++s0)
            {
            c   = *s0;
            *s0 = *s1;
            *s1 =  c;
            }
        
          return s;
          }
        

        My code compiled without error, but the quality string is not reversed. I double checked the strrev function, which seems working well.
        I am aware the problem may be related to memory allocation, so specially attention was paid to Line 39 ~ Line 42. My questions are:
        1) Why Line 39 is not needed to reverse complement the DNA sequence?
        2) What is wrong with Line 42 and what the correct way is to have the quality string reversed at Line 42?
        Thanks a lot!

  2. Sara El-Metwally

    Hello,
    Thank you very much for this valuable post.
    I am trying to implement this source code using FILE* handler rather than gzFile. I wrote the code using dev/c++ editor running on windows 7 (64-bit). I have two problems in the following code:
    First: In Step 1, if I tried to use KSEQ_INIT(FILE*, read) , I have this error message
    [Error] invalid conversion from ‘int’ to ‘FILE* {aka _iobuf*}’ [-fpermissive] ==>so I used the type of file descriptor instead.
    Second: I have this error [Error] ‘read’ was not declared in this scope. I am trying to change it to different files reading functions but it does not work. could you please help me to fix this problem.
    Thanks in advance.

    #include <stdio.h>
    #include "kseq.h"
    
    // STEP 1: declare the type of file handler and the read() function  
    KSEQ_INIT(int, read)
    
    int main(int argc, char** argv) {
        FILE* fp = fopen(argv[1], "r"); // STEP 2: open the file handler
        if(fp == 0) {
        perror("fopen");
        exit(1);
        }
        //kseq_t *seq = kseq_init(fp); // STEP 3: initialize seq 
        kseq_t *seq = kseq_init(fileno(fp)); // STEP 3: initialize seq
    
        int l;
    
        while ((l = kseq_read(seq)) >= 0) { // STEP 4: read sequence  
            printf("name: %s\n", seq->name.s);  
            if (seq->comment.l) printf("comment: %s\n", seq->comment.s);  
            printf("seq: %s\n", seq->seq.s);  
            if (seq->qual.l) printf("qual: %s\n", seq->qual.s);  
        }
    
        printf("return value: %d\n", l);  
        kseq_destroy(seq); // STEP 5: destroy seq
        fclose(fp);
    
        return (0);
    }
    

Leave a Reply

Fill in your details below or click an icon to log in:

WordPress.com Logo

You are commenting using your WordPress.com account. Log Out / Change )

Twitter picture

You are commenting using your Twitter account. Log Out / Change )

Facebook photo

You are commenting using your Facebook account. Log Out / Change )

Google+ photo

You are commenting using your Google+ account. Log Out / Change )

Connecting to %s