Quotation marks and R’s read.table function

Today I lost about 90 minutes of my life struggling with some strange behavior in R. I had a data file with 209675 rows and 30 columns. Using shell tools I checked and re-checked the number of lines in the file and the number of values in each line, and got the expected 209675 x 30. But when I tried to load the data into R using read.table, it created a 105330 x 30 data frame without any explanation, warning, or error.

I tried filling in blank fields with sentinel values, and a variety of other things, but still I could not get a data frame with the correct dimensions.

Finally, I tracked down the source of the issue to several single quotation marks in the file. One of the fields in the data file contained a functional prediction for a transcript, and occasionally these values used the single quote to refer to sequence orientation (such as PREDICTED: bis(5'-nucleosyl)-tetraphosphatase [asymmetrical]-like [Apis florea]). Any time R encountered a single quote, it treated everything from that quote up to the next one, tabs and line breaks included, as a single string value. That is how entire rows were silently merged into one field.
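One way to see this effect on a small scale is count.fields, which reports how many fields R finds on each line under a given quote setting. The sketch below uses a made-up four-line sample (the file contents and variable names are assumptions for illustration, not my real data). According to the count.fields documentation, a line that starts a quoted string spanning several lines is recorded as NA rather than as a field count.

```r
# Hypothetical tab-separated sample; rows t1 and t3 each contain one apostrophe.
sample_lines <- c(
  "id\tdescription",
  "t1\tbis(5'-nucleosyl)-tetraphosphatase",
  "t2\thypothetical protein",
  "t3\t3'-phosphoadenosine phosphosulfate reductase"
)
sample_path <- tempfile(fileext = ".txt")
writeLines(sample_lines, sample_path)

# Default quote set (double and single quotes): apostrophes open and close strings.
fields_default <- count.fields(sample_path, sep = "\t")

# Quote handling disabled: every apostrophe is an ordinary character.
fields_literal <- count.fields(sample_path, sep = "\t", quote = "")

print(fields_default)
print(fields_literal)
```

If the two vectors differ, or the first one contains NA values, quote characters in the data are the likely culprit.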

I have run into problems with quotes before, primarily with SQL files where quotation marks in strings had not been properly escaped. The solution in this case, however, was not to escape the single quotation marks, but simply to instruct R to treat them as ordinary characters.

data <- read.table("pdom.expression.data.txt", header=TRUE, sep="\t", quote="")
dim(data)

Here, the quote="" argument is the key. By default, read.table treats both double and single quotation marks as quote characters; by providing an empty string, I’m telling R to disable quote handling altogether. If instead I had wanted to keep quote handling and just use a different set of quote characters, I would provide those characters as the argument.
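Since R gave no warning when rows went missing, it may be worth adding an explicit check after loading. The following is a minimal sketch on a made-up sample file (the contents and the names expr and expected_rows are assumptions for illustration): it counts the raw lines independently with readLines and stops if the data frame does not have one row per data line.

```r
# Hypothetical tab-separated sample standing in for the real data file.
sample_lines <- c(
  "id\tdescription",
  "t1\tbis(5'-nucleosyl)-tetraphosphatase",
  "t2\thypothetical protein",
  "t3\t3'-phosphoadenosine 5'-phosphosulfate synthase"
)
sample_file <- tempfile(fileext = ".txt")
writeLines(sample_lines, sample_file)

# Read with quote handling disabled.
expr <- read.table(sample_file, header = TRUE, sep = "\t", quote = "",
                   stringsAsFactors = FALSE)

# Independent count of data rows: total lines minus the header line.
expected_rows <- length(readLines(sample_file)) - 1

# Fail loudly if rows were merged or dropped during parsing.
stopifnot(nrow(expr) == expected_rows)
```

This check assumes one record per line and no blank or comment lines in the file, which was the case for my data but may not hold for every input.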

# This would preserve default behavior, but only for double quotation marks
data <- read.table("pdom.expression.data.txt", header=TRUE, sep="\t", quote="\"")

# This would preserve default behavior for double quotation marks and asterisks
data <- read.table("pdom.expression.data.txt", header=TRUE, sep="\t", quote="\"*")
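To illustrate the double-quote-only variant, here is a small sketch on a made-up sample (the file contents and variable names are assumptions). One field is wrapped in double quotes and two others contain apostrophes. With quote="\"" the double quotes should be stripped as delimiters while the apostrophes stay in the values. The sketch also calls read.delim, the tab-delimited wrapper around read.table, which uses quote="\"" as its default.

```r
# Hypothetical sample: row t2 is wrapped in double quotes, rows t1 and t3 contain apostrophes.
quoted_lines <- c(
  "id\tdescription",
  "t1\tbis(5'-nucleosyl)-tetraphosphatase",
  "t2\t\"kinase, putative\"",
  "t3\t3'-5' exonuclease"
)
quoted_file <- tempfile(fileext = ".txt")
writeLines(quoted_lines, quoted_file)

# Only the double quote acts as a quote character; apostrophes are literal.
dq_only <- read.table(quoted_file, header = TRUE, sep = "\t", quote = "\"",
                      stringsAsFactors = FALSE)

# read.delim defaults to header = TRUE, sep = "\t", and quote = "\"".
via_delim <- read.delim(quoted_file, stringsAsFactors = FALSE)

print(dq_only$description)
print(dim(via_delim))
```

For tab-delimited files that contain apostrophes but no stray double quotes, read.delim may therefore work without any extra arguments.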

3 comments

  1. Mrinal Kar Mohapatra

    Thanks for the share Daniel, I was trying to understand the usage of “Quote” feature in “read”. I really appreciate it.
    Thanks, have a nice time.

  2. Dave K

    Hello Daniel! Thank you for your tips on quote=””. I have been struggling with text with embedded-quote problem too! Thank you! – Dave
