Today I lost about 90 minutes of my life struggling with some strange behavior in R. I had a data file with 209675 rows and 30 columns. Using shell tools I checked and re-checked the number of lines in the file and the number of values in each line, and got the expected 209675 x 30. But when I tried to load the data into R using read.table, it created a 105330 x 30 data frame without any explanation, warning, or error.
I tried filling in blank fields with sentinel values, and a variety of other things, but still I could not get a data frame with the correct dimensions.
Finally, I tracked down the source of the issue to several single quotation marks in the file. One of the fields in the data file contained a functional prediction for a transcript, and occasionally these values used the single quote to refer to sequence orientation (such as PREDICTED: bis(5′-nucleosyl)-tetraphosphatase [asymmetrical]-like [Apis florea]). Any time R encountered a single quote, it would consider everything beginning from that quote to the next quote as a single string value.
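To see the effect in isolation, here is a minimal sketch using a made-up table with four data rows (the ids and descriptions are hypothetical, not taken from my data file). Two of the descriptions contain a single quote, and with the default settings the rows between them should end up absorbed into a single field.

```r
# Hypothetical tab-separated sample: a header plus four data rows.
# Rows t2 and t4 each contain one single quote.
rows <- c(
  "id\tdescription",
  "t1\tkinase-like protein",
  "t2\tbis(5'-nucleosyl)-tetraphosphatase",
  "t3\thypothetical protein",
  "t4\tbis(5'-nucleosyl)-tetraphosphatase"
)

# Default quoting: the quote in row t2 opens a string that only closes
# at the quote in row t4, so the rows in between are absorbed into it.
swallowed <- read.table(text = paste(rows, collapse = "\n"),
                        header = TRUE, sep = "\t")

# Fewer rows than the four that were supplied.
print(dim(swallowed))
```

Because the quotes happen to pair up, R sees nothing malformed and has no reason to complain, which matches the silent row loss I saw.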
I have run into problems with quotes before, primarily with SQL files where quotation marks in strings had not been properly escaped. The solution in this case, however, was not to escape the single quotation marks, but just to instruct R to treat them as normal characters.
data <- read.table("pdom.expression.data.txt", header=TRUE, sep="\t", quote="")
dim(data)
The quote="" argument is the key. By providing an empty string, I’m essentially telling R that I want to disable the default quoting behavior altogether. If instead I had wanted to preserve the default behavior and just use a different set of quotation delimiters, I would just provide those delimiters as the argument.
# This would preserve default behavior, but only for double quotation marks
data <- read.table("pdom.expression.data.txt", header=TRUE, sep="\t", quote="\"")
# This would preserve default behavior for double quotation marks and asterisks
data <- read.table("pdom.expression.data.txt", header=TRUE, sep="\t", quote="\"*")
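Since R gave me no warning, a cheap sanity check right after loading would have saved me the 90 minutes. The following is only a sketch under some assumptions: the file has exactly one header line, no blank lines, and no comment lines. The temporary file and its contents are made up for illustration.

```r
# Write a small hypothetical tab-separated file to a temporary location.
path <- tempfile(fileext = ".txt")
writeLines(c(
  "id\tdescription",
  "t1\tkinase-like protein",
  "t2\tbis(5'-nucleosyl)-tetraphosphatase",
  "t3\thypothetical protein",
  "t4\tbis(5'-nucleosyl)-tetraphosphatase"
), path)

# Expected number of data rows: physical lines minus the header line.
expected_rows <- length(readLines(path)) - 1

# Fields per physical line, counted with quoting disabled.
fields_per_line <- count.fields(path, sep = "\t", quote = "")

# Load with quoting disabled and compare against the expectations.
data <- read.table(path, header = TRUE, sep = "\t", quote = "")
stopifnot(nrow(data) == expected_rows)
stopifnot(all(fields_per_line == ncol(data)))

# Remove the temporary file.
unlink(path)
```

If either stopifnot call fails, the data frame does not match the file on disk, and quoting (or comment characters) would be the first thing I would look at.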