Category: Pandas

Pandas: querying a data frame and the `SettingWithCopyWarning` message

I’m doing more and more of my data analysis with Python Pandas these days, but I’m far from an expert and still much more comfortable with R. The reason I persist, though, is that Python provides a complete data processing and analysis stack. In the past, I’ve typically used Perl or Python scripts to pre-process data, R to do statistics and visualization, and shell scripts to glue it all together. Python’s rapidly maturing libraries for numerics, statistics, and data analytics, together with an active community and the awesome IPython notebook, make the prospect of doing end-to-end analysis completely in Python quite compelling.

That said, I still have a lot to learn regarding Pandas. Earlier today I was doing some simple operations—taking a subset of a data frame and creating a new column by applying a transformation to another column—and I kept getting a cryptic SettingWithCopyWarning message. Here’s a dummy example that reproduces the message.

>>> # Load data into memory
>>> import pandas
>>> exons = pandas.io.parsers.read_table('test.dat')
>>> exons
                 Species  ID  Length   Context
0        Equus monoceros  e1     162      stop
1          Homo diutinus  e2     111     start
2  Draconis occidentalis  e3      51       cds
3    Basiliscus vulgaris  e4     114  complete
4          Equus pegasus  e5      95       utr
>>>
>>> # Subset the data
>>> ungulates = exons.loc[(exons.Species == 'Equus pegasus') |
                          (exons.Species == 'Equus monoceros')]
>>> ungulates
           Species  ID  Length Context
0  Equus monoceros  e1     162    stop
4    Equus pegasus  e5      95     utr
>>>
>>> # Create a new `LogLength` column
>>> import math
>>> ungulates['LogLength'] = ungulates['Length'].apply(lambda x: math.log(x, 10))
__main__:1: SettingWithCopyWarning: 
A value is trying to be set on a copy of a slice from a DataFrame.
Try using .loc[row_indexer,col_indexer] = value instead

See the the caveats in the documentation: http://pandas.pydata.org/pandas-docs/stable/indexing.html#indexing-view-versus-copy
>>>

I went on to plot the data and there didn’t seem to be any problems. I also followed the link provided in the warning message and read the documentation. If I’m reading things correctly, the warning is issued to inform users when they might be operating on a copy of the original data, rather than the original data itself. In this example, if I make changes to the ungulates variable and want those changes reflected in the original exons variable, then I do not want to be operating on a copy of the data but the original data itself.

However, in my case it wasn’t really a huge concern, so I was able to get rid of the warning message by making my intent to work on a copy of the data explicit.

>>> # Subset the data, but make an explicit copy
>>> ungulates = exons.loc[(exons.Species == 'Equus pegasus') |
                          (exons.Species == 'Equus monoceros')].copy()
>>> ungulates
           Species  ID  Length Context
0  Equus monoceros  e1     162    stop
4    Equus pegasus  e5      95     utr
>>>
>>> # Create a new `LogLength` column
>>> ungulates['LogLength'] = ungulates['Length'].apply(lambda x: math.log(x, 10))
>>>
>>> # No errors this time, woohoo!
>>>

If you’re interested in seeing the interactive Python interpreter in action, check out this asciicast.

Advertisements