DESeq2: colData issue

Question

DESeq2: Possible issue with colData and sorting

0

Entering edit mode

henrik.seidel ▴ 40

@henrikseidel-11484

Last seen 8.2 years ago

I found a very strange behavior of colData. Here it is.

First, I create a data.table containing information about samples, let's call it sampleInfo. This table contains (among many others) the columns ngsName, treatment, and treatmentDuration. I use this data.table for creating a DESeqDataSet:

```{r prepare_data}
dds <- DESeqDataSetFromMatrix(
  countData = pcCounts,
  colData = sampleInfo,
  design = ~ treatment + treatmentDuration + treatment:treatmentDuration)
vstDds <- varianceStabilizingTransformation(dds, blind = FALSE)
```

Later in the code I set a key to sampleInfo:

setkey(sampleInfo, ngsName)

This resorts the data.table sampleInfo. The issue is: from now on the colData(dds) is wrong - it has the same order of rows as the resorted sampleInfo table and not the original order of rows which sampleInfo had when dds was created. So to me this looks as if DESeqDataSetFromMatrix stores a reference to the sampleInfo table and not a copy of sampleInfo. This was very surprising and unexpected for me, it took me quite a while to track this down. I am not sure whether this behavior should be expected. One workaround is, of course, to explicitely generate a copy using something like "colData = data.frame(sampleInfo)".

Any ideas?

> sessionInfo()
R version 3.3.0 (2016-05-03)
Platform: x86_64-pc-linux-gnu (64-bit)
Running under: Red Hat Enterprise Linux Server release 6.5 (Santiago)

locale:
 [1] LC_CTYPE=en_US.UTF-8          LC_NUMERIC=C                  LC_TIME=en_US.UTF-8           LC_COLLATE=en_US.UTF-8       
 [5] LC_MONETARY=en_US.UTF-8       LC_MESSAGES=en_US.UTF-8       LC_PAPER=en_US.UTF-8          LC_NAME=en_US.UTF-8          
 [9] LC_ADDRESS=en_US.UTF-8        LC_TELEPHONE=en_US.UTF-8      LC_MEASUREMENT=en_US.UTF-8    LC_IDENTIFICATION=en_US.UTF-8

attached base packages:
 [1] grid      parallel  stats4    stats     graphics  grDevices utils     datasets  methods   base     

other attached packages:
 [1] car_2.1-3                  factoextra_1.0.3           cluster_2.0.4              dendextend_1.2.0          
 [5] gplots_3.0.1               org.Hs.eg.db_3.3.0         AnnotationDbi_1.34.0       ggfortify_0.2.0           
 [9] gridExtra_2.2.1            RColorBrewer_1.1-2         ggplot2_2.1.0              DESeq2_1.12.0             
[13] SummarizedExperiment_1.2.0 Biobase_2.32.0             GenomicRanges_1.24.0       GenomeInfoDb_1.8.0        
[17] IRanges_2.6.0              S4Vectors_0.10.0           BiocGenerics_0.18.0        edgeR_3.14.0              
[21] limma_3.28.1               xlsx_0.5.7                 xlsxjars_0.6.1             rJava_0.9-8               
[25] data.table_1.9.6           assertthat_0.1             pander_0.6.0               baybi_1.0.6               

loaded via a namespace (and not attached):
 [1] nlme_3.1-128        bitops_1.0-6        pbkrtest_0.4-6      prabclus_2.2-6      tools_3.3.0         R6_2.1.2           
 [7] rpart_4.1-10        KernSmooth_2.23-15  Hmisc_3.17-4        DBI_0.4             mgcv_1.8-14         colorspace_1.2-6   
[13] trimcluster_0.1-2   nnet_7.3-12         chron_2.3-47        quantreg_5.29       SparseM_1.7         labeling_0.3       
[19] diptest_0.75-7      caTools_1.17.1      scales_0.4.0        DEoptimR_1.0-4      mvtnorm_1.0-5       robustbase_0.92-6  
[25] genefilter_1.54.0   digest_0.6.9        foreign_0.8-66      minqa_1.2.4         XVector_0.12.0      lme4_1.1-12        
[31] RSQLite_1.0.0       mclust_5.2          BiocParallel_1.6.0  gtools_3.5.0        acepack_1.3-3.3     dplyr_0.4.3        
[37] magrittr_1.5        modeltools_0.2-21   Formula_1.2-1       Matrix_1.2-6        Rcpp_0.12.4.5       munsell_0.4.3      
[43] whisker_0.3-2       MASS_7.3-45         zlibbioc_1.18.0     flexmix_2.3-13      plyr_1.8.3          gdata_2.17.0       
[49] ggrepel_0.5         lattice_0.20-33     splines_3.3.0       annotate_1.50.0     locfit_1.5-9.1      knitr_1.12.3       
[55] fpc_2.1-10          geneplotter_1.50.0  XML_3.98-1.4        latticeExtra_0.6-28 nloptr_1.0.4        MatrixModels_0.4-1 
[61] gtable_0.2.0        tidyr_0.5.1         kernlab_0.9-24      xtable_1.8-2        class_7.3-14        survival_2.39-3    
[67] tibble_1.0

deseq2 summarizedexperiment • 3.4k views

ADD COMMENT • link updated 8.6 years ago by Martin Morgan 25k • written 8.6 years ago by henrik.seidel ▴ 40

score 0 · Answer 1 · 2016-09-15

0

Entering edit mode

Michael Love 43k

@mikelove

Last seen 10 days ago

United States

DESeqDataSet is a sub class of SummarizedExperiment. Can you produce a small, reproducible example with SummarizedExperiment and we can have the maintainers take a look?

ADD COMMENT • link 8.6 years ago Michael Love 43k

0

Entering edit mode

Will do so, might might take me a couple of days.

ADD REPLY • link 8.6 years ago henrik.seidel ▴ 40

0

Entering edit mode

I should add one observation I made. If I save a session image just before the setkey operation, load this session afterwards, and then continue with the rest of the script, everything works as expected. One explanation for this could be that what we have here is "copy on write", i.e. the DESeqDataSet stores a reference to the source table but fails to create a copy if the original sampleInfo table is changed. By saving and reloading the session image, I force a copy, and this causes everything to work fine afterwards. Just a theory, but could be true.

ADD REPLY • link 8.6 years ago henrik.seidel ▴ 40

0

Entering edit mode

DESeq2: colData issue

Henrik Seidel

2016-09-16

This is a short program demonstrating the issue with colData in the current implementation of DESeqDataSetFromMatrix.

Loading libraries:

library(data.table)
library(DESeq2)

Prepare data table:

dt <- data.table(id = c("B", "C", "A"), val = c(2, 3, 1))
print(dt)

##    id val
## 1:  B   2
## 2:  C   3
## 3:  A   1

First, we create a data frame in the way as it is done in DESeqDataSetFromMatrix:

df1 <- as(dt, "DataFrame")

Next, we create a data frame in the ways it should be done:

df2 <- as(dt, "data.frame")

df3 <- as.data.frame(dt)

df4 <- as(as.data.frame(dt), "DataFrame")

So let’s see what happens if we set a key in the data table dt:

setkey(dt, id)
print(dt)

##    id val
## 1:  A   1
## 2:  B   2
## 3:  C   3

And what are the consequences for df1, df2, df3 and df4?

print(df1)

## DataFrame with 3 rows and 2 columns
##            id       val
##   <character> <numeric>
## 1           A         1
## 2           B         2
## 3           C         3

So we can see that df1 has now a new order of rows, corresponding to the order of rows of the data table dt.

The other three data frames keep the original order:

print(df2)

##   id val
## 1  B   2
## 2  C   3
## 3  A   1

print(df3)

##   id val
## 1  B   2
## 2  C   3
## 3  A   1

print(df4)

## DataFrame with 3 rows and 2 columns
##            id       val
##   <character> <numeric>
## 1           B         2
## 2           C         3
## 3           A         1

Obviously, a simple as(..., "DataFrame") stores a reference to the first argument instead of a copy. I think you should use as(as.data.frame(colData), "DataFrame") in the DESeqDataSetFromMatrix function.

ADD REPLY • link 8.6 years ago henrik.seidel ▴ 40

0

Entering edit mode

Can you add SummarizedExperiment as a tag? Most likely a fix will have to come from this package which is a dependency, and we need to alert the maintainers by adding the tag.

ADD REPLY • link 8.6 years ago Michael Love 43k

0

Entering edit mode

Isn't this behaving as expected? You construct the SummarizedExperiment with an object that has pass-by-reference semantics, so the SummarizedExperiment also has pass-by-reference semantics? I didn't follow this closely, so could be missing something...

ADD REPLY • link 8.6 years ago Martin Morgan 25k

0

Entering edit mode

Is there a way to generate a message at object construction if something which will break the expected link between the "three tables" is provided?

ADD REPLY • link 8.6 years ago Michael Love 43k

0

Entering edit mode

There could be an explicit test for a data.table object, but to be honest I don't think this is a SummarizedExperiment issue since data.table's behavior is quite un-R like in this regard. I'm only saying that because there could be another type of object that acts in this way, too, but it's not clear how one could know that whatever-the-next-type-of-thing is has manipulate by reference semantics?

Perhaps one extreme would be to have SummarizedExperiment (and everything else) just defensively copy everything first, but you wouldn't want to do that, either.

With apologies to the OP, in my opinion, the onus can really only lie on the shoulders of the programmer using these manipulate-by-reference objects to be defensive in their programming when they are working with them and to be cognizant of how their behavior violates R's expected behavior ... especially when handing these objects over to "something else" which will manipulate it as it will.

ADD REPLY • link 8.6 years ago Steve Lianoglou ★ 13k

0

Entering edit mode

Well, I am probably not an R expert enough to decide what is un-R like and what is not. For me, at least, it was totally unexpected that DESeqDataSet maintained a reference to a variable in the global environment (the sampleInfo data table) and that changing this global variable later will make the DESeqDataSet invalid. If there are good arguments not to change this behaviour (the decision of which I leave up to the package providers), there should at least be a warning in the manuals that certain objects, for example data.tables, will be passed by reference when using DESeqDataSetFromMatrix and either must not be changed afterwards or must be explicitly copied by the programmer when constructing the DESeqDataSet (using as.data.frame).

ADD REPLY • link 8.6 years ago henrik.seidel ▴ 40

0

Entering edit mode

This behavior you describe affects all packages which build on top of SummarizedExperiments (and perhaps also ExpressionSet). So in my opinion, if a user were to be warned about the use of data.table here, such a warning would be best coming from upstream in the package that defines the super-class, not written into every package that depends on this.

Then the question comes whether or not SummarizedExperiment should produce a message or warning regarding objects that do not behave in the predicted way. Both DESeq2 and SummarizedExperiment both state in documentation that colData needs to be a DataFrame (or data.frame for DESeq2, which is converted to DataFrame). I see why you were surprised, but I more agree with Steve's points about not being able to know in advance which classes will produce this behavior, and the fact that you are venturing off the recommended path by not providing the constructor function with the class that is requested in the documentation of these packages.

A side question: why are you storing the colData with data.table, which is designed for fast operations on tables that take up many gigabytes?

ADD REPLY • link 8.6 years ago Michael Love 43k

0

Entering edit mode

It's not uncommon to just use data.tables "through and through" ie. if you're using them anyway, one often uses them for small and large data.frame-like object alike. The programmer just needs to be careful when passing these objects across "package boundaries" is all ...

ADD REPLY • link 8.6 years ago Steve Lianoglou ★ 13k

0

Entering edit mode

It's not that the DESeqDataSet is maintaining a reference to a variable in the global environment, it really is just the "behind the scenes" behavior the set* operations in the data.table package are using to manipulate their object by reference (for speed/memory efficiency)

In the ideal case, R will only copy an object when necessary, to mitigate excessive memory use and such. When R objects (data.frames, whatever) are manipulated "in the usual way", other R objects (like the colData of your DESeqDataSet) will be notified and then copy its version of the object so it doesn't change.

Let's look at a simple example of changing column names using data.table's setnames vs. "the normal" names(dt) <- c(...). The as(..., 'DataFrame') object will wait to make a copy of its contents until its necessary, while the as(..., 'data.frame') will just make the copy up front.

library(IRanges)
library(data.table)

dt <- data.table(id = c("B", "C", "A"), val = c(2, 3, 1))
df1 <- as(dt, "DataFrame")
df2 <- as(dt, "data.frame")

setnames(dt, c('a', 'b'))
dt
##    a b
## 1: B 2
## 2: C 3
## 3: A 1

df1
## DataFrame with 3 rows and 2 columns
##             a         b
##   <character> <numeric>
## 1           B         2
## 2           C         3
## 3           A         1

df2
##    id val
## 1|  B   2
## 2|  C   3
## 3|  A   1

Now when you use the "normal" names()<- function, the df1 DataFrame will realize you're changing the colnames of dt and ensure that its "local" version is kosher (ie. its names won't change (again)).

names(dt) <- c('one', 'two')

dt
##    one two
## 1:   B   2
## 2:   C   3
## 3:   A   1

df1
## DataFrame with 3 rows and 2 columns
##             a         b
##   <character> <numeric>
## 1           B         2
## 2           C         3
## 3           A         1

df2
##    id val
## 1|  B   2
## 2|  C   3
## 3|  A   1

Note that this example will only work with the development version of data.table, because there was actually a bug in its "names<-.data.table" function which didn't "alert" the appropriate authorities that the column names of its object were changing ...

So, what did we learn? While data.table is an incredibly useful/powerful library that provides many useful gains, you still need to be aware of some of its idiosyncratic behavior. The authors of data.table do their best to ensure that the data.table looks like a data.frame as much as possible when these objects are passed over "package" boundaries, but it's incredibly difficult to get right.

So, when you are doing the "handing off" of these object to other packages, it's probably always going to be a good idea to call as.data.frame on it first. This will make a deep copy of the object, so not always ideal, but it keeps things "safe" and can avoid surprising behavior.

ADD REPLY • link 8.6 years ago Steve Lianoglou ★ 13k

0

Entering edit mode

Okay, I added SummarizedExperiment as a tag.

ADD REPLY • link 8.6 years ago henrik.seidel ▴ 40

score 0 · Answer 2 · 2016-09-15

0

Entering edit mode

Steve Lianoglou ★ 13k

@steve-lianoglou-2771

Last seen 6 days ago

United States

The various set* operations over a data.table work by reference, so when you setkey your sampleInfo data.table, it is likely hammering the same one you used as the colData in your DESeqDataSet.

In your call to DESeqDataSet, you could just do colData=as.data.frame(sampleInfo), which will copy the object and set it to a data.frame, which should do the trick for you.

ADD COMMENT • link 8.6 years ago Steve Lianoglou ★ 13k

0

Entering edit mode

Thank you, Steve. And yes, this is what I already suggested as a workaround in my original posting (although I suggested to use "data.frame()" instead of "as.data.frame"). I know that the various set* operations work by reference. The surprising point for me was that DESeqDataSetFromMatrix stores a reference to the data.table instead of creating a copy.

ADD REPLY • link 8.6 years ago henrik.seidel ▴ 40

score 0 · Answer 3 · 2016-09-19

0

Entering edit mode

Martin Morgan 25k

@martin-morgan-1513

Last seen 11 weeks ago

United States

This is fixed in the 'devel' version of S4Vectors, version 0.11.15

------------------------------------------------------------------------
r121022 | hpages@fhcrc.org | 2016-09-16 18:10:10 -0400 (Fri, 16 Sep 2016) | 1 line

add coercion method from data.table to DataFrame
------------------------------------------------------------------------