I found a very strange behavior of colData. Here it is.
First, I create a data.table
containing information about samples, let's call it sampleInfo
. This table contains (among many others) the columns ngsName
, treatment
, and treatmentDuration
. I use this data.table
for creating a DESeqDataSet
:
```{r prepare_data} dds <- DESeqDataSetFromMatrix( countData = pcCounts, colData = sampleInfo, design = ~ treatment + treatmentDuration + treatment:treatmentDuration) vstDds <- varianceStabilizingTransformation(dds, blind = FALSE) ```
Later in the code I set a key to sampleInfo:
setkey(sampleInfo, ngsName)
This resorts the data.table sampleInfo
. The issue is: from now on the colData(dds)
is wrong - it has the same order of rows as the resorted sampleInfo
table and not the original order of rows which sampleInfo had when dds
was created. So to me this looks as if DESeqDataSetFromMatrix stores a reference to the sampleInfo table and not a copy of sampleInfo. This was very surprising and unexpected for me, it took me quite a while to track this down. I am not sure whether this behavior should be expected. One workaround is, of course, to explicitely generate a copy using something like "colData = data.frame(sampleInfo)".
Any ideas?
> sessionInfo() R version 3.3.0 (2016-05-03) Platform: x86_64-pc-linux-gnu (64-bit) Running under: Red Hat Enterprise Linux Server release 6.5 (Santiago) locale: [1] LC_CTYPE=en_US.UTF-8 LC_NUMERIC=C LC_TIME=en_US.UTF-8 LC_COLLATE=en_US.UTF-8 [5] LC_MONETARY=en_US.UTF-8 LC_MESSAGES=en_US.UTF-8 LC_PAPER=en_US.UTF-8 LC_NAME=en_US.UTF-8 [9] LC_ADDRESS=en_US.UTF-8 LC_TELEPHONE=en_US.UTF-8 LC_MEASUREMENT=en_US.UTF-8 LC_IDENTIFICATION=en_US.UTF-8 attached base packages: [1] grid parallel stats4 stats graphics grDevices utils datasets methods base other attached packages: [1] car_2.1-3 factoextra_1.0.3 cluster_2.0.4 dendextend_1.2.0 [5] gplots_3.0.1 org.Hs.eg.db_3.3.0 AnnotationDbi_1.34.0 ggfortify_0.2.0 [9] gridExtra_2.2.1 RColorBrewer_1.1-2 ggplot2_2.1.0 DESeq2_1.12.0 [13] SummarizedExperiment_1.2.0 Biobase_2.32.0 GenomicRanges_1.24.0 GenomeInfoDb_1.8.0 [17] IRanges_2.6.0 S4Vectors_0.10.0 BiocGenerics_0.18.0 edgeR_3.14.0 [21] limma_3.28.1 xlsx_0.5.7 xlsxjars_0.6.1 rJava_0.9-8 [25] data.table_1.9.6 assertthat_0.1 pander_0.6.0 baybi_1.0.6 loaded via a namespace (and not attached): [1] nlme_3.1-128 bitops_1.0-6 pbkrtest_0.4-6 prabclus_2.2-6 tools_3.3.0 R6_2.1.2 [7] rpart_4.1-10 KernSmooth_2.23-15 Hmisc_3.17-4 DBI_0.4 mgcv_1.8-14 colorspace_1.2-6 [13] trimcluster_0.1-2 nnet_7.3-12 chron_2.3-47 quantreg_5.29 SparseM_1.7 labeling_0.3 [19] diptest_0.75-7 caTools_1.17.1 scales_0.4.0 DEoptimR_1.0-4 mvtnorm_1.0-5 robustbase_0.92-6 [25] genefilter_1.54.0 digest_0.6.9 foreign_0.8-66 minqa_1.2.4 XVector_0.12.0 lme4_1.1-12 [31] RSQLite_1.0.0 mclust_5.2 BiocParallel_1.6.0 gtools_3.5.0 acepack_1.3-3.3 dplyr_0.4.3 [37] magrittr_1.5 modeltools_0.2-21 Formula_1.2-1 Matrix_1.2-6 Rcpp_0.12.4.5 munsell_0.4.3 [43] whisker_0.3-2 MASS_7.3-45 zlibbioc_1.18.0 flexmix_2.3-13 plyr_1.8.3 gdata_2.17.0 [49] ggrepel_0.5 lattice_0.20-33 splines_3.3.0 annotate_1.50.0 locfit_1.5-9.1 knitr_1.12.3 [55] fpc_2.1-10 geneplotter_1.50.0 XML_3.98-1.4 latticeExtra_0.6-28 nloptr_1.0.4 MatrixModels_0.4-1 [61] gtable_0.2.0 tidyr_0.5.1 kernlab_0.9-24 xtable_1.8-2 class_7.3-14 survival_2.39-3 [67] tibble_1.0
Will do so, might might take me a couple of days.
I should add one observation I made. If I save a session image just before the setkey operation, load this session afterwards, and then continue with the rest of the script, everything works as expected. One explanation for this could be that what we have here is "copy on write", i.e. the DESeqDataSet stores a reference to the source table but fails to create a copy if the original sampleInfo table is changed. By saving and reloading the session image, I force a copy, and this causes everything to work fine afterwards. Just a theory, but could be true.
DESeq2: colData issue
DESeq2: colData issue
Henrik Seidel
2016-09-16
This is a short program demonstrating the issue with
colData
in the current implementation ofDESeqDataSetFromMatrix
.Loading libraries:
Prepare data table:
First, we create a data frame in the way as it is done in
DESeqDataSetFromMatrix
:Next, we create a data frame in the ways it should be done:
So let’s see what happens if we set a key in the data table dt:
And what are the consequences for df1, df2, df3 and df4?
So we can see that df1 has now a new order of rows, corresponding to the order of rows of the data table
dt
.The other three data frames keep the original order:
Obviously, a simple
as(..., "DataFrame")
stores a reference to the first argument instead of a copy. I think you should useas(as.data.frame(colData), "DataFrame")
in theDESeqDataSetFromMatrix
function.Isn't this behaving as expected? You construct the SummarizedExperiment with an object that has pass-by-reference semantics, so the SummarizedExperiment also has pass-by-reference semantics? I didn't follow this closely, so could be missing something...
There could be an explicit test for a data.table object, but to be honest I don't think this is a SummarizedExperiment issue since data.table's behavior is quite un-R like in this regard. I'm only saying that because there could be another type of object that acts in this way, too, but it's not clear how one could know that whatever-the-next-type-of-thing is has manipulate by reference semantics?
Perhaps one extreme would be to have SummarizedExperiment (and everything else) just defensively copy everything first, but you wouldn't want to do that, either.
With apologies to the OP, in my opinion, the onus can really only lie on the shoulders of the programmer using these manipulate-by-reference objects to be defensive in their programming when they are working with them and to be cognizant of how their behavior violates R's expected behavior ... especially when handing these objects over to "something else" which will manipulate it as it will.
Well, I am probably not an R expert enough to decide what is un-R like and what is not. For me, at least, it was totally unexpected that DESeqDataSet maintained a reference to a variable in the global environment (the sampleInfo data table) and that changing this global variable later will make the DESeqDataSet invalid. If there are good arguments not to change this behaviour (the decision of which I leave up to the package providers), there should at least be a warning in the manuals that certain objects, for example data.tables, will be passed by reference when using DESeqDataSetFromMatrix and either must not be changed afterwards or must be explicitly copied by the programmer when constructing the DESeqDataSet (using as.data.frame).
This behavior you describe affects all packages which build on top of SummarizedExperiments (and perhaps also ExpressionSet). So in my opinion, if a user were to be warned about the use of data.table here, such a warning would be best coming from upstream in the package that defines the super-class, not written into every package that depends on this.
Then the question comes whether or not SummarizedExperiment should produce a message or warning regarding objects that do not behave in the predicted way. Both DESeq2 and SummarizedExperiment both state in documentation that colData needs to be a DataFrame (or data.frame for DESeq2, which is converted to DataFrame). I see why you were surprised, but I more agree with Steve's points about not being able to know in advance which classes will produce this behavior, and the fact that you are venturing off the recommended path by not providing the constructor function with the class that is requested in the documentation of these packages.
A side question: why are you storing the colData with data.table, which is designed for fast operations on tables that take up many gigabytes?
It's not uncommon to just use data.tables "through and through" ie. if you're using them anyway, one often uses them for small and large data.frame-like object alike. The programmer just needs to be careful when passing these objects across "package boundaries" is all ...
It's not that the
DESeqDataSet
is maintaining a reference to a variable in the global environment, it really is just the "behind the scenes" behavior theset*
operations in the data.table package are using to manipulate their object by reference (for speed/memory efficiency)In the ideal case, R will only copy an object when necessary, to mitigate excessive memory use and such. When R objects (data.frames, whatever) are manipulated "in the usual way", other R objects (like the
colData
of your DESeqDataSet) will be notified and then copy its version of the object so it doesn't change.Let's look at a simple example of changing column names using data.table's
setnames
vs. "the normal"names(dt) <- c(...).
Theas(..., 'DataFrame')
object will wait to make a copy of its contents until its necessary, while theas(..., 'data.frame')
will just make the copy up front.Now when you use the "normal"
names()<-
function, thedf1
DataFrame will realize you're changing the colnames ofdt
and ensure that its "local" version is kosher (ie. its names won't change (again)).
names(dt) <- c('one', 'two') dt ## one two ## 1: B 2 ## 2: C 3 ## 3: A 1 df1 ## DataFrame with 3 rows and 2 columns ## a b ## <character> <numeric> ## 1 B 2 ## 2 C 3 ## 3 A 1 df2 ## id val ## 1| B 2 ## 2| C 3 ## 3| A 1
Note that this example will only work with the development version of data.table, because there was actually a bug in its
"names<-.data.table"
function which didn't "alert" the appropriate authorities that the column names of its object were changing ...So, what did we learn? While data.table is an incredibly useful/powerful library that provides many useful gains, you still need to be aware of some of its idiosyncratic behavior. The authors of data.table do their best to ensure that the data.table looks like a data.frame as much as possible when these objects are passed over "package" boundaries, but it's incredibly difficult to get right.
So, when you are doing the "handing off" of these object to other packages, it's probably always going to be a good idea to call
as.data.frame
on it first. This will make a deep copy of the object, so not always ideal, but it keeps things "safe" and can avoid surprising behavior.Okay, I added SummarizedExperiment as a tag.