In one of our studies we have performed RNAsequencing on a number of samples. I would like to compare to some of these in separate analyses using DESeq2, and the following design of the colData works fine for me:
> sample <- factor(c("pos", "neg", "pos", "neg", NA, NA, NA, NA), levels = c("neg", "pos", NA)) > table.metadata <- data.frame(sample) > rownames(table.metadata) <- colnames(table) > table.metadata sample 1 pos 2 neg 3 pos 4 neg 5 <NA> 6 <NA> 7 <NA> 8 <NA>
> dds <- DESeqDataSetFromMatrix(countData = table, colData = table.metadata, design = ~ 0 + sample) > dds <- dds[, dds$sample %in% c("neg", "pos", "t0") ]
Basically, by setting sample to NA and filtering them out I disregard them for analysis. However, if I perform the same analysis with the following colData
> sample <- factor(c(NA, NA, NA, NA, "pos", "neg", "pos", "neg"), levels = c("neg", "pos", NA)) > table.metadata <- data.frame(sample) > rownames(table.metadata) <- colnames(table) > table.metadata sample 1 <NA> 2 <NA> 3 <NA> 4 <NA> 5 pos 6 neg 7 pos 8 neg
and I run DESeqDataSetFromMatrix
, I get the following error message
> dds <- DESeqDataSetFromMatrix(countData = table, colData = table.metadata, + design = ~ 0 + sample) Error in if (all(var == var[1])) { : missing value where TRUE/FALSE needed
I guess there is some sort of check for the data type that goes awry because of the NA-value in the first row of colData
. I realise this can be solved by removing data from the countData and colData altogether, but I find myself making 'sub-comparisons' very frequently, and for convenience and for keeping tidy code it would be great if this issue could be addressed.
I've fixed this now in devel, that it will give a proper error for any NA in variables in the design.
Just because it will cause trouble if you ever use those samples. There is no missing data allowed in the design matrix in the methods. Putting an actual level, even “missing” will make it clear it’s being treated as a group if someone were to not properly subset, because it would be listed in the coefficients.
If I understand correctly, you are saying I could do something like:
and then pull out the contrast
somefactor_interesting_vs_ref
.This will be ok?
I don't think it's a great idea to treat the data with missing levels as a group similar to how you treat control and treated.
If you need to have a placeholder you can put "missing" but before running
DESeq()
I would remove them.