In one of our studies we have performed RNAsequencing on a number of samples. I would like to compare to some of these in separate analyses using DESeq2, and the following design of the colData works fine for me:
> sample <- factor(c("pos", "neg", "pos", "neg", NA, NA, NA, NA), levels = c("neg", "pos", NA)) > table.metadata <- data.frame(sample) > rownames(table.metadata) <- colnames(table) > table.metadata sample 1 pos 2 neg 3 pos 4 neg 5 <NA> 6 <NA> 7 <NA> 8 <NA>
> dds <- DESeqDataSetFromMatrix(countData = table, colData = table.metadata, design = ~ 0 + sample) > dds <- dds[, dds$sample %in% c("neg", "pos", "t0") ]
Basically, by setting sample to NA and filtering them out I disregard them for analysis. However, if I perform the same analysis with the following colData
> sample <- factor(c(NA, NA, NA, NA, "pos", "neg", "pos", "neg"), levels = c("neg", "pos", NA)) > table.metadata <- data.frame(sample) > rownames(table.metadata) <- colnames(table) > table.metadata sample 1 <NA> 2 <NA> 3 <NA> 4 <NA> 5 pos 6 neg 7 pos 8 neg
and I run DESeqDataSetFromMatrix
, I get the following error message
> dds <- DESeqDataSetFromMatrix(countData = table, colData = table.metadata, + design = ~ 0 + sample) Error in if (all(var == var[1])) { : missing value where TRUE/FALSE needed
I guess there is some sort of check for the data type that goes awry because of the NA-value in the first row of colData
. I realise this can be solved by removing data from the countData and colData altogether, but I find myself making 'sub-comparisons' very frequently, and for convenience and for keeping tidy code it would be great if this issue could be addressed.
I've fixed this now in devel, that it will give a proper error for any NA in variables in the design.
Just because it will cause trouble if you ever use those samples. There is no missing data allowed in the design matrix in the methods. Putting an actual level, even “missing” will make it clear it’s being treated as a group if someone were to not properly subset, because it would be listed in the coefficients.
If I understand correctly, you are saying I could do something like:
and then pull out the contrast
somefactor_interesting_vs_ref
.This will be ok?
I don't think it's a great idea to treat the data with missing levels as a group similar to how you treat control and treated.
If you need to have a placeholder you can put "missing" but before running
DESeq()
I would remove them.Hi Michael, Would you please explain why it would not be a good idea to treat the data with missing levels as a group?
I have 1800 samples with many batches and other technical variables I want to control for. Only 400 of these samples are of interest because they are at the time of birth of our subjects (while the remaining 1400 samples are from 2 later time points or from some other individuals who we are not interested in right now). Within the 400 samples, there are 150 in the disease group and 250 in the control group, and I want to compare disease vs control. I code the 1400 other samples as the "other" group.
To summarize, here are the groups (and number of samples in parentheses): Disease (150), Control (250), Other (1400)
The purpose of having the "other" group is that including these 1400 samples adds a lot more information about batch effects and the effects of the technical variables (since all 1800 samples were sequenced together, with the different time points randomized across batches). Indeed, I find 300 differentially expressed genes using a contrast of "disease" vs "control" when I use a counts matrix containing all 1800 samples, but only 30 differentially expressed genes for "disease" vs "control" if I only use a counts matrix containing just the 150 disease and 250 control samples.
Is there anything wrong with using all 1800 samples? Thanks for your help!
Forget about the DESeq2 context, I asked chatgpt "what's wrong with treating missing values as another group and doing an ANOVA while ignoring the nature of missingness" and got some good reasons why you wouldn't want to treat missing as another assigned treatment. I'd recommend working with a local statistician on alternative ways to model missingness.
"Treating missing values as just another group in an ANOVA, without considering the nature of the missingness, can lead to serious issues in the analysis: Invalid Assumptions:..."