DESeq2: handling NA values in colData
1
0
Entering edit mode
t.kuilman ▴ 170
@tkuilman-6868
Last seen 2.4 years ago
Netherlands

In one of our studies we have performed RNAsequencing on a number of samples. I would like to compare to some of these in separate analyses using DESeq2, and the following design of the colData works fine for me:

> sample <- factor(c("pos", "neg", "pos", "neg", NA, NA, NA, NA),
                   levels = c("neg", "pos", NA))
> table.metadata <- data.frame(sample)
> rownames(table.metadata) <- colnames(table)

> table.metadata
             sample
1            pos
2            neg
3            pos
4            neg
5            <NA>
6            <NA>
7            <NA>
8            <NA>
> dds <- DESeqDataSetFromMatrix(countData = table, colData = table.metadata,
                                design = ~ 0 + sample)
> dds <- dds[, dds$sample %in% c("neg", "pos", "t0") ]

 

Basically, by setting sample to NA and filtering them out I disregard them for analysis. However, if I perform the same analysis with the following colData

> sample <- factor(c(NA, NA, NA, NA, "pos", "neg", "pos", "neg"),
                   levels = c("neg", "pos", NA))
> table.metadata <- data.frame(sample)
> rownames(table.metadata) <- colnames(table)

> table.metadata
             sample
1            <NA>
2            <NA>
3            <NA>
4            <NA>
5            pos
6            neg
7            pos
8            neg

and I run DESeqDataSetFromMatrix, I get the following error message

> dds <- DESeqDataSetFromMatrix(countData = table, colData = table.metadata,
+                               design = ~ 0 + sample)
Error in if (all(var == var[1])) { : 
  missing value where TRUE/FALSE needed

I guess there is some sort of check for the data type that goes awry because of the NA-value in the first row of colData. I realise this can be solved by removing data from the countData and colData altogether, but I find myself making 'sub-comparisons' very frequently, and for convenience and for keeping tidy code it would be great if this issue could be addressed.

deseq2 coldata NA • 4.1k views
ADD COMMENT
0
Entering edit mode
@mikelove
Last seen 1 day ago
United States

There is a missing/broken check, which I'll have to fix. Basically, DESeqDataSet constructors shouldn't allow NA's in design variables at all.

You should instead give these samples a level other than NA, even if it's "missing".

ADD COMMENT
0
Entering edit mode

I've fixed this now in devel, that it will give a proper error for any NA in variables in the design.

ADD REPLY
0
Entering edit mode
Thanks for clarifying that. You mention in your previous mail that one should not use NA in colData, and use something like ‘missing’ instead; is there a specific reason for that? Best, Thomas
ADD REPLY
0
Entering edit mode

Just because it will cause trouble if you ever use those samples. There is no missing data allowed in the design matrix in the methods. Putting an actual level, even “missing” will make it clear it’s being treated as a group if someone were to not properly subset, because it would be listed in the coefficients.

ADD REPLY
0
Entering edit mode

If I understand correctly, you are saying I could do something like:

colData %>% mutate(
    somefactor = 
        ifelseis.na(somefactor), "MISSING", somefactor) %>%
        factor(levels=('ref', 'interesting', 'MISSING')
  )

and then pull out the contrast somefactor_interesting_vs_ref.

This will be ok?

ADD REPLY
0
Entering edit mode

I don't think it's a great idea to treat the data with missing levels as a group similar to how you treat control and treated.

If you need to have a placeholder you can put "missing" but before running DESeq() I would remove them.

ADD REPLY
0
Entering edit mode

Hi Michael, Would you please explain why it would not be a good idea to treat the data with missing levels as a group?

I have 1800 samples with many batches and other technical variables I want to control for. Only 400 of these samples are of interest because they are at the time of birth of our subjects (while the remaining 1400 samples are from 2 later time points or from some other individuals who we are not interested in right now). Within the 400 samples, there are 150 in the disease group and 250 in the control group, and I want to compare disease vs control. I code the 1400 other samples as the "other" group.

To summarize, here are the groups (and number of samples in parentheses): Disease (150), Control (250), Other (1400)

The purpose of having the "other" group is that including these 1400 samples adds a lot more information about batch effects and the effects of the technical variables (since all 1800 samples were sequenced together, with the different time points randomized across batches). Indeed, I find 300 differentially expressed genes using a contrast of "disease" vs "control" when I use a counts matrix containing all 1800 samples, but only 30 differentially expressed genes for "disease" vs "control" if I only use a counts matrix containing just the 150 disease and 250 control samples.

Is there anything wrong with using all 1800 samples? Thanks for your help!

ADD REPLY
0
Entering edit mode

Forget about the DESeq2 context, I asked chatgpt "what's wrong with treating missing values as another group and doing an ANOVA while ignoring the nature of missingness" and got some good reasons why you wouldn't want to treat missing as another assigned treatment. I'd recommend working with a local statistician on alternative ways to model missingness.

"Treating missing values as just another group in an ANOVA, without considering the nature of the missingness, can lead to serious issues in the analysis: Invalid Assumptions:..."

ADD REPLY

Login before adding your answer.

Traffic: 662 users visited in the last hour
Help About
FAQ
Access RSS
API
Stats

Use of this site constitutes acceptance of our User Agreement and Privacy Policy.

Powered by the version 2.3.6