Question

DESeq2: handling NA values in colData

0

Entering edit mode

t.kuilman ▴ 170

@tkuilman-6868

Last seen 3.5 years ago

Netherlands

In one of our studies we have performed RNAsequencing on a number of samples. I would like to compare to some of these in separate analyses using DESeq2, and the following design of the colData works fine for me:

> sample <- factor(c("pos", "neg", "pos", "neg", NA, NA, NA, NA),
                   levels = c("neg", "pos", NA))
> table.metadata <- data.frame(sample)
> rownames(table.metadata) <- colnames(table)

> table.metadata
             sample
1            pos
2            neg
3            pos
4            neg
5            <NA>
6            <NA>
7            <NA>
8            <NA>

> dds <- DESeqDataSetFromMatrix(countData = table, colData = table.metadata,
                                design = ~ 0 + sample)
> dds <- dds[, dds$sample %in% c("neg", "pos", "t0") ]

Basically, by setting sample to NA and filtering them out I disregard them for analysis. However, if I perform the same analysis with the following colData

> sample <- factor(c(NA, NA, NA, NA, "pos", "neg", "pos", "neg"),
                   levels = c("neg", "pos", NA))
> table.metadata <- data.frame(sample)
> rownames(table.metadata) <- colnames(table)

> table.metadata
             sample
1            <NA>
2            <NA>
3            <NA>
4            <NA>
5            pos
6            neg
7            pos
8            neg

and I run DESeqDataSetFromMatrix, I get the following error message

> dds <- DESeqDataSetFromMatrix(countData = table, colData = table.metadata,
+                               design = ~ 0 + sample)
Error in if (all(var == var[1])) { : 
  missing value where TRUE/FALSE needed

I guess there is some sort of check for the data type that goes awry because of the NA-value in the first row of colData. I realise this can be solved by removing data from the countData and colData altogether, but I find myself making 'sub-comparisons' very frequently, and for convenience and for keeping tidy code it would be great if this issue could be addressed.

deseq2 coldata NA • 4.9k views

ADD COMMENT • link updated 16 months ago by Michael Love 43k • written 7.4 years ago by t.kuilman ▴ 170

Michael Love · Answer 1 · 2018-08-09

0

Entering edit mode

Michael Love 43k

@mikelove

Last seen 3 days ago

United States

There is a missing/broken check, which I'll have to fix. Basically, DESeqDataSet constructors shouldn't allow NA's in design variables at all.

You should instead give these samples a level other than NA, even if it's "missing".

ADD COMMENT • link 7.4 years ago Michael Love 43k

0

Entering edit mode

I've fixed this now in devel, that it will give a proper error for any NA in variables in the design.

ADD REPLY • link 7.3 years ago Michael Love 43k

0

Entering edit mode

Thanks for clarifying that. You mention in your previous mail that one should not use NA in colData, and use something like ‘missing’ instead; is there a specific reason for that? Best, Thomas

ADD REPLY • link updated 7.3 years ago by Michael Love 43k • written 7.3 years ago by t.kuilman ▴ 170

0

Entering edit mode

Just because it will cause trouble if you ever use those samples. There is no missing data allowed in the design matrix in the methods. Putting an actual level, even “missing” will make it clear it’s being treated as a group if someone were to not properly subset, because it would be listed in the coefficients.

ADD REPLY • link 7.3 years ago Michael Love 43k

0

Entering edit mode

If I understand correctly, you are saying I could do something like:

colData %>% mutate(
    somefactor = 
        ifelseis.na(somefactor), "MISSING", somefactor) %>%
        factor(levels=('ref', 'interesting', 'MISSING')
  )

and then pull out the contrast somefactor_interesting_vs_ref.

This will be ok?

ADD REPLY • link 6.3 years ago ariel ▴ 20

0

Entering edit mode

I don't think it's a great idea to treat the data with missing levels as a group similar to how you treat control and treated.

If you need to have a placeholder you can put "missing" but before running DESeq() I would remove them.

ADD REPLY • link 6.3 years ago Michael Love 43k

0

Entering edit mode

Hi Michael, Would you please explain why it would not be a good idea to treat the data with missing levels as a group?

I have 1800 samples with many batches and other technical variables I want to control for. Only 400 of these samples are of interest because they are at the time of birth of our subjects (while the remaining 1400 samples are from 2 later time points or from some other individuals who we are not interested in right now). Within the 400 samples, there are 150 in the disease group and 250 in the control group, and I want to compare disease vs control. I code the 1400 other samples as the "other" group.

To summarize, here are the groups (and number of samples in parentheses): Disease (150), Control (250), Other (1400)

The purpose of having the "other" group is that including these 1400 samples adds a lot more information about batch effects and the effects of the technical variables (since all 1800 samples were sequenced together, with the different time points randomized across batches). Indeed, I find 300 differentially expressed genes using a contrast of "disease" vs "control" when I use a counts matrix containing all 1800 samples, but only 30 differentially expressed genes for "disease" vs "control" if I only use a counts matrix containing just the 150 disease and 250 control samples.

Is there anything wrong with using all 1800 samples? Thanks for your help!

ADD REPLY • link 16 months ago Gabriel • 0

0

Entering edit mode

Forget about the DESeq2 context, I asked chatgpt "what's wrong with treating missing values as another group and doing an ANOVA while ignoring the nature of missingness" and got some good reasons why you wouldn't want to treat missing as another assigned treatment. I'd recommend working with a local statistician on alternative ways to model missingness.

"Treating missing values as just another group in an ANOVA, without considering the nature of the missingness, can lead to serious issues in the analysis: Invalid Assumptions:..."

ADD REPLY • link 16 months ago Michael Love 43k