DESeq2: handling NA values in colData
1
0
Entering edit mode
t.kuilman ▴ 170
@tkuilman-6868
Last seen 22 months ago
Netherlands

In one of our studies we have performed RNAsequencing on a number of samples. I would like to compare to some of these in separate analyses using DESeq2, and the following design of the colData works fine for me:

> sample <- factor(c("pos", "neg", "pos", "neg", NA, NA, NA, NA),
                   levels = c("neg", "pos", NA))
> table.metadata <- data.frame(sample)
> rownames(table.metadata) <- colnames(table)

> table.metadata
             sample
1            pos
2            neg
3            pos
4            neg
5            <NA>
6            <NA>
7            <NA>
8            <NA>
> dds <- DESeqDataSetFromMatrix(countData = table, colData = table.metadata,
                                design = ~ 0 + sample)
> dds <- dds[, dds$sample %in% c("neg", "pos", "t0") ]

 

Basically, by setting sample to NA and filtering them out I disregard them for analysis. However, if I perform the same analysis with the following colData

> sample <- factor(c(NA, NA, NA, NA, "pos", "neg", "pos", "neg"),
                   levels = c("neg", "pos", NA))
> table.metadata <- data.frame(sample)
> rownames(table.metadata) <- colnames(table)

> table.metadata
             sample
1            <NA>
2            <NA>
3            <NA>
4            <NA>
5            pos
6            neg
7            pos
8            neg

and I run DESeqDataSetFromMatrix, I get the following error message

> dds <- DESeqDataSetFromMatrix(countData = table, colData = table.metadata,
+                               design = ~ 0 + sample)
Error in if (all(var == var[1])) { : 
  missing value where TRUE/FALSE needed

I guess there is some sort of check for the data type that goes awry because of the NA-value in the first row of colData. I realise this can be solved by removing data from the countData and colData altogether, but I find myself making 'sub-comparisons' very frequently, and for convenience and for keeping tidy code it would be great if this issue could be addressed.

deseq2 coldata NA • 3.5k views
ADD COMMENT
0
Entering edit mode
@mikelove
Last seen 3 hours ago
United States

There is a missing/broken check, which I'll have to fix. Basically, DESeqDataSet constructors shouldn't allow NA's in design variables at all.

You should instead give these samples a level other than NA, even if it's "missing".

ADD COMMENT
0
Entering edit mode

I've fixed this now in devel, that it will give a proper error for any NA in variables in the design.

ADD REPLY
0
Entering edit mode
Thanks for clarifying that. You mention in your previous mail that one should not use NA in colData, and use something like ‘missing’ instead; is there a specific reason for that? Best, Thomas
ADD REPLY
0
Entering edit mode

Just because it will cause trouble if you ever use those samples. There is no missing data allowed in the design matrix in the methods. Putting an actual level, even “missing” will make it clear it’s being treated as a group if someone were to not properly subset, because it would be listed in the coefficients.

ADD REPLY
0
Entering edit mode

If I understand correctly, you are saying I could do something like:

colData %>% mutate(
    somefactor = 
        ifelseis.na(somefactor), "MISSING", somefactor) %>%
        factor(levels=('ref', 'interesting', 'MISSING')
  )

and then pull out the contrast somefactor_interesting_vs_ref.

This will be ok?

ADD REPLY
0
Entering edit mode

I don't think it's a great idea to treat the data with missing levels as a group similar to how you treat control and treated.

If you need to have a placeholder you can put "missing" but before running DESeq() I would remove them.

ADD REPLY

Login before adding your answer.

Traffic: 843 users visited in the last hour
Help About
FAQ
Access RSS
API
Stats

Use of this site constitutes acceptance of our User Agreement and Privacy Policy.

Powered by the version 2.3.6