Missing condition information from colData in DESeq2 gives enormous False Positives?
1
0
Entering edit mode
hs.lansdell ▴ 20
@hslansdell-14246
Last seen 7.1 years ago

Hello!

So, I ran DESeq2 on female RNA-seq data with a dichotomous outcome ('yes', 'no'). I then discovered that some of the samples at the end had no value listed for condition, i.e.:

10XXX yes
59XX no
XX13 no
8XX7  
1XXX9  
96XX  
1XX21  
XXX10  

 

My results for the set were staggering:

out of 20332 with nonzero total read count
adjusted p-value < 0.1
LFC > 0 (up)     : 8004, 39% 
LFC < 0 (down)   : 2280, 11% 
outliers [1]     : 0, 0% 
low counts [2]   : 0, 0% 
(mean count < 0)
[1] see 'cooksCutoff' argument of ?results
[2] see 'independentFiltering' argument of ?results
 
> sum(res$padj < 0.1, na.rm=TRUE)
[1] 10284
> sum(res$padj < 0.05, na.rm=TRUE)
[1] 8302
> sum(res$padj < 0.001, na.rm=TRUE)

So I want to know how DESeq2 would treat those samples. They obviously weren't caught by the sanity check all(rownames(colData)==colnames(data)), so this is clearly my bad, but I would have thought DESeq2 would drop them or count them as NULL or NA. When I run DESeq2 with those samples removed, I get drastically different results (after setting independent filtering to false and selecting a threshold from a screwy-looking rejection plot):

> sum(res$padj < 0.1, na.rm=TRUE)
[1] 4

Thanks! 

deseq2 • 1.2k views
@mikelove
Last seen 4 days ago
United States

I'm not sure what you're showing at the top, or what design you have, or how the results table is generated. Can you update your post?


So the top is just my sample table (colData). I had a few samples at the bottom that had a sample name but no value in the actual column under 'condition'. The results table was just the default call:

res <- results(dds)

The character string "" is considered its own level using factors in R (DESeq2 makes use of the factor variables and the model.matrix function to build design matrices).

It's as if you had "yes", "no", and a third option "missing". The character string "" is alphabetically first, so it becomes the reference level, and you will have coefficients contrasting "no" with "" and "yes" with "".
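To see this outside of DESeq2, here is a minimal base R sketch. The sample table is made up to mirror the one in the question (the data frame name coldata and its contents are assumptions, not the poster's actual data). It only shows how factor() and model.matrix() treat the empty string.

```r
# Hypothetical sample table: three labelled samples, five with an empty condition
coldata <- data.frame(
  condition = factor(c("yes", "no", "no", "", "", "", "", ""))
)

# The empty string is a level of its own and sorts first,
# so it becomes the reference level
levels(coldata$condition)
# [1] ""    "no"  "yes"

# The design matrix therefore has one coefficient per non-reference level,
# each measured against the "" group
design <- model.matrix(~ condition, data = coldata)
colnames(design)
# [1] "(Intercept)"  "conditionno"  "conditionyes"
```

As far as I understand the default, results(dds) with no contrast or name argument reports the last level of the last design variable over the reference level, which here would be "yes" versus "" rather than "yes" versus "no". That would explain the large number of significant genes.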

It would be better for you to actually give these samples a value of "missing" so it's more clear when other people look over your analysis. Then how to deal with these samples is up to you. I might remove these samples from the DESeqDataSet, if you want to compare yes with no:

dds <- dds[,dds$condition %in% c("no","yes")]
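The same two steps (label the empty values, then subset) can be sketched on a plain data frame in base R. The object and sample names below are hypothetical stand-ins for colData(dds), so this is an illustration rather than the exact DESeq2 workflow.

```r
# Hypothetical stand-in for the sample table; row names are made up
coldata <- data.frame(
  condition = factor(c("yes", "no", "no", "", "", "", "", "")),
  row.names = paste0("sample", 1:8)
)

# Give the empty level an explicit name so it is visible to other readers
levels(coldata$condition)[levels(coldata$condition) == ""] <- "missing"
table(coldata$condition)

# Keep only the samples with a real condition
keep <- coldata$condition %in% c("no", "yes")
coldata_sub <- coldata[keep, , drop = FALSE]

# Drop the now-unused "missing" level so it cannot enter the design
coldata_sub$condition <- droplevels(coldata_sub$condition)
levels(coldata_sub$condition)
# [1] "no"  "yes"
```

After subsetting a real DESeqDataSet in the same way, it is probably worth running dds$condition <- droplevels(dds$condition) before calling DESeq(), so the unused level does not stay in the factor.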

I did remove the samples for the actual analysis. I just wanted an explanation for why it drove up the number of differentially expressed genes, which you answered. Thanks very much!
