Question

Missing condition information from colData in DESeq2 gives enormous False Positives?

hs.lansdell ▴ 20

@hslansdell-14246

Hello!

So, I ran DESeq2 on female RNA-seq data with a dichotomous outcome ('yes', 'no'). I discovered that some of my samples at the end had no value listed for condition, i.e.:

10XXX	yes
59XX	no
XX13	no
8XX7
1XXX9
96XX
1XX21
XXX10

My results for the set were staggering:

out of 20332 with nonzero total read count

adjusted p-value < 0.1

LFC > 0 (up)     : 8004, 39%

LFC < 0 (down)   : 2280, 11%

outliers [1]     : 0, 0%

low counts [2]   : 0, 0%

(mean count < 0)

[1] see 'cooksCutoff' argument of ?results

[2] see 'independentFiltering' argument of ?results

> sum(res$padj < 0.1, na.rm=TRUE)

[1] 10284

> sum(res$padj < 0.05, na.rm=TRUE)

[1] 8302

> sum(res$padj < 0.001, na.rm=TRUE)

So I want to know how DESeq2 would treat those samples. They obviously weren't caught by the sanity check all(rownames(colData)==colnames(data)), which only compares sample names, so this is clearly my mistake, but I would have thought DESeq2 would have dropped them and treated their condition as NULL or NA. When I run DESeq2 with those samples removed, I get drastically different results (after setting independent filtering to FALSE and selecting a threshold from a screwy-looking rejection plot):

> sum(res$padj < 0.1, na.rm=TRUE)
[1] 4
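On the sanity check mentioned above: comparing names alone cannot detect empty condition values. A minimal sketch of an extra check in base R, assuming colData has a character column named condition; the toy data frame and the sample names s1 to s5 are hypothetical:

```r
# Hypothetical toy colData: two samples have an empty condition string.
colData <- data.frame(
  condition = c("yes", "no", "no", "", ""),
  row.names = c("s1", "s2", "s3", "s4", "s5"),
  stringsAsFactors = FALSE
)

# Flag samples whose condition is NA or an empty/whitespace-only string.
missing_condition <- is.na(colData$condition) | trimws(colData$condition) == ""

# Names of the offending samples, for inspection before building the dataset.
bad_samples <- rownames(colData)[missing_condition]
print(bad_samples)

# Tabulate every value, including empty strings and NA.
print(table(colData$condition, useNA = "ifany"))
```

Running something like this before constructing the DESeqDataSet would list the unlabelled samples, so the decision to relabel or remove them is made explicitly.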

Thanks!

deseq2 • 1.5k views

ADD COMMENT • link updated 8.1 years ago by Michael Love 43k • written 8.1 years ago by hs.lansdell ▴ 20

score 0 · Answer 1 · 2017-12-14

Michael Love 43k

@mikelove

United States

I'm not sure what you're showing at the top, or what design you have, or how the results table is generated. Can you update your post?

ADD COMMENT • link 8.1 years ago Michael Love 43k

So the top is just my count matrix. I had a few samples at the bottom that had a sample name but no value in the actual column under 'condition'. The results table was just the first:

res <- results(dds)

ADD REPLY • link 8.1 years ago hs.lansdell ▴ 20

The character string "" is considered its own level using factors in R (DESeq2 makes use of the factor variables and the model.matrix function to build design matrices).

It's as if you had "yes", "no", and a third option "missing". The character string "" is alphabetically first, so it becomes the reference level and you will have coefficients contrasting "no" with "" and "yes" with "".
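The level ordering described above can be checked in base R without DESeq2. A minimal sketch, using a hypothetical five-sample condition vector:

```r
# Hypothetical condition vector with two empty entries.
condition <- factor(c("yes", "no", "no", "", ""))

# The empty string sorts first and so becomes the reference level.
print(levels(condition))

# Design matrix for ~ condition: an intercept plus one column per
# non-reference level, each compared against the "" group.
mm <- model.matrix(~ condition)
print(colnames(mm))
```

As far as I understand the defaults, results(dds) called without a contrast or name reports the last level of the design variable against the reference level, which here would be "yes" versus "" rather than "yes" versus "no".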

It would be better for you to actually give these samples a value of "missing" so it's more clear when other people look over your analysis. Then how to deal with these samples is up to you. I might remove these samples from the DESeqDataSet, if you want to compare yes with no:

dds <- dds[,dds$condition %in% c("no","yes")]
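One thing to watch after subsetting: the factor keeps the now-unused "" level until it is dropped. A minimal sketch in base R, with a hypothetical plain factor standing in for dds$condition; on a DESeqDataSet the analogous step would presumably be dds$condition <- droplevels(dds$condition):

```r
# Hypothetical condition factor, standing in for dds$condition.
condition <- factor(c("yes", "no", "no", "", ""))

# Keep only the samples labelled "no" or "yes".
keep <- condition %in% c("no", "yes")
subset_condition <- condition[keep]

# The empty level is still present, even though no sample uses it.
print(levels(subset_condition))

# Drop unused levels so that "no" becomes the reference level.
subset_condition <- droplevels(subset_condition)
print(levels(subset_condition))
```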

ADD REPLY • link 8.1 years ago Michael Love 43k

I did remove the samples for the actual analysis. I just wanted an explanation for why it drove up the number of differentially expressed genes, which you answered. Thanks very much!

ADD REPLY • link 8.1 years ago hs.lansdell ▴ 20