Question

Defining subsets for FDR correction

0

Entering edit mode

sven.schenk ▴ 30

@svenschenk-15837

Last seen 6.0 years ago

Vienna

Hello,

I have rather general question concering FDR/adjusted pvalues.

I did an RNASeq experiment (treated vs untreated cells), and analysed the data with DESeq. There I found that I had only very few transcripts passing the FDR cut off at 0.1, which I assumed to be because I have rather low effect sizes (max log2FC ~0.8) and a rather high variation. So I thought to define a subset of "interesing" genes, DESeq e.g. allows one to define a certain log2fc cut off for defining subsets. I however decided to not go for a log2fc cut off, as I thought variation between replicates might be the main issue (and that given my low overall log2fc I potentially might lose too much information) and so I decided to only taking those transcripts into account in which the quotient between SD and log2FC is smaller than 1 (effectively this is a pvalue cut off).

interestingGenes<-subset(input, abs(lfcSE*sqrt(3)/log2FoldChange)<=1.0)

My question(s) are now the following:

How can I now estimate the FDR for this gene set.

a) should I calculate the FDR based on the original pvalues, i.e.:

p <- interestingGenes$pvalue
p.adj <- p.adjust(p, "BH")
interestingGenes$p.adj <- p.adj

b) could I use the interestingGenes set to subset the input table and re-run DESeq? Which would re-estimate all dispersions, and size factors, and effectively be a completely independent analysis.

c) could I use the interestingGenes set to subset the DESeq countdata set? I.e.

cds <- DESeqDataSetFromMatrix(countData=countdata,colData=coldata,design = ~ pheno)
size <- estimateSizeFactors(cds)
disp.par <- estimateDispersions(size, fitType="parametric")
ddsFullCountTable <- subset(disp.par, rownames(size) %in% interestingGenes$X)
ddsFullCountTable <- DESeq(ddsFullCountTable, fitType="parametric", parallel=T)
 #using pre-existing size factors
 #estimating dispersions
 #gene-wise dispersion estimates: 4 workers

This would use the old size factors and re-estimate the dispersions for DESeq and would in contrast to b) not be an entirely different experiment.

Cheers

Sven

fdr deseq2 ihw • 1.9k views

ADD COMMENT • link updated 6.8 years ago by Michael Love 43k • written 6.8 years ago by sven.schenk ▴ 30

0

Entering edit mode

I added a DESeq2 tag, and also IHW, as the IHW folks may have some advice on increasing power at the FDR correction step.

ADD REPLY • link 6.8 years ago Michael Love 43k

0

Entering edit mode

Thanks, let`s see if it helps.

ADD REPLY • link 6.8 years ago sven.schenk ▴ 30

score 0 · Answer 1 · 2018-06-13

0

Entering edit mode

Michael Love 43k

@mikelove

Last seen 7 days ago

United States

You defintiely don’t want to use any kind of filter or “co-data” which is correlated with the test statistic under the null. This will lose type I error control in an extreme way.

The test statistic is the LFC/SE so the LFC should definitely not be used to filter. The mean of counts is independent (besides discreteness at very low counts) and we show this with histograms in the DESeq2 paper.

ADD COMMENT • link 6.8 years ago Michael Love 43k

0

Entering edit mode

ok, I guessed as much.

so that would mean I should generally not filter like this:

interestingGenes<-subset(input, abs(lfcSE*sqrt(3)/log2FoldChange)<=1)

but rather use use something involving baseMean,but how would I find an unbiased way to define a cut off for my subset? Could I use baseMean/SE?

ADD REPLY • link 6.8 years ago sven.schenk ▴ 30

0

Entering edit mode

You can’t use anything correlated with the test statistic under the null. So SE is also not good.

ADD REPLY • link 6.8 years ago Michael Love 43k

0

Entering edit mode

ok, that makes sense so nothing that comes out of DESeq (apart from baseMean) can be used. But I was thinking if baseMean is independent so should be SD(baseMean) this would also most directly account for my varioation issue.

ADD REPLY • link 6.8 years ago sven.schenk ▴ 30