Is it reasonable to run DESeq2 on only a subset of transcripts of the original raw count matrix?
Alan ▴ 20
@alan-15011

Hi Michael and community,

As always, thank you for your dedication to DESeq2. I'd like to ask: is it reasonable to run a DESeq2 analysis on only a "subset" of the original raw count matrix? (Would the DESeq2 statistical model still apply?) For instance, if I am only interested in the coding genes of the transcriptome, can I filter out the non-coding genes (rows) from the original count matrix, keep all my samples (columns), and then run DESeq2 to find DE coding genes between sample conditions?

Thank you in advance!

Alan

R version 3.4.3 (2017-11-30), DESeq2_1.18.1 

@mikelove

You can subset to a smaller set of rows; here, with protein-coding genes, I don't see a problem. In general, though, you want to let DESeq() see as many genes as possible, because this helps the dispersion and LFC estimation steps, which construct their priors by looking across all genes. Normalization also requires that not all genes be strongly differentially expressed; otherwise it's not possible to estimate the size factors (the library size correction). By looking at all expressed genes, DESeq() has a good shot at estimating the library size, because in a well-designed experiment not all genes are strongly differentially expressed (and if they are, spike-in controls should have been used).
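To make the normalization point concrete, here is a toy sketch in plain Python of a median-of-ratios size factor calculation, the general idea behind the library size correction. It is an illustration only, not DESeq2's actual code. The function name `size_factors`, the two-sample layout, and all the counts are made up for the example: sample B is sequenced at twice the depth of sample A, six background genes are unchanged, and three "cytokine" genes are truly up 4-fold in B.

```python
import math
from statistics import median

def size_factors(counts):
    """Median-of-ratios size factors (toy version, hypothetical helper).

    counts: list of genes, each a list of raw counts per sample.
    Genes with a zero count in any sample are skipped.
    """
    n_samples = len(counts[0])
    log_ratios = [[] for _ in range(n_samples)]
    for row in counts:
        if min(row) <= 0:
            continue
        # log of the gene's geometric mean across samples
        log_geo_mean = sum(math.log(c) for c in row) / n_samples
        for j, c in enumerate(row):
            log_ratios[j].append(math.log(c) - log_geo_mean)
    # size factor per sample = median ratio to the per-gene geometric mean
    return [math.exp(median(r)) for r in log_ratios]

# Made-up counts: columns are [sample A, sample B]; B has 2x the depth of A.
background = [[100, 200], [200, 400], [50, 100],
              [400, 800], [80, 160], [150, 300]]   # unchanged genes
cytokines = [[10, 80], [20, 160], [30, 240]]       # truly 4-fold up in B

sf_all = size_factors(background + cytokines)  # all genes
sf_subset = size_factors(cytokines)            # only the "interesting" genes

print("depth ratio B/A, all genes:  ", sf_all[1] / sf_all[0])
print("depth ratio B/A, subset only:", sf_subset[1] / sf_subset[0])
```

With all genes, the unchanged majority sets the median, so the estimated depth ratio between B and A matches the true 2-fold difference. With only the three changed genes, the estimated ratio is 8-fold: the real 4-fold change is absorbed into the size factors, and after normalization those genes would look unchanged.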

Hi Michael,

Thank you for your quick reply. I understand that it's essential to give DESeq2 enough information (genes) to allow more accurate estimation of the size factors and dispersions.

 

Can I push my question one step further? In my experiment, I am most interested in comparing the expression of cytokine mRNAs between 2 conditions (trt vs. control), 10 samples each. If I only want to look at DE of cytokine genes (e.g. a non-exhaustive list: https://www.rndsystems.com/products/human-cytokine-array-kit) between the two experimental conditions, this would likely include only tens to maybe a couple of hundred genes, out of the 60,000 or so total genes (coding plus non-coding) that I got from the RNA-Seq raw data.

In this case, which would you recommend?

(1) I could run DESeq2 with the limited numbers of rows (using raw counts), or

(2) I could first use all of the ~60,000 genes as the input count matrix, get the normalized counts, filter for the cytokine genes, and then run DESeq(). Is this possible? (I know this may sound like a totally unreasonable request, but...)

(3) Is there a more appropriate analysis approach you would advise?

 

Thank you very much in advance, Michael. Thank you for your time and patience!

 

Alan

You should just run DESeq2 on all the genes. It’s not a good idea to subset to “interesting” ones because of the two problems I outlined above (priors and the scaling factor).

OK, I see. Thank you very much for the explanation, Michael.

Alan

And another (dummy) question, please: could a possible solution be to generate the DESeq object and, after size factor estimation, filter out all the non-interesting genes? Or could this approach bias the DE results? Thanks for your help (and patience!).

No, it's just generally not a good idea, because there are other estimates across all genes that would be disrupted by subsetting to only a few genes.

OK, thanks for your quick response!
