Question

Prefiltering before PCA and DESeq2 DE analysis

luca.s ▴ 50

@lucas-24386

Italy

Dear Michael, I am new to the field of RNA-seq analysis and I find DESeq2 a great and intuitive tool. I read your RNA-seq workflow on Bioconductor (https://www.bioconductor.org/packages/devel/bioc/vignettes/DESeq2/inst/doc/DESeq2.html) and I have a few naive questions that I would like to ask you. They concern the strategy of filtering out lowly expressed genes before proceeding with the analysis (pre-filtering). In the workflow I see

    keep <- rowSums(counts(dds)) > 1
    keep <- rowSums(counts(dds) >= 10) >= 3

(where 3 is, e.g., the smallest group size), while on the DESeq2 analysis page (http://bioconductor.org/packages/release/bioc/vignettes/DESeq2/inst/doc/DESeq2.html) the "suggestion" is

    keep <- rowSums(counts(dds)) >= 10

So, my newbie questions are whether there is a general recommendation on how to pre-filter (I guess not), and whether setting the threshold at 10 counts (e.g. for a gene in a sample) rests on statistical grounds or is just a matter of common sense. Finally, if I understand correctly, filtering on raw counts does not take into account potential differences in library size. Would it be reasonable to use normalized counts and/or CPM? Any advice on this point?

For completeness, I am working on RNA-seq data from human tumor FFPE tissues and have a limited number of samples (about 30 overall, with the smallest group of 10). Sorry for bothering you with these naive questions, and thank you for your time and patience.

deseq2 • 1.8k views

updated 4.2 years ago by Michael Love 43k • written 4.2 years ago by luca.s ▴ 50

score 0 · Answer 1 · 2020-10-16

DESeq2 does its own filtering for power: the independent filtering step in results() reduces the multiple-testing burden by removing genes that don't have enough counts for detecting differences. So pre-filtering is really mostly for reducing dataset size and running time, as it takes time to fit the model over these genes.
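As a rough illustration of that built-in filtering, here is a minimal sketch on simulated data. The dataset from makeExampleDESeqDataSet and the object names (res, res_nofilt) are assumptions made for illustration only, not part of the original question; real data would come from your own dds object.

```r
library(DESeq2)

# Simulated counts stand in for a real experiment (assumption for illustration):
# 1000 genes, 12 samples split into two conditions of 6.
set.seed(1)
dds <- makeExampleDESeqDataSet(n = 1000, m = 12)
dds <- DESeq(dds)

# results() applies independent filtering by default.
res <- results(dds)

# Same test statistics, but with independent filtering switched off.
res_nofilt <- results(dds, independentFiltering = FALSE)

# Mean-normalized-count threshold chosen by the filtering step.
metadata(res)$filterThreshold

# Genes filtered out for low counts get an adjusted p-value of NA.
sum(is.na(res$padj))
sum(is.na(res_nofilt$padj))
```

Comparing the two NA counts shows how many genes were set aside by the filter rather than by a manual pre-filter.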

If you want to filter on scaled counts, you can use counts(dds, normalized=TRUE) here, so that the lowest and highest sequenced samples are brought in line with the typical samples in the dataset with respect to sequencing depth. You would then need to run estimateSizeFactors first, before the pre-filtering. Often it doesn't make a big difference, because in a typical experiment the range of sequencing depths is not so wide.
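A minimal sketch of pre-filtering on normalized counts, in the order described above. The simulated dataset, the threshold of 10, and the names smallest_group and dds_filtered are assumptions chosen for illustration; adapt them to your own design (for example, a smallest group of 10 in the question's dataset).

```r
library(DESeq2)

# Simulated counts stand in for a real experiment (assumption for illustration):
# 1000 genes, 12 samples split into two conditions of 6.
set.seed(1)
dds <- makeExampleDESeqDataSet(n = 1000, m = 12)

# Size factors must be estimated before normalized counts can be extracted.
dds <- estimateSizeFactors(dds)

# Keep genes with a normalized count of at least 10 in at least
# `smallest_group` samples (hypothetical threshold and group size).
smallest_group <- 6
keep <- rowSums(counts(dds, normalized = TRUE) >= 10) >= smallest_group
dds_filtered <- dds[keep, ]

# Number of genes before and after pre-filtering.
c(before = nrow(dds), after = nrow(dds_filtered))
```

The filtered object can then be passed to DESeq() as usual; the size factors estimated here are carried along in the subset.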