Removing genes with very low counts in all samples before run DESeq
2
1
Entering edit mode
@alejandrocolaneri-7051
Last seen 6.7 years ago
United States

I'm running DESeq to compare two genotypes responding to a treatment. I turn that was not able to see interactions, at least after apply Benj-Hoch for multiple testing correction. I'm thinking in remove genes that are too lowly expressed with the hope that will increase my power. Now my question is, where is the logical threshold..

I like to do something like

dds <- dds[ rowSums(counts(dds)) > x, ]

which will be a good number for "x", if any?

This will  reduce the number of tests made and therefore improve the multiple testing correction of Benjamin-Hochberg. I hope!

deseq2 statistical power independent filtering • 12k views
1
Entering edit mode
@alejandrocolaneri-7051
Last seen 6.7 years ago
United States

Hi Michael Tx. for the answer. I'm not sure to be able to understand the fundamentals of the statistics supporting filtering in DESeq 2 at this time. I'm a user of this tool more than an expert statistician. However, what about the simple step of removing genes with very low count before run:

dds <- DESeq(dds)

In the RNAseq workflow published by you, you implement this step before running DESeq:

dds <- dds[rowSums(counts(dds)) > 0,]

However I do not see much difference between genes that get "0 reads" in each sample,,or genes that have < 3 reads in each sample.

And frankly comparing genes that got 5, 7 or may be 10 reads still looks wrong to me. It looks to me that there is not need of to much statistic to decide that comparing genes with a maxi of 3 reads per gene make not sense.

Whit that said my question is: can someone (an  hold hand person in comparing transcriptomic data with RNAseq)  give me a reasonable number that I can use  at this time to remove genes without jeopardize my analysis?

What do you thing? I will appreciate your feedback

ALe

0
Entering edit mode

hi Ale,

That subsetting step was used in old documentation, just to clean up the presentation of results. It has no consequence on the results, because these rows with zero counts for all samples, will have all NA's for log fold changes, p-values or adjusted p-values anyway (they are not included in the p-value adjustment step).

I'm not certain if you have read the Independent Filtering section of the vignette (Section 3.8 of the current release). I'll copy it here:

"The results function of the DESeq2 package performs independent filtering by default using the mean of normalized counts as a filter statistic. A threshold on the filter statistic is found which optimizes the number of adjusted p values lower than a significance level alpha (we use the standard variable name for significance level, though it is unrelated to the dispersion parameter α). The theory behind independent filtering is discussed in greater detail in Section 4.6. The adjusted p values for the genes which do not pass the filter threshold are set to NA."

Here's the relevant section of the man page ?results:

"By default, independent filtering is performed to select a set of genes for multiple test correction which will optimize the number of adjusted p-values less than a given critical value alpha (by default 0.1). The adjusted p-values for the genes which do not pass the filter threshold are set to NA. By default, the mean of normalized counts is used to perform this filtering, though other statistics can be provided. Several arguments from the filtered_p function of genefilter are provided here to control or turn off the independent filtering behavior."

Here is the relevant section of the manuscript:

"Due to the large number of tests performed in the analysis of RNA-seq and other genome-wide experiments, the multiple testing problem needs to be addressed. A popular objective is control or estimation of the false discovery rate (FDR). Multiple testing adjustment tends to be associated with a loss of power, in the sense that the false discovery rate for a set of genes is often higher than the individual p-values of these genes. However, the loss can be reduced if genes are omitted from the testing that have little or no chance of being detected as differentially expressed, provided that the criterion for omission is independent of the test statistic under the null hypothesis [21] (see Methods). DESeq2 uses the average expression strength of each gene, across all samples, as its filter criterion, and it omits all genes with mean normalized counts below a filtering threshold from multiple testing adjustment. DESeq2 by default will choose a threshold that maximizes the number of genes found at a user-specified target FDR. In Figures 2A-B and 3, genes found in this way to be significant at an estimated FDR of 10% are depicted in red. "

The genes with small counts are automatically filtered by the results() function to not be included in the p-value adjustment. The filter threshold is chosen to optimize the number of DE genes for a given FDR threshold (this is alpha, an argument to results(), set by default to 0.1).

"I'm thinking in remove genes that are too lowly expressed with the hope that will increase my power....can someone give me a reasonable number that I can use  at this time to remove genes without jeopardize my analysis?"

Yes, a reasonable threshold (given the objective of maximizing the total number of rejections over the entire dataset) is automatically calculated by DESeq2's results() function, using a function from the genefilter Bioconductor package.

0
Entering edit mode
@mikelove
Last seen 17 minutes ago
United States

Hi Alejandro,

See the independent filtering section of the vignette ( vignette("DESeq2") ) and the independent filtering section of the manuscript.