Question

deseq2 filter the low counts

3

aristotele_m ▴ 40

@aristotele_m-6821

Last seen 8.8 years ago

Italy

Dear all,

How can I filter out genes with low counts in DESeq2? Any suggestions on how to do this?

dds <- DESeq(dt)

count <- counts(dds, normalized=TRUE)

filter <- rowSums(count) > 10

thanks so much!!

deseq2 lowcount • 65k views

ADD COMMENT • link updated 3 months ago by Michael Love 43k • written 11.0 years ago by aristotele_m ▴ 40

score 19 · Answer 1 · 2015-02-27

19

Michael Love 43k

@mikelove

Last seen 2 days ago

United States

hi,

If you want to filter, you can do so before running DESeq:

dds <- estimateSizeFactors(dds)
idx <- rowSums( counts(dds, normalized=TRUE) >= 5 ) >= 3

This says, for example, to keep genes where at least 3 samples have normalized counts greater than or equal to 5; genes with fewer than 3 such samples are filtered out.

then:

dds <- dds[idx,]
dds <- DESeq(dds)

However, you typically don't need to pre-filter because independent filtering occurs within results() to save you from multiple test correction on genes with no power (see ?results and the vignette section about independent filtering, or the paper). The main reason to pre-filter would be to increase speed. Designs with many samples and many interaction terms can slow down on genes which have very few reads.
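
To illustrate why dropping low-count genes before multiple-testing correction can help, here is a minimal base-R sketch. It is not DESeq2's actual independent filtering procedure (which chooses its threshold automatically inside results()); the values of base_mean and pvalue and the threshold of 10 are made up for illustration.

```r
# Toy values, invented for illustration only (not from a real dataset).
base_mean <- c(0.5, 1, 2, 3, 150, 200, 300, 500)
pvalue    <- c(0.60, 0.80, 0.50, 0.90, 0.001, 0.012, 0.020, 0.70)

# Benjamini-Hochberg adjustment over all 8 genes.
padj_all <- p.adjust(pvalue, method = "BH")

# Drop low-mean genes first (the threshold of 10 is an arbitrary assumption),
# then adjust only the p-values of the genes that pass.
pass <- base_mean >= 10
padj_filtered <- rep(NA_real_, length(pvalue))
padj_filtered[pass] <- p.adjust(pvalue[pass], method = "BH")

# Number of genes called at a 5% FDR, without and with the filter.
n_called_all      <- sum(padj_all < 0.05)
n_called_filtered <- sum(padj_filtered < 0.05, na.rm = TRUE)
```

In this toy example the low-mean genes have no small p-values, so removing them only shrinks the number of tests, and the adjusted p-values of the retained genes get smaller.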

ADD COMMENT • link 11.0 years ago • updated 9.8 years ago Michael Love 43k

0

Does it mean that DESeq2 has no problem estimating dispersions for low expressed features, as opposed to voom?

ADD REPLY • link 10.3 years ago Nik Tuzov ▴ 90

0

In our DESeq2 paper we discuss a case where estimation of dispersion is difficult for genes with very, very low average counts. See the Methods. However, it doesn't really affect the outcome because these genes have almost no power for detecting differential expression.

ADD REPLY • link 10.3 years ago • updated 9.8 years ago Michael Love 43k

0

Sorry, I have run into a point of confusion; please correct me if I am wrong. I noticed that in differential expression analysis with DESeq2, the read-count distribution of the differentially expressed genes is skewed toward more highly expressed genes. In other words, DESeq2 seems to have a threshold for ignoring very lowly expressed genes before the differential expression analysis. I had expected genes with very low read counts or zeros to be the ones driving differential expression, but a box plot shows that the DE genes are among the genes with higher read counts.

ADD REPLY • link 7.8 years ago AZ ▴ 20

1

More highly expressed genes have higher power for detecting DE.

And yes we do have an internal filter that optimizes this.

Like all important aspects of the method it is discussed in the paper and in the vignette.

ADD REPLY • link 7.8 years ago Michael Love 43k

0

Thank you, you are right.

ADD REPLY • link 7.8 years ago AZ ▴ 20

0

Sorry, by using these lines

dds <- DESeq(dds, minReplicatesForReplace=Inf)
res <- results(dds, cooksCutoff=FALSE, independentFiltering=FALSE)

will I prevent DESeq2's internal filtering from removing any genes from the differential expression analysis? Am I right?

ADD REPLY • link 7.8 years ago AZ ▴ 20

1

Yes this will turn off independent filtering (on the mean of counts), as well as outlier replacement and outlier-based gene filtering.

ADD REPLY • link 7.8 years ago Michael Love 43k

0

Hi Michael, thanks for your posts - they are really helpful! I was wondering though, isn't there any issue with using `estimateSizeFactors(dds)` twice? Because the DESeq function is going to do this again, no?

ADD REPLY • link 7.3 years ago rodrigo.duarte88 ▴ 40

1

DESeq() does not re-estimate size factors that are already present. It also prints a message saying so when you run it.

ADD REPLY • link 7.3 years ago Michael Love 43k

0

Hi Michael, I would like a clarification of your explanation of this snippet of code:

idx <- rowSums( counts(dds, normalized=TRUE) >= 5 ) >= 3

You said that this filters out genes where fewer than 3 samples have normalized counts greater than or equal to 5. But isn't it the opposite, since we have the ">=" symbol? So it filters out genes with 3 or more such samples, right?

Hope to hear from you regarding this.

ADD REPLY • link 5.0 years ago mrinal • 0

2

Better to call it keep:

keep <- rowSums( counts(dds) >= X ) >= Y
dds <- dds[keep,]

This requires genes to have Y or more samples with counts of X or more. It therefore filters out genes that have fewer than Y samples with counts of X or more.
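
A small base-R sketch of what that filter does, using a plain matrix in place of counts(dds). The matrix values and the choices X = 10 and Y = 3 are made up for illustration.

```r
# Toy count matrix: 4 genes (rows) x 6 samples (columns); values are invented.
cts <- matrix(
  c( 0,  1,  0,  2,  0,  1,
    12, 15,  9, 30, 11,  0,
     0,  0, 50,  0,  0,  0,
    10, 10, 10,  3,  2,  1),
  nrow = 4, byrow = TRUE,
  dimnames = list(paste0("gene", 1:4), paste0("s", 1:6))
)

X <- 10  # minimum count per sample (illustrative choice)
Y <- 3   # minimum number of samples reaching X (illustrative choice)

# cts >= X is a logical matrix; rowSums counts the TRUE entries per gene.
n_samples_ok <- rowSums(cts >= X)

# TRUE marks the genes that are kept, FALSE the genes that are filtered out.
keep <- n_samples_ok >= Y
cts_filtered <- cts[keep, , drop = FALSE]
```

Here gene2 and gene4 are kept; gene3 is dropped even though it has one large count, because only one sample reaches X.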

ADD REPLY • link 5.0 years ago Michael Love 43k

0

Hi Michael,

How should I choose the X and Y values here? What should I check in my data to decide where to put the cut-off?

Looking forward to hearing from you. thanks!

ADD REPLY • link 2.7 years ago Sara • 0

2

X: a good value is 10.

Y: choose something like the smallest group sample size.

ADD REPLY • link 2.7 years ago Michael Love 43k

0

Thank you for your comment. Just to make sure I understand the smallest group sample size: if I have 20 individuals as cases and 24 individuals as controls, do I take 20? (Is it related to the number of cases, or is it something else?)

keep <- rowSums( counts(dds) >= 10 ) >= 20

Am I right? thanks! looking forward to hearing from you

ADD REPLY • link 2.7 years ago Sara • 0

0

Yes, 20 (smallest sample size of the groups).
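
As a sketch of that rule in base R: the group sizes (20 cases, 24 controls) and X = 10 come from the posts above, while the condition vector and the three toy genes are made up for illustration. Note that the group labels are used only to get the smallest group size; the filter itself looks at all 44 samples at once.

```r
# Hypothetical sample annotation: 20 cases and 24 controls.
condition <- factor(c(rep("case", 20), rep("control", 24)))

# Y = size of the smallest group.
Y <- min(table(condition))

# Toy counts for three genes across the 44 samples (invented values).
cts <- rbind(
  geneA = rep(15, 44),                 # count >= 10 in every sample
  geneB = c(rep(12, 19), rep(0, 25)),  # count >= 10 in only 19 samples
  geneC = c(rep(0, 24), rep(11, 20))   # count >= 10 in exactly 20 samples
)

# Keep genes with a count of at least 10 in at least Y samples overall.
keep <- rowSums(cts >= 10) >= Y
```

geneB falls one sample short and is removed; geneA and geneC are kept.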

ADD REPLY • link 2.7 years ago Michael Love 43k

0

Hi Michael Love ,

What if there are control and disease groups, do you have any comments on filtering based on groups? For example, the X and Y cutoffs are applied to each group instead of all samples?

Thanks.

ADD REPLY • link 2.1 years ago Xiao • 0

score 0 · Answer 2 · 2016-09-29

0

pkachroo ▴ 10

@pkachroo-11576

Last seen 5.3 years ago

I have a similar question: In an experiment with 5 strains in triplicates, I have a gene with the following normalized counts:

Normalized counts per replicate:

Strain-1: 0, 0, 0
Strain-2: 0, 0, 0
Strain-3: 1.6, 1.3, 0
Strain-4: 0, 0, 2.6
Strain-5: 105, 102, 101

After running DESeq2, this gene is flagged and given "NA" for pvalue and adjusted p-value, which makes sense. However, when I rerun the analysis with only the first two replicates per strain and compare strain 5 with strain 4, this gene comes up as differentially expressed: baseMean=9.5 and log2FoldChange=3.3. Why is this gene not being flagged? And, more importantly, how is DESeq2 able to compute a fold change when the normalized counts for this gene in strain 4 are zeros?

Appreciate your help.

Priyanka Kachroo

ADD COMMENT • link 9.4 years ago pkachroo ▴ 10

1

The question about calculating fold changes when strain 4 has zeros has been answered on the site a few times but it's difficult to find the post. The short answer is that the DESeq2 statistical model (see paper) uses a prior distribution on the fold changes, and returns posterior estimates. So the posterior is a balance of the likelihood (which would give an infinite fold change) and the prior which is calculated based on the range of fold changes from the most DE genes.

Regarding the NA's:

If you read in the help page ?results about NA values in the pvalue column:

By default, results assigns a p-value of NA to genes containing count outliers, as identified using Cook’s distance. See the cooksCutoff argument for control of this behavior.

Then if you read more:

cooksCutoff - threshold on Cook’s distance, such that if one or more samples for a row have a distance higher, the p-value for the row is set to NA. The default cutoff is the .99 quantile of the F(p, m-p) distribution, where p is the number of coefficients being fitted and m is the number of samples. Set to Inf or FALSE to disable the resetting of p-values to NA. Note: this test excludes the Cook’s distance of samples belonging to experimental groups with only 2 samples.
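
As a worked example of the default cutoff quoted above, applied to the experiment in the question: this assumes a design with one coefficient per strain level (p = 5, i.e. an intercept plus four strain effects; the actual number depends on the model matrix that was fitted) and m = 15 samples.

```r
p <- 5    # assumed number of fitted coefficients (5 strains)
m <- 15   # 5 strains x 3 replicates

# Default threshold: the 0.99 quantile of the F(p, m - p) distribution.
cooks_cutoff <- qf(0.99, df1 = p, df2 = m - p)
cooks_cutoff
```

In the rerun with two replicates per strain, every group has only 2 samples, so by the note quoted above their Cook's distances are excluded from this test. That would be consistent with the gene no longer being set to NA.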

ADD REPLY • link 9.4 years ago Michael Love 43k

0

Hi Michael, I have another question on a similar note.

In my project, I'm comparing 10 AML samples with a TP53 mutation against 6 AML samples with wild-type TP53, so there's a little sample imbalance. I would like to filter out genes that have counts of 5 or more in fewer than half of the samples in a group. Since there are two groups, TP53 MUT and TP53 WT, there would be two filters:

keep1 <- rowSums( counts(dds1) >= 5 ) >= 5    # Where dds1 is a deseq dataset only with the TP53 MUT samples (5 is half of my N of 10) 
keep2 <- rowSums( counts(dds2) >= 5 ) >= 3    # Where dds2 is a deseq dataset only with the TP53 WT samples (3 is half of my N of 6)

Is it possible to perform such filtering, with one sample threshold per group? If so, what would the code for manipulating my dds (which so far contains all genes and all samples) look like? Would I break it into 2 dds datasets (one per group), apply the two filters, merge the 2 dds datasets into a final one, remove any duplicates, and then run DESeq2? Or is there a way of applying this filter with one threshold per group without needing to create a dds per group?

Thank you so much in advance,

ADD REPLY • link 2.4 years ago daiane.hemerichbrennan • 0

0

You can't filter the groups separately; you need to use a non-specific filter that doesn't make use of information about which samples belong to which group in the design.

Hence we say to use the smallest group size in the count filter. And then you would filter the whole matrix without making use of the sample design at all.

For more on why you can't use the sample information see this paper:

https://www.pnas.org/doi/10.1073/pnas.0914005107

ADD REPLY • link 2.4 years ago Michael Love 43k

0

Hi Michael,

My matrix consists of 36 samples from 6 treatments (6 samples per treatment). Whether I adjust the pre-filtering parameters or not, in my comparisons I am still getting some differentially expressed genes that only have abundance in 1 out of 12 samples.

The FAQ in the DESeq2 vignette recommends running samples from all groups together, and I understand the bias that comes from requiring expression of X in at least Y samples of group Z. But I don't understand why requiring expression of X in at least Y samples of the Z vs. A comparison is incorrect. If I were running each comparison individually, this would be correct.

For each comparison, is there a way, either pre- or post stats, to filter out genes with 'X in at least Y samples' or maybe display only genes with 'X in at least Y samples'?

Many thanks for your time,

Teresa

ADD REPLY • link 4 months ago empty_mt • 0

0

Have you seen this part:

https://bioconductor.org/packages/release/bioc/vignettes/DESeq2/inst/doc/DESeq2.html#pre-filtering

ADD REPLY • link 4 months ago Michael Love 43k

0

Thank you for your suggestion to look at the pre-filtering section of the DESeq2 vignette. Am I missing something?

I have 6 treatments with 6 samples each (total n=36), and 4 comparisons (contrasts) of interest (n=12 per comparison). If I adjust the default pre-filter parameters, the pre-filter works well across the whole sample matrix and reduces the volume going into DESeq2. But within each comparison (contrast), if I trace some DEGs back to the original salmon merged gene counts (or even to normalised counts), I get some DEG results with abundance in only 1 sample out of 12.

How do I fix or work around this? There must be a way to filter within each comparison, either pre-stats (once VST has been done) or post hoc?

Please excuse my basic explanation, but something like:

1. perform VST
2. filter normalised counts using >= X,  >= Y, in comparison 1 (in my case it would be >=1, >=6, in comparison 1)
3. Save to new file
4. in VST data- find and separate results of step 2 (normalised) 
5. separate and save as new file (comparison 1)
6. run stats on this

Or post hoc:

1. search normalised counts for DEGs (DEG results for comparison 1)
2. apply filter to comparison 1 normalised counts (in my case it would be >=1, >=6, in comparison 1) 
3. Remove the filtered genes' results from the DEG table.

My working back to the original merged counts for the first comparison (below in contrasts):

Example Global Pre-filtering:

    // Filtering options
     filtering_min_samples           = 5.0
     filtering_min_abundance         = 1.0

Example Result of pre-filtering reduction:

Input was a matrix of 69389 genes for 36 samples, reduced to 41690 genes after filtering for low abundance.

Example DEG in comparison 1:

    gene_id             baseMean   log2FoldChange  lfcSE        pvalue         padj
    ENSSSAG00000006734  0.3893752  -11.70838       10.69647484  0.00002290416  0.02075814

The DEG VST counts in comparison 1:

    group               D1L     D1L     D1L     D1L     D1L     D1L     D1PAIR  D1PAIR  D1PAIR  D1PAIR  D1PAIR  D1PAIR
    gene_id             H9_1    H9_2    H17_1   H17_2   H1_1    H1_2    H2_1    H2_2    H7_1    H7_2    H15_1   H15_2
    ENSSSAG00000006734  4.77    4.77    4.77    4.77    5.33    4.77    4.77    4.77    4.77    4.77    4.77    4.77

The DEG normalised counts in comparison 1:

    group               D1L     D1L     D1L     D1L     D1L          D1L     D1PAIR  D1PAIR  D1PAIR  D1PAIR  D1PAIR  D1PAIR
    gene_id             H9_1    H9_2    H17_1   H17_2   H1_1         H1_2    H2_1    H2_2    H7_1    H7_2    H15_1   H15_2
    ENSSSAG00000006734  0       0       0       0       3.548172273  0       0       0       0       0       0       0

The DEG in comparison 1 from salmon.merged.gene_counts_length_scaled:

        D1H D1H D2PAIR  D2PAIR  D2H D2H D2L D2L D2H D2H D1PAIR  D1PAIR  D1H D1H D1L D1L D2L D2L D1L D1L D1PAIR  D1PAIR  D1H D1H D2H D2H D2L D2L D2PAIR  D2PAIR  D1PAIR  D1PAIR  D2PAIR  D2PAIR  D1L D1L
    gene_id H10_1   H10_2   H11_1   H11_2   H12_1   H12_2   H13_1   H13_2   H14_1   H14_2   H15_1   H15_2   H16_1   H16_2   H17_1   H17_2   H18_1   H18_2   H1_1    H1_2    H2_1    H2_2    H3_1    H3_2    H4_1    H4_2    H5_1    H5_2    H6_1    H6_2    H7_1    H7_2    H8_1    H8_2    H9_1    H9_2
    ENSSSAG00000006734  0   0   0   0   0   2.014868833 0   1.16021696  0   0   0   0   0   0   0   0   1.099274509 0   3.548172273 0   0   0   0   0   0   0   0   0   0   0   0   0   0   5.050406692 0   0
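
A quick base-R check of the row above against the global pre-filtering settings. This is only a sketch under two assumptions: that the pipeline keeps a gene when at least filtering_min_samples samples have abundance >= filtering_min_abundance, counted over all 36 samples, and that the filter is applied to these length-scaled counts. The non-zero values are copied from the row above; every other sample is 0.

```r
# Sample names in the column order of the table above (H10..H18, then H1..H9).
ids <- c(10:18, 1:9)
samples <- paste0("H", rep(ids, each = 2), "_", 1:2)

# Counts for ENSSSAG00000006734: zero everywhere except five samples.
cts <- setNames(rep(0, length(samples)), samples)
cts[c("H12_2", "H13_2", "H18_1", "H1_1", "H8_2")] <-
  c(2.014868833, 1.16021696, 1.099274509, 3.548172273, 5.050406692)

min_abundance <- 1  # filtering_min_abundance from the settings above
min_samples   <- 5  # filtering_min_samples from the settings above

# Global filter, evaluated over all 36 samples (assumed behaviour).
n_global <- sum(cts >= min_abundance)
passes_global <- n_global >= min_samples

# The same count restricted to the 12 samples of comparison 1.
comparison1 <- c("H9_1", "H9_2", "H17_1", "H17_2", "H1_1", "H1_2",
                 "H2_1", "H2_2", "H7_1", "H7_2", "H15_1", "H15_2")
n_comparison1 <- sum(cts[comparison1] >= min_abundance)
```

Under those assumptions the gene reaches the abundance threshold in exactly 5 of the 36 samples, so it would pass the global filter, while only one of those samples (H1_1) belongs to comparison 1.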

Contrasts file:

     id,variable,reference,target,blocking
    diet_1_low_oxygen_vs_diet_1_pair,comparison,Diet_1_satiation_low,Diet_1_pair_high,comparison;rep
    diet_2_low_oxygen_vs_diet_2_pair,comparison,Diet_2_satiation_low,Diet_2_pair_high,comparison;rep
    diet_1_low_oxygen_vs_diet_2_low_oxygen,comparison,Diet_1_satiation_low,Diet_2_satiation_low,comparison;rep
    diet_1_pair_vs_diet_2_pair,comparison,Diet_1_pair_high,Diet_2_pair_high,comparison;rep

ADD REPLY • link 3 months ago empty_mt • 0

1

But within each comparison (contrast), if I trace some DEGs back to the original salmon merged gene counts (or even to normalised counts), I get some DEG results with abundance in only 1 sample out of 12.

This is what confuses me. If you filter by requiring more than 1 sample to have sufficient abundance, such a gene can't appear in the results.

This is a 10-year-old thread; maybe you can post a new thread with your code and the row that you don't expect to find (show its counts), and I can see why such a gene persists despite a filter that should have removed it.

ADD REPLY • link 3 months ago Michael Love 43k