The editor has been updated to markdown! Please see more info at: Tutorial: Updated Support Site Editor

Question: Filtering for genes >100 counts after LRT still showing significant genes with genes with 0 counts!
gravatar for A
12 months ago by
A40 wrote:

Hi all, 

Just wondering if somebody can chime in on a funny problem I am having. I am filtering genes that have a minimum count of 100 to show after an LRT test to look for changes over time. This succesfully removes low count genes in the summary(results), however, when I make a count matrix of significantly expressed genes based on dds that has the filter applied (100 counts), my count matrix contains genes that have counts of 0 and above.. not a minimum of 100 counts! Please see code attached. Any input would be helpful! Many thanks.


ddsWT <- DESeq(ddsWT, test = "LRT", reduced = ~1)
nrow(ddsWT).... 20337 (filter successfully applied)
summary(results(ddsWT))... Low counts removed

Count matrix:

sig_res_LRTWT <- subset(res_LRTWT, padj < padj.cutoff)
sigLRTWT <- rownames(sig_res_LRTWT)
length(sigLRT_genesdeannaWT).... Lower gene number than not applying filter of minimum genes!

ordered_sig_res_LRTWT <- sigLRTWT[order(sigLRTWT$padj), ]
clustering_sig_genesWT <- data.frame(ordered_sig_res_LRTWT[1:14938,])
cluster_rlogWT <- rld_WT[rownames(clustering_sig_genesWT),].... dim(cluster_rlogWT).. 14938 genes

Output (part of)

ENSMUSG00000029814      6     14      7    19      0      0      2      0      2
ENSMUSG00000091955    636    671    648   728    157     61     93    163    103
ENSMUSG00000026556   1052    923   1099  1134    449    293    469    501    443
ENSMUSG00000074968    274    171    256   243    733    565   1022    931    919
ENSMUSG00000039735   3590   3662   2995  3740    673    289    490    553    363
ENSMUSG00000022096   3359   1385   2704  1853   4420   2462   4135   3988   2898
ENSMUSG00000022610    231    126    216   168   1516    819   1407   1109    846
ENSMUSG00000019194   2684   3473   2219  2532   8965   6534   8271   9671   6609
ENSMUSG00000025950   1601   1609   1933  1659    517    295    473    520    457
ENSMUSG00000024426   1817   1453   1618  1446    850    552    746    869    618
ENSMUSG00000020889   1223   1095   1317  1229   1896   1231   2348   2622   2006
ENSMUSG00000031934    430    474    495   566     72     31     66     63     62
ENSMUSG00000031073      0      4      1     2      0      1      0      0      0
ENSMUSG00000028832   5384   3770   5434  5127   2812   1145   1759   2351   2152

There are many lines like the ones highlighted in bold, but I have filtered for genes of min 100 counts... Any input would be much appreciated. Many thanks!



deseq2 lrt cpm lowcountgenes • 278 views
ADD COMMENTlink modified 12 months ago • written 12 months ago by A40

Thank you very much!! And yes, I also carried out the following code before running DESeq, where I also add conditions for how many samples should contain this minimum CPM count

filter <- rowSums(nc >= 100) >= 2

Thanks again!

ADD REPLYlink written 12 months ago by A40

It's up to you, but I typically use a much lower filter if I do any filtering, e.g. normalized count of 10.

ADD REPLYlink written 12 months ago by Michael Love21k

yep! I apologise, actually that was arbitrary... And I am looking at 1-10 as the filter. The reason I went up to 100 is because when I was filtering any count above 0 (i.e. 10), I was seeing so many 0's in the count matrix after the LRT, so increased it to 100 just to see what happens when I have a higher filter.


One further question actually (I did ask this in another post but just wanted to clarify).. If i want to see how genes change in time only, I have  question about which design to use as I am unsure. To look for differences in time only.. Is there a difference to having a full design of design =~Genotype + Time +Genotype:Time

or a full design of ~Genotype+Time

and then reduced of ~Genotype

Does including the interactor of Genotype+Time actually make a difference if i only want to look at differences in time only? If so how and why are the two full designs different apart from the interactor

ADD REPLYlink written 12 months ago by A40

The easiest way to find genes that change in time only would be to specify full=~genotype + time and reduced=~genotype. You can perform test="LRT" to obtain a gene list.

ADD REPLYlink written 12 months ago by Michael Love21k

Many thanks!! Just out of interest though, just so I understand the full design itself, would I get the same list of DE genes if my full design is ~Genotype + Time +Genotype:Time and reduced genotype

as full design of ~Genotype + Time and reduced genotype?


ADD REPLYlink written 12 months ago by A40

No, the first set of designs you have allows for genotypic-specific time changes, whereas the second set of designs has a single time profile with only shifts due to genotype at baseline. The list will definitely be different, although some overlap.

You could get a gene in the significant list for the first set of designs which only changes over time for one genotype. Whereas for the second set of designs, the genes will be more often showing changes over time for all genotypes.

This kind of an analysis choice is up to you.

ADD REPLYlink written 12 months ago by Michael Love21k

Amazing! Thank you so much for making that crystal clear! 


ADD REPLYlink written 12 months ago by A40
Answer: Filtering for genes >100 counts after LRT still showing significant genes with g
gravatar for Michael Love
12 months ago by
Michael Love21k
United States
Michael Love21k wrote:

This step above needs a comma so you filter by rows.

ddsWT <- ddsWT[ rowSums(counts(ddsWT))>100 ]

you should use code that looks like:

dds <- dds[ keep, ]

Also, to save time you can filter before running DESeq().

ADD COMMENTlink written 12 months ago by Michael Love21k
Please log in to add an answer.


Use of this site constitutes acceptance of our User Agreement and Privacy Policy.
Powered by Biostar version 16.09
Traffic: 248 users visited in the last hour