Hi all,
Just wondering if somebody can chime in on a funny problem I am having. I am filtering genes that have a minimum count of 100 to show after an LRT test to look for changes over time. This succesfully removes low count genes in the summary(results), however, when I make a count matrix of significantly expressed genes based on dds that has the filter applied (100 counts), my count matrix contains genes that have counts of 0 and above.. not a minimum of 100 counts! Please see code attached. Any input would be helpful! Many thanks.
LRT:
ddsWT <- DESeq(ddsWT, test = "LRT", reduced = ~1)
resultsNames(ddsWT)
results(ddsWT)
ddsWT<-ddsWT[rowSums(counts(ddsWT))>100]
nrow(ddsWT).... 20337 (filter successfully applied)
summary(results(ddsWT))... Low counts removed
Count matrix:
res_LRTWT<-results(ddsWT)
padj.cutoff<-0.05
sig_res_LRTWT <- subset(res_LRTWT, padj < padj.cutoff)
sigLRTWT <- rownames(sig_res_LRTWT)
length(sigLRT_genesdeannaWT).... Lower gene number than not applying filter of minimum genes!
ordered_sig_res_LRTWT <- sigLRTWT[order(sigLRTWT$padj), ]
clustering_sig_genesWT <- data.frame(ordered_sig_res_LRTWT[1:14938,])
rld_WT<-assay(ddsWT)
cluster_rlogWT <- rld_WT[rownames(clustering_sig_genesWT),].... dim(cluster_rlogWT).. 14938 genes
Output (part of)
ENSMUSG00000029814 6 14 7 19 0 0 2 0 2
ENSMUSG00000091955 636 671 648 728 157 61 93 163 103
ENSMUSG00000026556 1052 923 1099 1134 449 293 469 501 443
ENSMUSG00000074968 274 171 256 243 733 565 1022 931 919
ENSMUSG00000039735 3590 3662 2995 3740 673 289 490 553 363
ENSMUSG00000022096 3359 1385 2704 1853 4420 2462 4135 3988 2898
ENSMUSG00000022610 231 126 216 168 1516 819 1407 1109 846
ENSMUSG00000019194 2684 3473 2219 2532 8965 6534 8271 9671 6609
ENSMUSG00000025950 1601 1609 1933 1659 517 295 473 520 457
ENSMUSG00000024426 1817 1453 1618 1446 850 552 746 869 618
ENSMUSG00000020889 1223 1095 1317 1229 1896 1231 2348 2622 2006
ENSMUSG00000031934 430 474 495 566 72 31 66 63 62
ENSMUSG00000031073 0 4 1 2 0 1 0 0 0
ENSMUSG00000028832 5384 3770 5434 5127 2812 1145 1759 2351 2152
There are many lines like the ones highlighted in bold, but I have filtered for genes of min 100 counts... Any input would be much appreciated. Many thanks!
Thank you very much!! And yes, I also carried out the following code before running DESeq, where I also add conditions for how many samples should contain this minimum CPM count
filter <- rowSums(nc >= 100) >= 2
ddsWT<-ddsWT[filter,]
nrow(ddsWT)
Thanks again!
It's up to you, but I typically use a much lower filter if I do any filtering, e.g. normalized count of 10.
yep! I apologise, actually that was arbitrary... And I am looking at 1-10 as the filter. The reason I went up to 100 is because when I was filtering any count above 0 (i.e. 10), I was seeing so many 0's in the count matrix after the LRT, so increased it to 100 just to see what happens when I have a higher filter.
One further question actually (I did ask this in another post but just wanted to clarify).. If i want to see how genes change in time only, I have question about which design to use as I am unsure. To look for differences in time only.. Is there a difference to having a full design of design =~Genotype + Time +Genotype:Time
or a full design of ~Genotype+Time
and then reduced of ~Genotype
Does including the interactor of Genotype+Time actually make a difference if i only want to look at differences in time only? If so how and why are the two full designs different apart from the interactor
The easiest way to find genes that change in time only would be to specify
full=~genotype + time
andreduced=~genotype
. You can perform test="LRT" to obtain a gene list.Many thanks!! Just out of interest though, just so I understand the full design itself, would I get the same list of DE genes if my full design is ~Genotype + Time +Genotype:Time and reduced genotype
as full design of ~Genotype + Time and reduced genotype?
No, the first set of designs you have allows for genotypic-specific time changes, whereas the second set of designs has a single time profile with only shifts due to genotype at baseline. The list will definitely be different, although some overlap.
You could get a gene in the significant list for the first set of designs which only changes over time for one genotype. Whereas for the second set of designs, the genes will be more often showing changes over time for all genotypes.
This kind of an analysis choice is up to you.
Amazing! Thank you so much for making that crystal clear!