Dear all,
I have an RNASeq experiment of 12 samples for 2 conditions (6 samples per condition).
Reads are paired-end, 100 bp.
I performed differential expression analysis using DESeq2 and was surprised not to find a gene as significantly differentially expressed (it has been reported in an array and another RNASeq experiments studying the same conditions-but different samples).
Here is the DESeq2 procedure I followed:
library("DESeq2") DataFrame=data.frame(Nom,Fichier,HIST)################################################ ### Model ### ################################################ dds_raw=DESeqDataSetFromHTSeqCount(DataFrame,"/home/Results/Data/26022016", design= ~HIST) str(colData(dds_raw)$HIST) dds_raw$HIST = relevel(dds_raw$HIST, ref="Cond1") dds <- DESeq(dds_raw, minReplicatesForReplace=6)##################################### ### Differential expression Tests ### ##################################### resMLE <- results(dds, addMLE=TRUE, cooksCutoff=FALSE,alpha=0.05) summary(resMLE) resMLEOrdered <- resMLE[order(resMLE$padj),] write.table(resMLEOrdered, file="ResMLETable",quote=FALSE,row.names=TRUE, col.names=TRUE,sep="\t") write.table(counts(dds,normalized=TRUE), file="NormalizedTable",quote=FALSE,row.names=TRUE, col.names=TRUE,sep="\t")
and the sessionInfo
> sessionInfo()
R version 3.2.0 (2015-04-16)
Platform: x86_64-pc-linux-gnu (64-bit)
Running under: Ubuntu 14.04.3 LTS
locale:
 [1] LC_CTYPE=en_US.UTF-8       LC_NUMERIC=C               LC_TIME=fr_FR.UTF-8        LC_COLLATE=en_US.UTF-8     LC_MONETARY=fr_FR.UTF-8    LC_MESSAGES=en_US.UTF-8   
 [7] LC_PAPER=fr_FR.UTF-8       LC_NAME=C                  LC_ADDRESS=C               LC_TELEPHONE=C             LC_MEASUREMENT=fr_FR.UTF-8 LC_IDENTIFICATION=C       
attached base packages:
[1] parallel  stats4    stats     graphics  grDevices utils     datasets  methods   base     
other attached packages:
 [1] DESeq2_1.10.1              RcppArmadillo_0.6.500.4.0  Rcpp_0.12.3                SummarizedExperiment_1.0.2 Biobase_2.30.0             GenomicRanges_1.22.3      
 [7] GenomeInfoDb_1.6.1         IRanges_2.4.6              S4Vectors_0.8.7            BiocGenerics_0.16.1       
loaded via a namespace (and not attached):
 [1] RColorBrewer_1.1-2   futile.logger_1.4.1  plyr_1.8.3           XVector_0.10.0       futile.options_1.0.0 tools_3.2.0          zlibbioc_1.16.0      rpart_4.1-10        
 [9] RSQLite_1.0.0        annotate_1.48.0      gtable_0.1.2         lattice_0.20-33      DBI_0.3.1            gridExtra_2.0.0      genefilter_1.52.1    cluster_2.0.3       
[17] locfit_1.5-9.1       grid_3.2.0           nnet_7.3-11          AnnotationDbi_1.32.3 XML_3.98-1.3         survival_2.38-3      BiocParallel_1.4.3   foreign_0.8-66      
[25] latticeExtra_0.6-26  Formula_1.2-1        geneplotter_1.48.0   ggplot2_2.0.0        lambda.r_1.1.7       Hmisc_3.17-1         scales_0.3.0         splines_3.2.0       
[33] xtable_1.8-0         colorspace_1.2-6     acepack_1.3-3.3      munsell_0.4.2
Here are the normalized counts and stats for my gene of interest:
90.7033407596	0	0	2.3672116923	0	0		939.0444914676	404.8903299633	665.5848352517	0	1034.9196508454	438.2486305103
baseMean	log2FoldChange	lfcMLE	lfcSE	stat	pvalue	padj
297.9798742075	-1.0090565228	-5.2239343815	0.5743666929	-1.7568158725	0.0789491987	0.4351815966
The estimated log2FC is just above 1 but the adjusted p-value is very high, especially compare to this other gene, which is similar to my gene of interest, but with less magnitude and with lower counts:
4.1228791254	0	0.8533500773	0	0	0.7423895021		99.6303301923	89.0758725919	128.7791432768	0	69.6249931217	43.9908663202
baseMean	log2FoldChange	lfcMLE	lfcSE	stat	pvalue	padj
36.4016520173	-3.353879756	-6.1662095289	0.6214945118	-5.3964752585	0.000000068	0.000052119
In condition 2, one value is 0, as in condition 1. But it is also the case for the second gene. It seems weird to me that this second gene is differential but not my gene of interest. I am probably missing something here.
I have posted this question on Seqanswers (http://seqanswers.com/forums/showthread.php?t=66831) and got few comments. I tried for example to exclude the "outlier sample" from the second condition but still, this gene is not differential.
Thank you in advance for any help,
Jane

Hi Jane,
Thanks for sending the dds. So I took a look at this gene, and tried using a likelihood ratio test to see if it would be less sensitive to the outlier count and the p-value is still not small enough to pass multiple test correction:
> res[8443,] log2 fold change (MLE): HIST H3F3A vs HIST1H3B LRT p-value: '~ HIST' vs '~ 1' DataFrame with 1 row and 6 columns baseMean log2FoldChange lfcSE stat pvalue padj <numeric> <numeric> <numeric> <numeric> <numeric> <numeric> HOXD8 297.9799 -5.223934 2.053248 4.532549 0.03325605 0.3089909The problem is that the normalized counts are quite variable for this gene: 0-1035 in one group vs 0-91 in the other, despite that looking by eye it seems very likely to be DE if you ignore the 0 in the first group. Unfortunately, the negative binomial with 6 samples here is sensitive to the variation it sees and only gives this a p-value of 0.07 for the Wald test and 0.03 for the LRT, neither of which will pass multiple test correction.
You could try edgeR robust, which would likely downweight the outlier count in the first group and have more confident inference in this example.
Thank you a lot Michael for your time and help.
When I remove the "outlier sample" from the second condition (6 vs 5), the normalized counts for this gene vary in these ranges: 0-91 in one group vs 409-1035 in the other, but the adjusted p-value is still no significant (http://seqanswers.com/forums/showthread.php?t=66831). I am very surprised.
Anyway, it will be the occasion to test edgeR.
Actually, now that I look at it, I think it's the count of 91 which gives the you a large dispersion estimate for this gene.
Note that you do get a small p-value 0.02, just not small enough for multiple test correction.
Using the edgeR robust sample weighting will help.