Dear all,
I have an RNASeq experiment of 12 samples for 2 conditions (6 samples per condition).
Reads are paired-end, 100 bp.
I performed differential expression analysis using DESeq2 and was surprised not to find a gene as significantly differentially expressed (it has been reported in an array and another RNASeq experiments studying the same conditions-but different samples).
Here is the DESeq2 procedure I followed:
library("DESeq2") DataFrame=data.frame(Nom,Fichier,HIST)
################################################ ### Model ### ################################################ dds_raw=DESeqDataSetFromHTSeqCount(DataFrame,"/home/Results/Data/26022016", design= ~HIST) str(colData(dds_raw)$HIST) dds_raw$HIST = relevel(dds_raw$HIST, ref="Cond1") dds <- DESeq(dds_raw, minReplicatesForReplace=6)##################################### ### Differential expression Tests ### ##################################### resMLE <- results(dds, addMLE=TRUE, cooksCutoff=FALSE,alpha=0.05) summary(resMLE) resMLEOrdered <- resMLE[order(resMLE$padj),] write.table(resMLEOrdered, file="ResMLETable",quote=FALSE,row.names=TRUE, col.names=TRUE,sep="\t") write.table(counts(dds,normalized=TRUE), file="NormalizedTable",quote=FALSE,row.names=TRUE, col.names=TRUE,sep="\t")
and the sessionInfo
> sessionInfo()
R version 3.2.0 (2015-04-16)
Platform: x86_64-pc-linux-gnu (64-bit)
Running under: Ubuntu 14.04.3 LTS
locale:
[1] LC_CTYPE=en_US.UTF-8 LC_NUMERIC=C LC_TIME=fr_FR.UTF-8 LC_COLLATE=en_US.UTF-8 LC_MONETARY=fr_FR.UTF-8 LC_MESSAGES=en_US.UTF-8
[7] LC_PAPER=fr_FR.UTF-8 LC_NAME=C LC_ADDRESS=C LC_TELEPHONE=C LC_MEASUREMENT=fr_FR.UTF-8 LC_IDENTIFICATION=C
attached base packages:
[1] parallel stats4 stats graphics grDevices utils datasets methods base
other attached packages:
[1] DESeq2_1.10.1 RcppArmadillo_0.6.500.4.0 Rcpp_0.12.3 SummarizedExperiment_1.0.2 Biobase_2.30.0 GenomicRanges_1.22.3
[7] GenomeInfoDb_1.6.1 IRanges_2.4.6 S4Vectors_0.8.7 BiocGenerics_0.16.1
loaded via a namespace (and not attached):
[1] RColorBrewer_1.1-2 futile.logger_1.4.1 plyr_1.8.3 XVector_0.10.0 futile.options_1.0.0 tools_3.2.0 zlibbioc_1.16.0 rpart_4.1-10
[9] RSQLite_1.0.0 annotate_1.48.0 gtable_0.1.2 lattice_0.20-33 DBI_0.3.1 gridExtra_2.0.0 genefilter_1.52.1 cluster_2.0.3
[17] locfit_1.5-9.1 grid_3.2.0 nnet_7.3-11 AnnotationDbi_1.32.3 XML_3.98-1.3 survival_2.38-3 BiocParallel_1.4.3 foreign_0.8-66
[25] latticeExtra_0.6-26 Formula_1.2-1 geneplotter_1.48.0 ggplot2_2.0.0 lambda.r_1.1.7 Hmisc_3.17-1 scales_0.3.0 splines_3.2.0
[33] xtable_1.8-0 colorspace_1.2-6 acepack_1.3-3.3 munsell_0.4.2
Here are the normalized counts and stats for my gene of interest:
90.7033407596 0 0 2.3672116923 0 0 939.0444914676 404.8903299633 665.5848352517 0 1034.9196508454 438.2486305103
baseMean log2FoldChange lfcMLE lfcSE stat pvalue padj
297.9798742075 -1.0090565228 -5.2239343815 0.5743666929 -1.7568158725 0.0789491987 0.4351815966
The estimated log2FC is just above 1 but the adjusted p-value is very high, especially compare to this other gene, which is similar to my gene of interest, but with less magnitude and with lower counts:
4.1228791254 0 0.8533500773 0 0 0.7423895021 99.6303301923 89.0758725919 128.7791432768 0 69.6249931217 43.9908663202
baseMean log2FoldChange lfcMLE lfcSE stat pvalue padj
36.4016520173 -3.353879756 -6.1662095289 0.6214945118 -5.3964752585 0.000000068 0.000052119
In condition 2, one value is 0, as in condition 1. But it is also the case for the second gene. It seems weird to me that this second gene is differential but not my gene of interest. I am probably missing something here.
I have posted this question on Seqanswers (http://seqanswers.com/forums/showthread.php?t=66831) and got few comments. I tried for example to exclude the "outlier sample" from the second condition but still, this gene is not differential.
Thank you in advance for any help,
Jane
Hi Jane,
Thanks for sending the dds. So I took a look at this gene, and tried using a likelihood ratio test to see if it would be less sensitive to the outlier count and the p-value is still not small enough to pass multiple test correction:
The problem is that the normalized counts are quite variable for this gene: 0-1035 in one group vs 0-91 in the other, despite that looking by eye it seems very likely to be DE if you ignore the 0 in the first group. Unfortunately, the negative binomial with 6 samples here is sensitive to the variation it sees and only gives this a p-value of 0.07 for the Wald test and 0.03 for the LRT, neither of which will pass multiple test correction.
You could try edgeR robust, which would likely downweight the outlier count in the first group and have more confident inference in this example.
Thank you a lot Michael for your time and help.
When I remove the "outlier sample" from the second condition (6 vs 5), the normalized counts for this gene vary in these ranges: 0-91 in one group vs 409-1035 in the other, but the adjusted p-value is still no significant (http://seqanswers.com/forums/showthread.php?t=66831). I am very surprised.
Anyway, it will be the occasion to test edgeR.
Actually, now that I look at it, I think it's the count of 91 which gives the you a large dispersion estimate for this gene.
Note that you do get a small p-value 0.02, just not small enough for multiple test correction.
Using the edgeR robust sample weighting will help.