The analysis is for human RNAseq dataset with control (N=151) and disease condition 1 (N=107) and disease condition 2 (N=260). There is evidence of batch effects as well as effects of some demographic variables between the groups (MDS/PCA plots). To account for those differences this was the design for the analysis:
expression ~ age + sex + race + presence of risk gene allele + presence of infarcts +
presence of additional pathology + batch + postmortem interval (PMI) + condition.
Of these age and PMI are continuous variables while are remaining are factors with 2 (sex, presence of risk allele, presence of infarcts, condition), 3 (race), 4 (presence of additional pathology) and 9 (Batch) levels.
When I analyze all samples together but include contrasts of interest (condition 1 vs control, condition 2 vs control) I get 0 DGE between condition 1 vs control. When I analyze condition 1 vs control separately I get ~100 DGE (48 Down, 69 Up) with marginal adj.p.values (0.02 - 0.04). The CV when I include all the samples is 0.26 and when I analyze control and condition 1 separately is 0.27. In the analysis of control vs condition 1 the range for prior.df is (2.89-4.75) while the range (3.76-4.19) narrows down in the analysis of all the samples. Similarly, the tagwise dispersion difference which was greater in the analysis of control vs condition 1 (0.004 -~38) when compared to analysis of all the samples (0.004- ~29). So does that mean there is more variability in individual observations or is it variability in the estimation when only the two groups are compared ?
I used the voom with weights for all my analysis. Plotting the sample weights against all covariates does not show any obvious trend or segregation for both the control vs condition 1 analysis and the analysis with all samples. The range for weights and sample weights in both the cases i.e. two group comparison vs all samples comparison are very similar with only improvement in the upper bound of the values in the latter case. I utilized the SVA package to identify any latent variables that might influence the gene expression but get n.sv =0 with the model of interest.
I repeated the condition 1 vs control analysis with DESeq2, with the same design and get 13 DGE (5 Up, 8 Down) with similar padj range (0.01-0.04).
In this case would this be the right conclusions: No apparent DGE between condition 1 vs control when all samples are analyzed together is not reflective of the underlying biology. Even with N > 100 , the high degree of variance in the observations associated with condition 1 makes it statistically under-powered to detect changes. Detection of few DGE with marginal adjusted p-values when condition 1 vs control are analyzed separately point to potential for improvement by increasing sample size ?
Here is the relevant code that was same between two group and all group comparison.
y=DGEList(txi$counts)
y=calcNormFactors(y)
mean_log_cpm=aveLogCPM(y)
keep_genes=mean_log_cpm>=1
y=y[keep_genes,]
v=voomWithQualityWeights(y,design)
vfit=lmFit(v,design)
contr.matrix=makeContrasts(c1.ctr=condition1-control,c2.ctr=condition2-control,c2.c1=condition2-condition1, levels=colnames(design))
efit=eBayes(vfit,robust=TRUE)
tfit=topTable(efit,coef=coef,number=Inf,adjust.method="BH")
sample_cv=estimateDisp(y,design,robust=TRUE)
dds=DESeqDataSetFromTximport(txi,colData=samples,design)
dds=DESeq(dds)
res=results(dds,alpha=0.05)
SessionInfo
R version 3.4.0 (2017-04-21)
Platform: x86_64-pc-linux-gnu (64-bit)
Running under: CentOS release 6.6 (Final)
Matrix products: default
BLAS: /opt/R/lib64/R/lib/libRblas.so
LAPACK: /opt/R/lib64/R/lib/libRlapack.so
locale:
[1] C
attached base packages:
[1] parallel stats4 stats graphics grDevices utils datasets
[8] methods base
other attached packages:
[1] tximport_1.4.0 DESeq2_1.16.1
[3] SummarizedExperiment_1.6.3 DelayedArray_0.2.7
[5] matrixStats_0.52.2 Biobase_2.36.2
[7] GenomicRanges_1.28.4 GenomeInfoDb_1.12.2
[9] IRanges_2.10.2 S4Vectors_0.14.3
[11] BiocGenerics_0.22.0 edgeR_3.18.1
[13] limma_3.32.3 sva_3.24.4
[15] BiocParallel_1.10.1 genefilter_1.58.1
[17] mgcv_1.8-22 nlme_3.1-131
loaded via a namespace (and not attached):
[1] bitops_1.0-6 bit64_0.9-7 RColorBrewer_1.1-2
[4] tools_3.4.0 backports_1.1.0 rpart_4.1-11
[7] Hmisc_4.0-3 DBI_0.7 lazyeval_0.2.0
[10] colorspace_1.3-2 nnet_7.3-12 graphite_1.22.0
[13] gridExtra_2.2.1 bit_1.1-12 compiler_3.4.0
[16] graph_1.54.0 htmlTable_1.9 scales_0.4.1
[19] checkmate_1.8.3 rappdirs_0.3.1 stringr_1.2.0
[22] digest_0.6.12 foreign_0.8-68 DOSE_3.2.0
[25] XVector_0.16.0 base64enc_0.1-3 pkgconfig_2.0.1
[28] htmltools_0.3.6 htmlwidgets_0.9 rlang_0.1.1
[31] RSQLite_2.0 acepack_1.4.1 GOSemSim_2.2.0
[34] RCurl_1.95-4.8 magrittr_1.5 GO.db_3.4.1
[37] GenomeInfoDbData_0.99.0 Formula_1.2-2 Matrix_1.2-12
[40] Rcpp_0.12.12 munsell_0.4.3 stringi_1.1.5
[43] zlibbioc_1.22.0 plyr_1.8.4 qvalue_2.8.0
[46] grid_3.4.0 blob_1.1.0 DO.db_2.9
[49] lattice_0.20-35 splines_3.4.0 annotate_1.54.0
[52] locfit_1.5-9.1 knitr_1.16 fgsea_1.2.1
[55] igraph_1.1.2 geneplotter_1.54.0 reshape2_1.4.2
[58] fastmatch_1.1-0 XML_3.98-1.9 glue_1.2.0
[61] latticeExtra_0.6-28 data.table_1.10.4 gtable_0.2.0
[64] purrr_0.2.4 tidyr_0.7.2 ggplot2_2.2.1
[67] ReactomePA_1.20.2 xtable_1.8-2 reactome.db_1.59.1
[70] survival_2.41-3 tibble_1.3.3 clusterProfiler_3.4.4
[73] rvcheck_0.0.9 AnnotationDbi_1.38.1 memoise_1.1.0
[76] cluster_2.0.6
I'd recommend to pick one package and go with that. Here it seems your primary analysis is limma-voom, so I'm removing the DESeq2 tag.