I have an RNA-sequencing dataset that has 8 treatment groups and 2 factors. One factor is "virus" and has 2 levels ("N" = no virus, "V" = virus). The other factor is "diet" and has 4 levels ranging from great to poor diets ("B", "C", "D", "E"). Each treatment group has 6 replicates, creating a total of 48 samples. Each sample is actually a mix of 4 individual specimens exposed to the same conditions. Hence, a total of 192 specimens were used.
For the most part, specimens behaved as expected in terms of morbidity phenotype. Those specimens that were inoculated with virus and had a poor diet showed higher mortality rate than those specimens that were not inoculated with virus and had excellent diet. However, there were some outliers. There were also outliers in terms of other measurements made, such as certain samples from the "non virus" group having large counts of virus (all specimens have some baseline viral level), or certain samples from the "virus" group having small counts of virus.
In any case, I have been using limmaVoom to examine all 28 pairwise combinations of the 8 treatment groups. Some treatment pairs result in 2000 DEGs, but some treatment pairs result in 0 DEGs. It was surprising to see certain treatment pairs have 0 DEGs, especially if the expectation for that particular treatment pair was to show many DEGs. Moreover, when I examined the DEGs, some of them did not seem too reliable (I often saw lots of variability in the replicate count values but small variability between the treatment pairs). This all made me concerned that both the false positive and false negative rates may be rather high.
As a result, I have been trying to examine whether 7 possible confounding variables (some categorical, some numerical) need to be taken into account (either as weights or blocks). To be clear, the caveat is that I feel rather lost and may be completely taking the wrong approach. However, I have been taking approaches mentioned in Section 9.7 (Multi-level Experiments) in the limmaVoom vignette (version revised October 16, 2016). I have also been working off an example I found online at (https://support.bioconductor.org/p/59700/).
As is shown in the MWE below, I basically calculated the number of DEGs and the p-values of all genes for all 28 treatment pair combinations when none of the 7 possible confounding variables were blocked. The results are in the objects sigGenes0 and genePval0. Then, I repeated this process - only now blocking on each of the 7 possible confounding variables separately. This resulted in the objects laneResults, dayResults, mortalityResults, lv1Results, lv2Rsults, rnaConcResults, and rinResults. Unfortunately, 3 of these objects resulted errors ("missing value where TRUE/FALSE needed"). Hence, I only continued with 4 of the confounding variables. For each of these 4 confounding variables separately, I calculated the number of DEGs and the p-values of all genes for all 28 treatment pair combinations (these values are stored in laneResults, dayResults, lv1Results, lv2Results).
I then created three final objects:
1) blockOverallCorr - Shows the resulting correlation from running duplicateCorrelation() with each of the 4 possible confounding variables individually.
2) blockOverallCorr - Shows the correlation between p-values from all genes when no block was used versus when one of the 4 confounding variables was blocked for all 28 treatment pairs.
3) blockDEG - Shows the number of DEGs for all 28 treatment pairs when one of the 4 confounding variables was blocked.
Below is my MWE:
library(limma) library(Glimma) library(GGally) library(ggplot2) library(edgeR) library(DESeq2) ################# Create data and setup design rm(list=ls()) set.seed(1) counts <- as.data.frame(matrix(rpois(24000, lambda = 500), ncol = 48)) colnames(counts) <- c(paste0("NB", ".", c(1:6)), paste0("NC", ".", c(1:6)), paste0("ND", ".", c(1:6)), paste0("NE", ".", c(1:6)), paste0("VB", ".", c(1:6)), paste0("VC", ".", c(1:6)), paste0("VD", ".", c(1:6)), paste0("VE", ".", c(1:6))) x <- DGEList(counts=counts) # Simulate possible confounding variables exVars <- data.frame("Sample" = colnames(counts), "Lane" = sample(c("L1", "L2"), 48, replace=T), "Day" = sample(c("1", "2"), 48, replace=T), "Mortality" = runif(48,0,1), "LogVirus1" = runif(48,2,7), "LogVirus2" = runif(48,2,7), "rnaConc" = runif(48,50,400), "RIN" = runif(48,7,10)) x$samples$group <- as.factor(unlist(lapply(as.character(exVars$Sample), function(x) substring(unlist(strsplit(x, "[.]"))[1],1)))) x$samples$virus <- as.factor(unlist(lapply(as.character(exVars$Sample), function(x) substring(unlist(strsplit(x, "[.]"))[1],1,1)))) x$samples$diet <- as.factor(unlist(lapply(as.character(exVars$Sample), function(x) substring(unlist(strsplit(x, "[.]"))[1],2,2)))) x$samples$lane <- exVars$Lane x$samples$day <- exVars$Day x$samples$mortality <- exVars$Mortality x$samples$lv1 <- exVars$LogVirus1 x$samples$lv2 <- exVars$LogVirus2 x$samples$rnaConc <- exVars$rnaConc x$samples$rin <- exVars$RIN # Filter out genes that have low read counts (doesn't filter out any genes in this MWE, but does filter out genes in my real data) keep.exprs <- which(rowSums(x[[1]]>10)>24) x <- x[keep.exprs,, keep.lib.sizes=FALSE] counts <- x[[1]] design <- model.matrix(~0 + x$samples$group) colnames(design) <- levels(x$samples$group) cont_matrix <- makeContrasts(NBvsNC = NB-NC, NBvsND = NB-ND, NBvsNE = NB-NE, NBvsVB = NB-VB, NBvsVC = NB-VC, NBvsVD = NB-VD, NBvsVE = NB-VE, NCvsND = NC-ND, NCvsNE = NC-NE, NCvsVB = NC-VB, NCvsVC = NC-VC, NCvsVD = NC-VD, NCvsVE = NC-VE, NDvsNE = ND-NE, NDvsVB = ND-VB, NDvsVC = ND-VC, NDvsVD = ND-VD, NDvsVE = ND-VE, NEvsVB = NE-VB, NEvsVC = NE-VC, NEvsVD = NE-VD, NEvsVE = NE-VE, VBvsVC = VB-VC, VBvsVD = VB-VD, VBvsVE = VB-VE, VCvsVD = VC-VD, VCvsVE = VC-VE, VDvsVE = VD-VE, levels = colnames(design)) ################# No blocking # Creates sigGenes0, genePval0) y <- DGEList(counts) y <- calcNormFactors(y, method = "TMM") v <- voom(y, design) fit <- lmFit(v, design) fit2 <- contrasts.fit(fit, cont_matrix) fit2 <- eBayes(fit2) pairNames <- colnames(cont_matrix) sigGenes0 <- list() genePval0 <- list() for (i in 1:length(pairNames)) { temp <- topTreat(fit2, coef=i, n=Inf) sigRows <- which(temp$adj.P.Val<0.05) sigGenes0[[ pairNames[i] ]] <- nrow(temp[sigRows,]) genePval0[[ pairNames[i] ]] <- temp$adj.P.Val } ################# Double voom (voom runs twice) # Creates laneResults, dayResults, mortalityResults, lv1Results, lv2Results, rnaConcResults, rinResults doCorr <- function(variable){ y <- DGEList(counts) y <- calcNormFactors(y, method = "TMM") v <- voom(y, design) corfit <- duplicateCorrelation(v, design, block = x$samples[[variable]]) v <- voom(y, design, block = x$samples[[variable]], correlation = corfit$consensus) fit <- lmFit(v, design, block = x$samples[[variable]], correlation = corfit$consensus) fit2 <- contrasts.fit(fit, cont_matrix) fit2 <- eBayes(fit2) pairNames <- colnames(cont_matrix) sigGenes2 <- list() genePval2 <- list() corr2 <- list() for (i in 1:length(pairNames)) { temp <- topTreat(fit2, coef=i, n=Inf) sigRows <- which(temp$adj.P.Val<0.05) sigGenes2[[ pairNames[i] ]] <- nrow(temp[sigRows,]) genePval2[[ pairNames[i] ]] <- temp$adj.P.Val corr2[[pairNames[i]]] <- cor(genePval0[[pairNames[i]]], genePval2[[pairNames[i]]]) } return(list(overallCorr = corfit$consensus, corrPairs = corr2, sigPairs = sigGenes2)) } laneResults <- doCorr("lane") dayResults <- doCorr("day") mortalityResults <- doCorr("mortality") #Error lv1Results <- doCorr("LogVirus1") lv2Results <- doCorr("LogVirus2") rnaConcResults <- doCorr("rnaConc") #Error rinResults <- doCorr("rin") #Error ################# Overall summary # Create blockDEG, blockCorrPairs, blockOverallCorr blockDEG = data.frame() for (i in 1:length(pairNames)) { row <- c(laneResults$sigPairs[[pairNames[i]]], dayResults$sigPairs[[pairNames[i]]], lv1Results$sigPairs[[pairNames[i]]], lv2Results$sigPairs[[pairNames[i]]]) blockDEG <- rbind(blockDEG, row) } rownames(blockDEG) <- pairNames colnames(blockDEG) <- c("lane", "day", "lv1", "lv2") blockCorrPairs = data.frame() for (i in 1:length(pairNames)) { row <- c(laneResults$corrPairs[[pairNames[i]]], dayResults$corrPairs[[pairNames[i]]], lv1Results$corrPairs[[pairNames[i]]], lv2Results$corrPairs[[pairNames[i]]]) blockCorrPairs <- rbind(blockCorrPairs, row) } rownames(blockCorrPairs) <- pairNames colnames(blockCorrPairs) <- c("lane", "day", "lv1", "lv2") blockOverallCorr <- data.frame() row <- c(laneResults$overallCorr, dayResults$overallCorr, lv1Results$overallCorr, lv2Results$overallCorr) blockOverallCorr <- rbind(blockOverallCorr, row) colnames(blockOverallCorr) <- c("lane", "day", "lv1", "lv2")
In my real data, blockOverallCorr ranged from between -0.22 to 0.09. For my blockCorrPairs object, my real data mostly showed all high correlations (0.99 or above). However, there were a very few number of treatment pairs that showed lower correlation (0.31) when a certain possible confounding variable was blocked. My blockDEG object showed mostly consistency for treatment pairs. In other words, treatment pairs that had 2000 DEGs seemed to remain that large even if a possible confounding variable was blocked, and treatment pairs that had 0 DEGs seemed to remain that small even if a possible confounding variable was blocked.
I strongly sense that my approach is flawed. My main goal is to determine how to account for these 7 possible confounding variables in my model when estimating the DEGs between the 28 treatment pairs. However, my current approach only works on each confounding variable being blocked individually (I cannot use the duplicateCorrelation() command on all confounding variables at the same time).
I am hoping to see that when all 7 possible confounding variables are weighted or blocked appropriately, then the DEGs in the 28 treatment pairs might not have such a large range from 0 to 2000, and the DEGs may look convincing (small variance in replicates and large variance between treatments). I apologize for such a long post, but would love to hear any advice on how I should rethink my approach. Thank you.
My session information is below:
> sessionInfo() R version 3.3.2 (2016-10-31) Platform: x86_64-apple-darwin13.4.0 (64-bit) Running under: OS X Mavericks 10.9.5 attached base packages: [1] stats4 parallel stats graphics grDevices utils datasets methods [9] base other attached packages: [1] readr_1.1.1 Glimma_1.2.1 rmarkdown_1.6 [4] EDASeq_2.8.0 ShortRead_1.32.1 GenomicAlignments_1.10.1 [7] SummarizedExperiment_1.4.0 Rsamtools_1.26.2 GenomicRanges_1.26.4 [10] GenomeInfoDb_1.10.3 Biostrings_2.42.1 XVector_0.14.1 [13] IRanges_2.8.2 S4Vectors_0.12.2 BiocParallel_1.8.2 [16] Biobase_2.34.0 BiocGenerics_0.20.0 GGally_1.3.1 [19] edgeR_3.16.5 limma_3.30.13 data.table_1.10.4 [22] tidyr_0.6.3 htmlwidgets_0.9 hexbin_1.27.1 [25] plotly_4.7.0 ggplot2_2.2.1.9000 shinydashboard_0.6.1 [28] shiny_1.0.3 readxl_1.0.0 downloader_0.4 [31] bindrcpp_0.2 gapminder_0.2.0 dplyr_0.7.1 [34] jsonlite_1.5 loaded via a namespace (and not attached): [1] colorspace_1.3-2 hwriter_1.3.2 rprojroot_1.2 [4] htmlTable_1.9 base64enc_0.1-3 bit64_0.9-7 [7] AnnotationDbi_1.36.2 codetools_0.2-15 splines_3.3.2 [10] R.methodsS3_1.7.1 DESeq_1.26.0 geneplotter_1.52.0 [13] knitr_1.16 Formula_1.2-1 annotate_1.52.1 [16] cluster_2.0.6 R.oo_1.21.0 packagedocs_0.4.0 [19] httr_1.2.1 backports_1.1.0 assertthat_0.2.0 [22] Matrix_1.2-10 lazyeval_0.2.0 acepack_1.4.1 [25] htmltools_0.3.6 tools_3.3.2 gtable_0.2.0 [28] glue_1.1.1 Rcpp_0.12.11 highlight_0.4.7.1 [31] cellranger_1.1.0 rtracklayer_1.34.2 crosstalk_1.0.1 [34] stringr_1.2.0 mime_0.5 devtools_1.13.2 [37] statmod_1.4.30 XML_3.98-1.9 zlibbioc_1.20.0 [40] scales_0.4.1.9000 aroma.light_3.4.0 hms_0.3 [43] RColorBrewer_1.1-2 yaml_2.1.14 curl_2.7 [46] gridExtra_2.2.1 memoise_1.1.0 biomaRt_2.30.0 [49] rpart_4.1-11 reshape_0.8.6 latticeExtra_0.6-28 [52] stringi_1.1.5 RSQLite_2.0 genefilter_1.56.0 [55] checkmate_1.8.3 GenomicFeatures_1.26.4 rlang_0.1.1 [58] pkgconfig_2.0.1 matrixStats_0.52.2 bitops_1.0-6 [61] evaluate_0.10.1 lattice_0.20-35 purrr_0.2.2.2 [64] bindr_0.1 labeling_0.3 lazyrmd_0.2.0 [67] bit_1.1-12 plyr_1.8.4 magrittr_1.5 [70] DESeq2_1.14.1 R6_2.2.2 Hmisc_4.0-3 [73] DBI_0.7 whisker_0.3-2 foreign_0.8-69 [76] withr_1.0.2 survival_2.41-3 RCurl_1.95-4.8 [79] nnet_7.3-12 tibble_1.3.3 locfit_1.5-9.1 [82] grid_3.3.2 blob_1.1.0 digest_0.6.12 [85] xtable_1.8-2 httpuv_1.3.5 R.utils_2.5.0 [88] munsell_0.4.3 viridisLite_0.2.0
1. Well, it is what it is. For these pairs, the data's telling you that there is no association between morbidity and expression. There's no way to tell if this is caused by some failure in analysis without a positive control.
2. The first plot is opaque to me. Why do you have boxplots across all genes, when you're only interested in a single gene? Are these log-counts, instead of counts? Moreover, you will find that the DE is a lot more convincing if you focus the y-axis to the relevant scale, rather than forcing it to
(0, 20)
. The log-fold change reported by limma is 0.5, so obviously the DE will look flat if your y-axis ranges are too large.3. If these factors have any effect, you'll have to accept them as part of the biological variability of your system. Clearly they are different between groups, so any attempt to regress them out will wipe out DE between groups.
4. I find this is difficult to imagine, but it suggests that the a single sample is driving the majority of DE between those treatment pairs. You can provide some protection by setting
robust=TRUE
ineBayes
to avoid shrinkage of large variances. You can also replacevoom
withvoomWithQualityWeights
to downweight low-quality samples.6. If your groups differ systematically in virus concentration and morbidity, it would be madness to try to regress them out.