Question: Accounting for both categorical and numerical possible confounding variables in the limmaVoom pipeline
1
2.4 years ago by
LR030610
LR030610 wrote:

I have an RNA-sequencing dataset that has 8 treatment groups and 2 factors. One factor is "virus" and has 2 levels ("N" = no virus, "V" = virus). The other factor is "diet" and has 4 levels ranging from great to poor diets ("B", "C", "D", "E"). Each treatment group has 6 replicates, creating a total of 48 samples. Each sample is actually a mix of 4 individual specimens exposed to the same conditions. Hence, a total of 192 specimens were used.

For the most part, specimens behaved as expected in terms of morbidity phenotype. Those specimens that were inoculated with virus and had a poor diet showed higher mortality rate than those specimens that were not inoculated with virus and had excellent diet. However, there were some outliers. There were also outliers in terms of other measurements made, such as certain samples from the "non virus" group having large counts of virus (all specimens have some baseline viral level), or certain samples from the "virus" group having small counts of virus.

In any case, I have been using limmaVoom to examine all 28 pairwise combinations of the 8 treatment groups. Some treatment pairs result in 2000 DEGs, but some treatment pairs result in 0 DEGs. It was surprising to see certain treatment pairs have 0 DEGs, especially if the expectation for that particular treatment pair was to show many DEGs. Moreover, when I examined the DEGs, some of them did not seem too reliable (I often saw lots of variability in the replicate count values but small variability between the treatment pairs). This all made me concerned that both the false positive and false negative rates may be rather high.

As a result, I have been trying to examine whether 7 possible confounding variables (some categorical, some numerical) need to be taken into account (either as weights or blocks). To be clear, the caveat is that I feel rather lost and may be completely taking the wrong approach. However, I have been taking approaches mentioned in Section 9.7 (Multi-level Experiments) in the limmaVoom vignette (version revised October 16, 2016). I have also been working off an example I found online at (https://support.bioconductor.org/p/59700/).

As is shown in the MWE below, I basically calculated the number of DEGs and the p-values of all genes for all 28 treatment pair combinations when none of the 7 possible confounding variables were blocked. The results are in the objects sigGenes0 and genePval0. Then, I repeated this process - only now blocking on each of the 7 possible confounding variables separately. This resulted in the objects laneResults, dayResults, mortalityResults, lv1Results, lv2Rsults, rnaConcResults, and rinResults. Unfortunately, 3 of these objects resulted errors ("missing value where TRUE/FALSE needed"). Hence, I only continued with 4 of the confounding variables. For each of these 4 confounding variables separately, I calculated the number of DEGs and the p-values of all genes for all 28 treatment pair combinations (these values are stored in laneResults, dayResults, lv1Results, lv2Results).

I then created three final objects:

1) blockOverallCorr - Shows the resulting correlation from running duplicateCorrelation() with each of the 4 possible confounding variables individually.

2) blockOverallCorr - Shows the correlation between p-values from all genes when no block was used versus when one of the 4 confounding variables was blocked for all 28 treatment pairs.

3) blockDEG - Shows the number of DEGs for all 28 treatment pairs when one of the 4 confounding variables was blocked.

Below is my MWE:

library(limma)
library(Glimma)
library(GGally)
library(ggplot2)
library(edgeR)
library(DESeq2)

################# Create data and setup design
rm(list=ls())
set.seed(1)
counts <- as.data.frame(matrix(rpois(24000, lambda = 500), ncol = 48))
colnames(counts) <- c(paste0("NB", ".", c(1:6)), paste0("NC", ".", c(1:6)), paste0("ND", ".", c(1:6)), paste0("NE", ".", c(1:6)), paste0("VB", ".", c(1:6)), paste0("VC", ".", c(1:6)), paste0("VD", ".", c(1:6)), paste0("VE", ".", c(1:6)))
x <- DGEList(counts=counts)

# Simulate possible confounding variables
exVars <- data.frame("Sample" = colnames(counts), "Lane" = sample(c("L1", "L2"), 48, replace=T), "Day" = sample(c("1", "2"), 48, replace=T), "Mortality" = runif(48,0,1), "LogVirus1" = runif(48,2,7), "LogVirus2" = runif(48,2,7), "rnaConc" = runif(48,50,400), "RIN" = runif(48,7,10))

x$samples$group <- as.factor(unlist(lapply(as.character(exVars$Sample), function(x) substring(unlist(strsplit(x, "[.]"))[1],1)))) x$samples$virus <- as.factor(unlist(lapply(as.character(exVars$Sample), function(x) substring(unlist(strsplit(x, "[.]"))[1],1,1))))
x$samples$diet <- as.factor(unlist(lapply(as.character(exVars$Sample), function(x) substring(unlist(strsplit(x, "[.]"))[1],2,2)))) x$samples$lane <- exVars$Lane
x$samples$day <- exVars$Day x$samples$mortality <- exVars$Mortality
x$samples$lv1 <- exVars$LogVirus1 x$samples$lv2 <- exVars$LogVirus2
x$samples$rnaConc <- exVars$rnaConc x$samples$rin <- exVars$RIN

# Filter out genes that have low read counts (doesn't filter out any genes in this MWE, but does filter out genes in my real data)
keep.exprs <- which(rowSums(x[[1]]>10)>24)
x <- x[keep.exprs,, keep.lib.sizes=FALSE]
counts <- x[[1]]
design <- model.matrix(~0 + x$samples$group)
colnames(design) <- levels(x$samples$group)
cont_matrix <- makeContrasts(NBvsNC = NB-NC, NBvsND = NB-ND, NBvsNE = NB-NE, NBvsVB = NB-VB, NBvsVC = NB-VC, NBvsVD = NB-VD, NBvsVE = NB-VE, NCvsND = NC-ND, NCvsNE = NC-NE, NCvsVB = NC-VB, NCvsVC = NC-VC, NCvsVD = NC-VD, NCvsVE = NC-VE, NDvsNE = ND-NE, NDvsVB = ND-VB, NDvsVC = ND-VC, NDvsVD = ND-VD, NDvsVE = ND-VE, NEvsVB = NE-VB, NEvsVC = NE-VC, NEvsVD = NE-VD, NEvsVE = NE-VE,   VBvsVC = VB-VC, VBvsVD = VB-VD, VBvsVE = VB-VE, VCvsVD = VC-VD, VCvsVE = VC-VE, VDvsVE = VD-VE, levels = colnames(design))

################# No blocking
# Creates sigGenes0, genePval0)
y <- DGEList(counts)
y <- calcNormFactors(y, method = "TMM")
v <- voom(y, design)
fit <- lmFit(v, design)
fit2 <- contrasts.fit(fit, cont_matrix)
fit2 <- eBayes(fit2)

pairNames <- colnames(cont_matrix)
sigGenes0 <- list()
genePval0 <- list()
for (i in 1:length(pairNames)) {
temp <- topTreat(fit2, coef=i, n=Inf)
sigRows <- which(temp$adj.P.Val<0.05) sigGenes0[[ pairNames[i] ]] <- nrow(temp[sigRows,]) genePval0[[ pairNames[i] ]] <- temp$adj.P.Val
}

################# Double voom (voom runs twice)
# Creates laneResults, dayResults, mortalityResults, lv1Results, lv2Results, rnaConcResults, rinResults
doCorr <- function(variable){
y <- DGEList(counts)
y <- calcNormFactors(y, method = "TMM")
v <- voom(y, design)
corfit <- duplicateCorrelation(v, design, block = x$samples[[variable]]) v <- voom(y, design, block = x$samples[[variable]], correlation = corfit$consensus) fit <- lmFit(v, design, block = x$samples[[variable]], correlation = corfit$consensus) fit2 <- contrasts.fit(fit, cont_matrix) fit2 <- eBayes(fit2) pairNames <- colnames(cont_matrix) sigGenes2 <- list() genePval2 <- list() corr2 <- list() for (i in 1:length(pairNames)) { temp <- topTreat(fit2, coef=i, n=Inf) sigRows <- which(temp$adj.P.Val<0.05)
sigGenes2[[ pairNames[i] ]] <- nrow(temp[sigRows,])
genePval2[[ pairNames[i] ]] <- temp$adj.P.Val corr2[[pairNames[i]]] <- cor(genePval0[[pairNames[i]]], genePval2[[pairNames[i]]]) } return(list(overallCorr = corfit$consensus, corrPairs = corr2, sigPairs = sigGenes2))
}

laneResults <- doCorr("lane")
dayResults <- doCorr("day")
mortalityResults <- doCorr("mortality") #Error
lv1Results <- doCorr("LogVirus1")
lv2Results <- doCorr("LogVirus2")
rnaConcResults <- doCorr("rnaConc") #Error
rinResults <- doCorr("rin") #Error

################# Overall summary
# Create blockDEG, blockCorrPairs, blockOverallCorr
blockDEG = data.frame()
for (i in 1:length(pairNames)) {
row <- c(laneResults$sigPairs[[pairNames[i]]], dayResults$sigPairs[[pairNames[i]]], lv1Results$sigPairs[[pairNames[i]]], lv2Results$sigPairs[[pairNames[i]]])
blockDEG <- rbind(blockDEG, row)
}
rownames(blockDEG) <- pairNames
colnames(blockDEG) <- c("lane", "day", "lv1", "lv2")

blockCorrPairs = data.frame()
for (i in 1:length(pairNames)) {
row <- c(laneResults$corrPairs[[pairNames[i]]], dayResults$corrPairs[[pairNames[i]]], lv1Results$corrPairs[[pairNames[i]]], lv2Results$corrPairs[[pairNames[i]]])
blockCorrPairs <- rbind(blockCorrPairs, row)
}
rownames(blockCorrPairs) <- pairNames
colnames(blockCorrPairs) <- c("lane", "day", "lv1", "lv2")

blockOverallCorr <- data.frame()
row <- c(laneResults$overallCorr, dayResults$overallCorr, lv1Results$overallCorr, lv2Results$overallCorr)
blockOverallCorr <- rbind(blockOverallCorr, row)
colnames(blockOverallCorr) <- c("lane", "day", "lv1", "lv2")

In my real data, blockOverallCorr ranged from between -0.22 to 0.09. For my blockCorrPairs object, my real data mostly showed all high correlations (0.99 or above). However, there were a very few number of treatment pairs that showed lower correlation (0.31) when a certain possible confounding variable was blocked. My blockDEG object showed mostly consistency for treatment pairs. In other words, treatment pairs that had 2000 DEGs seemed to remain that large even if a possible confounding variable was blocked, and treatment pairs that had 0 DEGs seemed to remain that small even if a possible confounding variable was blocked.

I strongly sense that my approach is flawed. My main goal is to determine how to account for these 7 possible confounding variables in my model when estimating the DEGs between the 28 treatment pairs. However, my current approach only works on each confounding variable being blocked individually (I cannot use the duplicateCorrelation() command on all confounding variables at the same time).

I am hoping to see that when all 7 possible confounding variables are weighted or blocked appropriately, then the DEGs in the 28 treatment pairs might not have such a large range from 0 to 2000, and the DEGs may look convincing (small variance in replicates and large variance between treatments). I apologize for such a long post, but would love to hear any advice on how I should rethink my approach. Thank you.

My session information is below:

> sessionInfo()
R version 3.3.2 (2016-10-31)
Platform: x86_64-apple-darwin13.4.0 (64-bit)
Running under: OS X Mavericks 10.9.5

attached base packages:
[1] stats4    parallel  stats     graphics  grDevices utils     datasets  methods
[9] base

other attached packages:
[1] readr_1.1.1                Glimma_1.2.1               rmarkdown_1.6
[4] EDASeq_2.8.0               ShortRead_1.32.1           GenomicAlignments_1.10.1
[7] SummarizedExperiment_1.4.0 Rsamtools_1.26.2           GenomicRanges_1.26.4
[10] GenomeInfoDb_1.10.3        Biostrings_2.42.1          XVector_0.14.1
[13] IRanges_2.8.2              S4Vectors_0.12.2           BiocParallel_1.8.2
[16] Biobase_2.34.0             BiocGenerics_0.20.0        GGally_1.3.1
[19] edgeR_3.16.5               limma_3.30.13              data.table_1.10.4
[22] tidyr_0.6.3                htmlwidgets_0.9            hexbin_1.27.1
[25] plotly_4.7.0               ggplot2_2.2.1.9000         shinydashboard_0.6.1
[31] bindrcpp_0.2               gapminder_0.2.0            dplyr_0.7.1
[34] jsonlite_1.5

loaded via a namespace (and not attached):
[1] colorspace_1.3-2       hwriter_1.3.2          rprojroot_1.2
[4] htmlTable_1.9          base64enc_0.1-3        bit64_0.9-7
[7] AnnotationDbi_1.36.2   codetools_0.2-15       splines_3.3.2
[10] R.methodsS3_1.7.1      DESeq_1.26.0           geneplotter_1.52.0
[13] knitr_1.16             Formula_1.2-1          annotate_1.52.1
[16] cluster_2.0.6          R.oo_1.21.0            packagedocs_0.4.0
[19] httr_1.2.1             backports_1.1.0        assertthat_0.2.0
[22] Matrix_1.2-10          lazyeval_0.2.0         acepack_1.4.1
[25] htmltools_0.3.6        tools_3.3.2            gtable_0.2.0
[28] glue_1.1.1             Rcpp_0.12.11           highlight_0.4.7.1
[31] cellranger_1.1.0       rtracklayer_1.34.2     crosstalk_1.0.1
[34] stringr_1.2.0          mime_0.5               devtools_1.13.2
[37] statmod_1.4.30         XML_3.98-1.9           zlibbioc_1.20.0
[40] scales_0.4.1.9000      aroma.light_3.4.0      hms_0.3
[43] RColorBrewer_1.1-2     yaml_2.1.14            curl_2.7
[46] gridExtra_2.2.1        memoise_1.1.0          biomaRt_2.30.0
[49] rpart_4.1-11           reshape_0.8.6          latticeExtra_0.6-28
[52] stringi_1.1.5          RSQLite_2.0            genefilter_1.56.0
[55] checkmate_1.8.3        GenomicFeatures_1.26.4 rlang_0.1.1
[58] pkgconfig_2.0.1        matrixStats_0.52.2     bitops_1.0-6
[61] evaluate_0.10.1        lattice_0.20-35        purrr_0.2.2.2
[64] bindr_0.1              labeling_0.3           lazyrmd_0.2.0
[67] bit_1.1-12             plyr_1.8.4             magrittr_1.5
[70] DESeq2_1.14.1          R6_2.2.2               Hmisc_4.0-3
[73] DBI_0.7                whisker_0.3-2          foreign_0.8-69
[76] withr_1.0.2            survival_2.41-3        RCurl_1.95-4.8
[79] nnet_7.3-12            tibble_1.3.3           locfit_1.5-9.1
[82] grid_3.3.2             blob_1.1.0             digest_0.6.12
[85] xtable_1.8-2           httpuv_1.3.5           R.utils_2.5.0
[88] munsell_0.4.3          viridisLite_0.2.0     
modified 2.4 years ago • written 2.4 years ago by LR030610
Answer: Accounting for both categorical and numerical possible confounding variables in
3
2.4 years ago by
Aaron Lun25k
Cambridge, United Kingdom
Aaron Lun25k wrote:

Woah, stop. Blindly flailing around won't do a lot of good here. You say that "It was surprising to see certain treatment pairs have 0 DEGs, especially if the expectation for that particular treatment pair was to show many DEGs". Well, presumably you have some known genes that are expected to be DE between treatment groups, i.e., positive controls. I would suggest having a closer look at their expression profiles to see if there is (i) any "convincing" DE between the relevant groups, or (ii) any systematic association with the uninteresting variables. If not, then it suggests that your expected effect just isn't present, and there's little you can do.

Furthermore, I don't think your MWE is actually representative of your situation. You say that the uninteresting factors are confounded with your groups of interest, but you randomly draw them when you make exVars; as it is now, it's very unlikely that the random factors are confounding. If they're not confounding, then you could just put them in the design matrix directly. This won't lose a lot of residual d.f. and would probably work better than duplicateCorrelation, as there's not much information for estimating the correlation with only two unique levels for Day and Lane. (As you've noticed, you'll get errors for real-valued covariates because the function will treat each value as its own level.)

As for which factors to regress out; I would only consider the purely technical ones, such as the sequencing lane and day. The others are more likely to overlap with the biology of interest, especially in the case of virus versus no-virus comparisons. For example, the viral concentrations will obviously be higher in the infected groups than in the uninfected groups - putting in the concentration as a blocking factor would effectively absorb any DE between virus/no-virus groups and prevent rejection of the null.

Finally, your filtering strategy is very aggressive, requiring genes to be expressed with at least 10 reads in at least 24 samples. What if only one group (of 6 replicate samples) expresses a gene, and it is silent for all other groups? This is a DE gene but would get (incorrectly) tossed out. Instead, you should ask for a count of at least 10 in at least 6 samples - the size of each group.

P.S. And it's good practice to use the latest package versions. We're on Bioconductor 3.5 running on R 3.4.0.

ADD COMMENTlink modified 2.4 years ago • written 2.4 years ago by Aaron Lun25k
Answer: Accounting for both categorical and numerical possible confounding variables in
0
2.4 years ago by
LR030610
LR030610 wrote:

Thank you for your detailed and helpful response. To be honest, I do feel that I am blindly flailing around :o) with this next step, which is trying to account for possible confounding variables. In the meantime, I did notice from your post that there is information that I did not provide clearly, and could be related to how I approach this next step:

1) I do not currently have any known genes that I expect to show differential expression. However, I was expecting certain treatment pairs to result in >0 DEGs because they showed different morbidity (which should relate to gene expression changes). However, some pairs that showed very different morbidity resulted in 0 DEGs.

2) I tried to visually examine the validity of the DEGs. For example, in the image (https://ibb.co/dWYoPa), I compared the non-virus/best-diet group (called "NP" here) versus the virus/worst-diet group (called "VS" here). This is a treatment pair where I would expect DEGs (as both factor levels are at opposite ends) and that did result in thousands of DEGs. As a side note, though, the morbidity differences were not as clean between these two treatment pairs (mostly because there was an NP outlier that had unexpectedly high morbidity).

In the image, I used a background side-by-side boxplot that shows the distribution of all logged counts across the 12 samples of NP and VS. Then, I overlaid (in orange) the log counts of the DEG with the lowest adjusted-p-value. The DEG seems rather flat, especially compared to a previous dataset for which I applied this same approach (https://ibb.co/b6gxHv). Of course, I am now working with honeybees, whereas before I was working with soybeans which can be controlled to a much greater degree. Regardless, the flatness in the current honeybee dataset concerns me. I scrolled through many of the DEGs and did not see many that "looked" as would be expected for a DEG (flat between replicates but different between treatments). This all made me concerned that most of the thousands of DEGs are not reliable.

3) I should clarify that the reason I am concerned about the possible confounding variables is that they showed outliers. As I mentioned in my original post, this was the case even with the two background virus levels (in my real data, they are called 'sbv' and 'iapv'). These are log scaled estimations of the number of viruses in a sample before inoculation. To illustrate, I posted an image (https://ibb.co/hsh1AF) that shows the 'sbv' (the left plot) and 'iapv' (the right plot) for the 8 treatment groups. In each plot, the non-virus groups are the four boxes on the left, and the virus groups are the four boxes on the right. The 'iapv' distribution seemed to separate the non-virus and virus groups, but the 'spv' group has substantial overlap/outliers. There are also outliers and overlap in some morbidity values (image not posted). By the way, I spoke with the scientists who collected the data, and they said this overlap and outliers for background viral counts before inoculation is not rare (some honeybees have background virus levels that just happen to be large). Moreover, these background virus counts are not always related to morbidity.

4) I also removed each of the 48 samples individually, and reran the DEG analysis. For one sample, its removal caused the number of DEGs between a treatment pair to decrease from thousands to 3. This sample was not the largest outlier (but was sort of an outlier) in the MDS plot. It had somewhat lowish IAPV virus background levels, but so did other samples that did not cause such a dramatic change in DEG numbers when removed. A few other samples that did not appear as outliers in the MDS plot also caused change in the number of DEGs (but less dramatically so). Beside those cases, though, removal of any one sample did not cause much difference in number of DEGs between treatment pairs.

5) Thank you, I noted your comments about filtering, and updated my OS and R Version as follows:

R version 3.4.1 (2017-06-30)

Platform: x86_64-apple-darwin15.6.0 (64-bit)

Running under: macOS Sierra 10.12.5

6) I see in your post that you recommend (a) Not putting any confounding variables in duplicateCorrelation(), (b) Regressing only on sequencing lane and day. I wonder with my new information, if this would still be recommended? If, instead, the other possible confounding variables should be accounted for, how might I handle them (morbidity is a proportion, and the viruses are *logged* values)? Is it advisable to use them in duplicateCorrelation() and/or regression? I am thinking that because the viruses are *logged* values, they make act more like categorical variables.

I apologize for such a long post. Again, I am very thankful for any support or advice from anyone.

1. Well, it is what it is. For these pairs, the data's telling you that there is no association between morbidity and expression. There's no way to tell if this is caused by some failure in analysis without a positive control.

2. The first plot is opaque to me. Why do you have boxplots across all genes, when you're only interested in a single gene? Are these log-counts, instead of counts? Moreover, you will find that the DE is a lot more convincing if you focus the y-axis to the relevant scale, rather than forcing it to (0, 20). The log-fold change reported by limma is 0.5, so obviously the DE will look flat if your y-axis ranges are too large.

3. If these factors have any effect, you'll have to accept them as part of the biological variability of your system. Clearly they are different between groups, so any attempt to regress them out will wipe out DE between groups.

4. I find this is difficult to imagine, but it suggests that the a single sample is driving the majority of DE between those treatment pairs. You can provide some protection by setting robust=TRUE in eBayes to avoid shrinkage of large variances. You can also replace voom with voomWithQualityWeights to downweight low-quality samples.

6. If your groups differ systematically in virus concentration and morbidity, it would be madness to try to regress them out.