Dear Bioconductor list,
I'm trying to control for multiple batch confounders in a small RNA-seq experiment. The set-up is complicated and far from ideal: it involves more than one species, and samples are split across several library preparation batches and two sequencing batches. Replicates are at the truly biological level, i.e. different runs of the same experimental conditions. The first sequencing batch contains multiple library preps, one of which (batch 5) is completely confounded with individual. The second sequencing batch contains only one library prep batch across multiple individuals:
    line species ind seqBatch batch
2   1_S5       1   1        1     3
3   1_S6       1   1        1     6
4   1_S3       1   1        2     7
5   1_S4       1   1        2     7
6   2_S2       1   2        2     7
7   2_S1       1   2        1     1
8   2_S2       1   2        1     6
9   2_S3       1   2        1     2
10  2_S5       1   2        2     7
11  2_S4       1   2        1     6
12  3_S8       2   3        1     4
13  3_S9       2   3        1     6
14  4_S5       2   4        2     7
15  4_S6       2   4        2     7
16  4_S7       2   4        2     7
17 5_S10       2   5        1     5
18 5_S11       2   5        1     5
19 5_S12       2   5        1     5
I am interested in robustly identifying genes that are DE between the two species while controlling for batch effects. Both seqBatch and batch are associated with some of the higher principal components in the data, but not as significantly as species or individual.
My questions:
1. Because one level of batch is perfectly confounded with seqBatch, a model matrix of the form "~ 0 + species + seqBatch + batch" fails in voom ("Coefficients not estimable: as.factor(samplesMeta$batch)7"). I understand why this happens; I'm wondering whether anyone has experience using something like SVA to partition this inaccessible variation into multiple surrogate variables, whether that is an invalid approach (it sounds somewhat shady), and whether I've missed other ways of dealing with this problem in my frantic Google adventures.
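The aliasing is visible in base R from the sample table above. The `meta` object below is retyped from that table for illustration; it is not the original `samplesMeta`:

```r
# Sketch: reproduce the rank deficiency with base R. Factors are retyped
# from the sample table in the post; names are illustrative.
meta <- data.frame(
  species  = factor(c(1,1,1,1, 1,1,1,1,1,1, 2,2, 2,2,2, 2,2,2)),
  ind      = factor(c(1,1,1,1, 2,2,2,2,2,2, 3,3, 4,4,4, 5,5,5)),
  seqBatch = factor(c(1,1,2,2, 2,1,1,1,2,1, 1,1, 2,2,2, 1,1,1)),
  batch    = factor(c(3,6,7,7, 7,1,6,2,7,6, 4,6, 7,7,7, 5,5,5))
)
design <- model.matrix(~ 0 + species + seqBatch + batch, data = meta)
qr(design)$rank < ncol(design)  # TRUE: the batch7 column equals the seqBatch2 column
```

Because every sample in library prep 7 is in sequencing batch 2 and vice versa, the two indicator columns are identical and one coefficient cannot be estimated, which is exactly what voom reports.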
2. On the subject of SVA, a previous post suggested a way of combining voom/limma with SVA (https://support.bioconductor.org/p/54723/), with log2 CPM being acceptable to SVA. However, blocking complicates things. Right now I repeat the duplicateCorrelation call as suggested in other posts to this list:
design <- model.matrix(~ 0 + samplesMeta$species + as.factor(samplesMeta$seqBatch))
designNull <- model.matrix(~ 0 + as.factor(samplesMeta$seqBatch))
cpmNormLoess <- voom(mergedCountsDgeFilt, design, normalize.method="cyclicloess", plot=F)
cpmCorfit <- duplicateCorrelation(cpmNormLoess, design, block=individual)
cpmNormLoess <- voom(mergedCountsDgeFilt, design, normalize.method="cyclicloess", plot=T, correlation=cpmCorfit$consensus, block=individual)
cpmCorfit <- duplicateCorrelation(cpmNormLoess, design, block=individual)
cpmCorfit$consensus
[1] 0.4747629
cpmSVA <- sva(cpmNormLoess$E, design, designNull, n.sv=10)
Calling sva after these steps resulted in cpmCorfit$consensus > 0.99 and widespread convergence failures in voom across many genes once I updated the design to include the inferred surrogate variables; before that there were no failures. Given all of this, is combining sva with a complex limma set-up feasible, or should I quit while I'm ahead?
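For reference, the usual way of folding surrogate variables into a limma analysis is to append them as extra columns of the design matrix before repeating the voom/duplicateCorrelation cycle. A minimal base-R sketch of the augmentation step, with simulated matrices standing in for the real `design` and `cpmSVA$sv` objects:

```r
# Sketch: surrogate variables enter the model as extra design columns.
# Simulated stand-ins below; in the real workflow `design` is the
# model.matrix above and `sv` is cpmSVA$sv.
design <- matrix(rnorm(18 * 3), nrow = 18,
                 dimnames = list(NULL, c("species1", "species2", "seqBatch2")))
sv <- matrix(rnorm(18 * 2), nrow = 18)   # stand-in for cpmSVA$sv
colnames(sv) <- paste0("SV", seq_len(ncol(sv)))
designSV <- cbind(design, sv)            # augmented design: 3 + 2 columns
ncol(designSV)
```

After augmenting, voom, duplicateCorrelation and lmFit would be re-run with `designSV` in place of `design`; whether that interacts sensibly with the blocking here is exactly the open question in the post.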
Any help or intuition on either point would be most welcome.
Cheers,
Irene
sessionInfo()
R version 3.3.1 (2016-06-21)
Platform: x86_64-pc-linux-gnu (64-bit)
Running under: CentOS release 6.6 (Final)

locale:
[1] LC_CTYPE=en_US.UTF-8 LC_NUMERIC=C LC_TIME=en_US.UTF-8 LC_COLLATE=en_US.UTF-8 LC_MONETARY=en_US.UTF-8 LC_MESSAGES=en_US.UTF-8 LC_PAPER=en_US.UTF-8
[8] LC_NAME=C LC_ADDRESS=C LC_TELEPHONE=C LC_MEASUREMENT=en_US.UTF-8 LC_IDENTIFICATION=C

attached base packages:
[1] grid stats4 parallel stats graphics grDevices utils datasets methods base

other attached packages:
[1] sva_3.20.0 genefilter_1.54.2 mgcv_1.8-16 nlme_3.1-128 BiocInstaller_1.22.3 plyr_1.8.4 ape_4.0
[8] variancePartition_1.2.11 foreach_1.4.3 ggplot2_2.1.0 VennDiagram_1.6.17 futile.logger_1.4.3 qvalue_2.4.2 goseq_1.24.0
[15] geneLenDataBase_1.8.0 BiasedUrn_1.07 org.Hs.eg.db_3.3.0 topGO_2.24.0 SparseM_1.74 GO.db_3.3.0 AnnotationDbi_1.34.4
[22] IRanges_2.6.1 S4Vectors_0.10.3 Biobase_2.32.0 graph_1.50.0 BiocGenerics_0.18.0 biomaRt_2.28.0 beeswarm_0.2.3
[29] statmod_1.4.27 edgeR_3.14.0 limma_3.28.21 RColorBrewer_1.1-2 gplots_3.0.1

loaded via a namespace (and not attached):
[1] splines_3.3.1 gtools_3.5.0 Rsamtools_1.24.0 RSQLite_1.1-1 lattice_0.20-34 GenomicRanges_1.24.3 XVector_0.12.1
[8] minqa_1.2.4 colorspace_1.3-1 Matrix_1.2-7.1 XML_3.98-1.5 zlibbioc_1.18.0 xtable_1.8-2 scales_0.4.0
[15] gdata_2.17.0 BiocParallel_1.6.6 lme4_1.1-12 annotate_1.50.1 SummarizedExperiment_1.2.3 GenomicFeatures_1.24.5 pbkrtest_0.4-6
[22] survival_2.40-1 magrittr_1.5 doParallel_1.0.10 MASS_7.3-45 tools_3.3.1 matrixStats_0.51.0 stringr_1.1.0
[29] munsell_0.4.3 lambda.r_1.1.9 Biostrings_2.40.2 colorRamps_2.3 GenomeInfoDb_1.8.7 caTools_1.17.1 RCurl_1.95-4.8
[36] nloptr_1.0.4 iterators_1.0.8 bitops_1.0-6 gtable_0.2.0 codetools_0.2-15 DBI_0.5-1 reshape2_1.4.2
[43] GenomicAlignments_1.8.4 rtracklayer_1.32.2 futile.options_1.0.0 KernSmooth_2.23-15 stringi_1.1.2 Rcpp_0.12.8
Hi Jeff,
Thanks so much for your reply - I have zero desire to ignore batch effects, I'm certainly well aware of their importance. I didn't get a say in the experimental set up this time around, so right now I'm trying to tease out the best possible solution from the current data set up.
I tried running SVA on my data and the results were... suboptimal, as I mention in the original post. I don't know whether the combination of sva and limma's duplicateCorrelation is what's making things iffy; I couldn't find much on the subject in the archives here. But actually, to follow up on your answer to Meritxell from years ago (linked above): because I compare data across species, I have to use RPKM, which lets me account for the fact that genes are not the same length. sva gets log2 CPM from edgeR, which it is happy enough with; in practice, once I'm done normalising and calculating empirical weights with voom, I transform the log2 CPM into log2 RPKM, which within limma is fairly robust, going by previous posts here. Provided I can get SVA to work, do you think this will break my results?
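Since RPKM is CPM divided by gene length in kilobases, the transform on the log2 scale is just a per-gene subtraction. A toy base-R sketch (the matrix and length vector are made up; in the real workflow the matrix would be the voom EList's `$E` slot):

```r
# Sketch: log2 RPKM = log2 CPM - log2(gene length in kb).
# Toy data: 3 genes x 2 samples, lengths chosen so the result is constant.
log2cpm <- matrix(log2(c(8, 16, 32, 8, 16, 32)), nrow = 3)
geneLengthsBp <- c(1000, 2000, 4000)          # assumed per-gene lengths (bp)
log2rpkm <- log2cpm - log2(geneLengthsBp / 1000)
log2rpkm[, 1]
```

The subtraction shifts each gene by a constant, so gene-wise variances and within-gene contrasts are unchanged; only cross-gene (and, here, cross-species) comparability is affected.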
Agreed on (a). Better to be safe than sorry for this data set, and include batch in your design even if it renders some of your samples useless. This can be avoided next time with better experimental design.

Hi Aaron,
Thanks very much for this - the design is definitely suboptimal, yes. I considered ComBat over SVA but I was hoping for a way to work around the confounding of the covariates, which is why I thought SVA might be worth a look.
In answer to your suggestion: both batch and seqBatch are significantly associated with high PCs. batch is significantly associated with PC1 (p = 0.014), whereas seqBatch is associated with PC2 (p = 0.00064). Both species and individual are much more strongly associated with PC1 (and, in the case of individual only, also PC2), and the samples clearly separate by species before anything else, as one would hope given previous interspecies studies. The first two PCs account for ~50% of the variation in the data (PC1: 30%, PC2: 22%). Given all of that, my intuition is to correct for sequencing batch rather than library prep batch so I can retain more samples; I'm glad to hear I don't sound totally crazy. The DE results are definitely better with seqBatch than without it, although since we're talking about interspecies comparisons we're looking at ~2000 DE genes out of 12000 total no matter what, so in some ways the differences between methods are kind of uninteresting: there are too many genes to follow up on regardless!
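The PC-covariate p-values quoted above can come from a simple one-way ANOVA of each PC score on the factor. A base-R sketch on simulated data (the real expression matrix is not shown in the thread, so everything here is a stand-in):

```r
# Sketch: test association between a principal component and a batch factor
# via one-way ANOVA. Simulated data; dimensions loosely mirror the post.
set.seed(1)
exprs <- matrix(rnorm(100 * 18), nrow = 100)   # 100 genes x 18 samples
batch <- factor(rep(1:3, each = 6))            # illustrative 3-level factor
pcs <- prcomp(t(exprs))$x                      # rows = samples, cols = PCs
pval <- anova(lm(pcs[, 1] ~ batch))[["Pr(>F)"]][1]
```

Looping this over the first few PCs and each covariate gives the kind of PC-association table the post is describing.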
Cheers,
Irene
Regardless of what the PCA might be suggesting, if you include batch, there is no need to include seqBatch in the same design matrix. The latter is nested within the former, so if there are any effects due to sequencing batch, they will be modelled by estimation of the batch blocking factors.
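The nesting claim can be checked directly from the sample table: each library prep batch should occur in exactly one sequencing batch. A base-R sketch with the two columns retyped from the original post:

```r
# Sketch: verify seqBatch is nested within batch — every library-prep batch
# appears in exactly one sequencing batch. Vectors retyped from the table.
batch    <- c(3, 6, 7, 7, 7, 1, 6, 2, 7, 6, 4, 6, 7, 7, 7, 5, 5, 5)
seqBatch <- c(1, 1, 2, 2, 2, 1, 1, 1, 2, 1, 1, 1, 2, 2, 2, 1, 1, 1)
tab <- table(batch, seqBatch)
all(rowSums(tab > 0) == 1)  # TRUE: each batch maps to a single seqBatch
```

Because each batch level implies a unique seqBatch level, the column space of the batch indicators contains the seqBatch indicators, which is why adding seqBatch to a design that already has batch contributes nothing estimable.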