Question

limma + voom + sva + complex RNA-seq design?

0

Entering edit mode

irenegallego • 0

@irenegallego-11361

Last seen 9.2 years ago

Dear Bioconductor list,

I'm trying to control for multiple batch confounders in a small RNA-seq experiment. The set-up is a bit complicated and far from ideal, as it involves more than one species and is split across multiple library preparation batches, and two sequencing batches; individual replicates are at the truly biological level and represent different runs of the same experimental conditions. The first sequencing batch contains multiple library preps, one of which (5) is confounded with individual. The second sequencing batch contains only one library prep batch across multiple individuals:

     line species ind seqBatch batch
2    1_S5       1   1        1     3
3    1_S6       1   1        1     6
4    1_S3       1   1        2     7
5    1_S4       1   1        2     7
6    2_S2       1   2        2     7
7    2_S1       1   2        1     1
8    2_S2       1   2        1     6
9    2_S3       1   2        1     2
10   2_S5       1   2        2     7
11   2_S4       1   2        1     6
12   3_S8       2   3        1     4
13   3_S9       2   3        1     6
14   4_S5       2   4        2     7
15   4_S6       2   4        2     7
16   4_S7       2   4        2     7
17  5_S10       2   5        1     5
18  5_S11       2   5        1     5
19  5_S12       2   5        1     5

I am interested in robustly identifying genes that are DE between the two species while controlling for batch effects. Both seqBatch and batch are associated with some of the higher principal components in the data, but not as significantly as species or individual.

My questions:
1. Because one of the levels of batch is perfectly confounded with sequencingBatch, a model matrix of the form "~ 0 + species + seqBatch + batch" fails in voom (Coefficients not estimable: as.factor(samplesMeta$batch)7 ). I understand why this happens; I'm wondering if anyone has any experience using something like SVA or similar to partition this inaccessible variation into multiple surrogate variables, whether this is an invalid approach (it sounds somewhat shady), and whether I've missed other ways of trying to deal with this problem in my frantic google adventures.

2. On the subject of SVA, a previous post suggested a way of combining voom/limma with SVA (https://support.bioconductor.org/p/54723/), with log2 CPM being acceptable to SVA. However, blocking complicates things. Right now I repeat the duplicateCorrelation call as suggested in other posts to this list:

design <- model.matrix(~ 0 + samplesMeta$species + as.factor(samplesMeta$seqBatch))
designNull <- model.matrix(~0 + as.factor(samplesMeta$seqBatch))

cpmNormLoess <- voom(mergedCountsDgeFilt, design, normalize.method="cyclicloess", plot=F) 
cpmCorfit <- duplicateCorrelation(cpmNormLoess, design, block=individual)
cpmNormLoess <- voom(mergedCountsDgeFilt, design, normalize.method="cyclicloess", plot=T, correlation=cpmCorfit$consensus, block=individual) 
cpmCorfit <- duplicateCorrelation(cpmNormLoess, design, block=individual)
cpmCorfit$consensus
[1] 0.4747629

cpmSVA <- sva(cpmNormLoess$E, design, designNull, n.sv=10)

Calling SVA after these steps resulted in cpmCorfit$consensus > 0.99 and generalised convergence failures in voom across many many genes when updating design to include the inferred surrogate variables, when before there were no failures. On the basis of all of this, is combining sva and a complex limma setup feasible or should I quit while I'm ahead?

Any help or intuition on either point would be most welcome.

Cheers,
Irene

sessionInfo()
R version 3.3.1 (2016-06-21)
Platform: x86_64-pc-linux-gnu (64-bit)
Running under: CentOS release 6.6 (Final)

locale:
 [1] LC_CTYPE=en_US.UTF-8       LC_NUMERIC=C               LC_TIME=en_US.UTF-8        LC_COLLATE=en_US.UTF-8     LC_MONETARY=en_US.UTF-8    LC_MESSAGES=en_US.UTF-8    LC_PAPER=en_US.UTF-8      
 [8] LC_NAME=C                  LC_ADDRESS=C               LC_TELEPHONE=C             LC_MEASUREMENT=en_US.UTF-8 LC_IDENTIFICATION=C       

attached base packages:
 [1] grid      stats4    parallel  stats     graphics  grDevices utils     datasets  methods   base     

other attached packages:
 [1] sva_3.20.0               genefilter_1.54.2        mgcv_1.8-16              nlme_3.1-128             BiocInstaller_1.22.3     plyr_1.8.4               ape_4.0                 
 [8] variancePartition_1.2.11 foreach_1.4.3            ggplot2_2.1.0            VennDiagram_1.6.17       futile.logger_1.4.3      qvalue_2.4.2             goseq_1.24.0            
[15] geneLenDataBase_1.8.0    BiasedUrn_1.07           org.Hs.eg.db_3.3.0       topGO_2.24.0             SparseM_1.74             GO.db_3.3.0              AnnotationDbi_1.34.4    
[22] IRanges_2.6.1            S4Vectors_0.10.3         Biobase_2.32.0           graph_1.50.0             BiocGenerics_0.18.0      biomaRt_2.28.0           beeswarm_0.2.3          
[29] statmod_1.4.27           edgeR_3.14.0             limma_3.28.21            RColorBrewer_1.1-2       gplots_3.0.1            

loaded via a namespace (and not attached):
 [1] splines_3.3.1              gtools_3.5.0               Rsamtools_1.24.0           RSQLite_1.1-1              lattice_0.20-34            GenomicRanges_1.24.3       XVector_0.12.1            
 [8] minqa_1.2.4                colorspace_1.3-1           Matrix_1.2-7.1             XML_3.98-1.5               zlibbioc_1.18.0            xtable_1.8-2               scales_0.4.0              
[15] gdata_2.17.0               BiocParallel_1.6.6         lme4_1.1-12                annotate_1.50.1            SummarizedExperiment_1.2.3 GenomicFeatures_1.24.5     pbkrtest_0.4-6            
[22] survival_2.40-1            magrittr_1.5               doParallel_1.0.10          MASS_7.3-45                tools_3.3.1                matrixStats_0.51.0         stringr_1.1.0             
[29] munsell_0.4.3              lambda.r_1.1.9             Biostrings_2.40.2          colorRamps_2.3             GenomeInfoDb_1.8.7         caTools_1.17.1             RCurl_1.95-4.8            
[36] nloptr_1.0.4               iterators_1.0.8            bitops_1.0-6               gtable_0.2.0               codetools_0.2-15           DBI_0.5-1                  reshape2_1.4.2            
[43] GenomicAlignments_1.8.4    rtracklayer_1.32.2         futile.options_1.0.0       KernSmooth_2.23-15         stringi_1.1.2              Rcpp_0.12.8

limma sva batch effect • 3.2k views

ADD COMMENT • link 9.2 years ago irenegallego • 0

score 1 · Answer 1 · 2016-12-12

1

Entering edit mode

Aaron Lun ★ 29k

@alun

Last seen 2 hours ago

The city by the bay

For your first question, you don't need seqBatch if you have batch in your model. This is because the latter is a more specific version of the former. If you build your design matrix like:

design <- model.matrix(~0 + species + batch)

... and then block on Individual in duplicateCorrelation, then you shouldn't have any problems. However, only samples in batches 6 or 7 will be used in the DE analysis between species. This is because all other batches are nested within a single species, meaning that the batch blocking factors will absorb any change in expression for their samples. Similarly, only samples in batches 5-7 will be used for variance estimation.

Thus, it may be worth asking whether or not the batch effect is substantive enough to warrant modelling. For example, is there strong separation of samples by batch on a MDS plot? Does the number of DE genes increase substantially when you block on batch? If not, then you could consider just dropping batch from your model, or replacing it with the simpler seqBatch. This ensures that information from all samples will be used during variance estimation and hypothesis testing.

For your second question, I'm not sure that sva is what you need here - at least, not at this stage. sva is designed to identify unknown factors of variation; your current problem is in fitting all of your known factors into a single analysis.

ADD COMMENT • link 9.2 years ago Aaron Lun ★ 29k

0

Entering edit mode

I would just add to Aaron's comments that: (a) You should exercise pretty serious caution in interpreting the results if you decide to ignore batch. Batch effects have been one of the major sources of retraction in the scientific literature - if the results of your batch-ignoring analysis don't match with the batch-modeling analysis then you should be extra extra cautious. (b) sva can be used even when batch factors are "known" as it is very common for there to be sources of signal in the data not entirely modeled by the recorded batch variable. depending on how well the meta-data is annotated, it is often useful to try sva without modeling the annotated batch variable and see if you come to similar conclusions. you would first run sva with the design matrix above, leaving out the batch variable, and then including the estimated svs as covariates. Best Jeff On Mon, Dec 12, 2016 at 7:59 AM Aaron Lun [bioc] <noreply@bioconductor.org> wrote: > Activity on a post you are following on support.bioconductor.org > > User Aaron Lun <https: support.bioconductor.org="" u="" 6732=""/> wrote Answer: > limma + voom + sva + complex RNA-seq design? > <https: support.bioconductor.org="" p="" 90246="" #90249="">: > > For your first question, you don't need seqBatch if you have batch in > your model. This is because the latter is a more specific version of the > former. If you build your design matrix like: > > design <- model.matrix(~0 + species + batch) > > ... and then block on Individual in duplicateCorrelation, then you > shouldn't have any problems. However, only samples in batches 6 or 7 will > be used in the DE analysis between species. This is because all other > batches are nested within a single species, meaning thatthe batch blocking > factors will absorb any change in expression for their samples. Similarly, > only samples in batches 5-7 will be used for variance estimation. > > Thus, it may be worth asking whether or not the batch effect is > substantive enough to warrant modelling. For example, is there strong > separation of samples by batch on a MDS plot? Does the number of DE genes > increase substantially when you include a batch term? If not, then you > could consider just dropping batch (or only using seqBatch) in your > model. This ensures that information from all samples will be used during > the DE analysis. > > For your second question, I'm not sure that *sva* is what you need here. > *sva* is designed to identify unknown factors of variation; your problem > is in fitting all of your known factors into a single analysis. > ------------------------------ > > Post tags: limma, sva, batch effect > > You may reply via email or visit > A: limma + voom + sva + complex RNA-seq design? >

ADD REPLY • link 9.2 years ago Jeff Leek ▴ 650

0

Entering edit mode

Hi Jeff,

Thanks so much for your reply - I have zero desire to ignore batch effects, I'm certainly well aware of their importance. I didn't get a say in the experimental set up this time around, so right now I'm trying to tease out the best possible solution from the current data set up.

I tried running SVA on my data and the results were... suboptimal, as I mention in the original post - I don't know if the combination of sva and duplicateCorrelation from limma is making things iffy or similar, I couldn't find much on the subject in the archives here. But actually, to follow up on your answer to Meritxell from years ago which I linked above, because I compare data across species I have to use RPKM, which lets me account for the fact that genes are not the same length. sva gets log2cpm from edgeR, which it is happy enough with; in practice once I'm done normalising and calculating empirical weights with voom I transform the log2 CPM into log2 RPKM, which within limma is fairly robust, going from previous posts here. Provided I can get SVA to work, do you think this will break my results?

ADD REPLY • link 9.2 years ago irenegallego • 0

0

Entering edit mode

Agreed on (a). Better to be safe than sorry for this data set, and include batch in your design even if it renders some of your samples useless. This can be avoided next time with better experimental design.

ADD REPLY • link 9.2 years ago Aaron Lun ★ 29k

0

Entering edit mode

Hi Aaron,

Thanks very much for this - the design is definitely suboptimal, yes. I considered ComBat over SVA but I was hoping for a way to work around the confounding of the covariates, which is why I thought SVA might be worth a look.

In answer to your suggestion - both batch and seqBatch are significantly associated with high PCs. batch is significantly associated with PC1 (p = 0.014) whereas seqBatch is associated with PC2 (p = 0.00064). Both species and individual are much more strongly associated with PC1, and, in the case of individual only, also PC2, and the samples clearly separate by species before anything else as one would hope given previous interspecies studies. The first 2 PCs account for ~50% of variation in the data (PC1: 30%, PC2, 22%); given all of that, my intuition is to correct for sequencing batch over library prep batch so I can retain more samples; I'm glad to hear I don't sound totally crazy. The DE results are definitely better with seqBatch than without it, although since we're talking interspecies comparisons we're looking at ~2000 DE genes out of 12000 total no matter what, so in some ways, the differences between methods are kind of uninteresting - there's too many genes to follow up on no matter what!

Cheers,
Irene

ADD REPLY • link 9.2 years ago irenegallego • 0

0

Entering edit mode

Regardless of what the PCA might be suggesting, if you include batch, there is no need to include seqBatch in the same design matrix. The latter is nested within the former, so if there are any effects due to sequencing batch, they will be modelled by estimation of the batch blocking factors.

ADD REPLY • link 9.2 years ago Aaron Lun ★ 29k