Question

Batch/method effect correction on a cohort of patients with RUVseq and DESeq2

0

Entering edit mode

Alexandre • 0

@85d3fe0b

Last seen 9 months ago

France

Hello everyone and sorry in advance for the long post, I need some advices !

We have been fighting with a heavy batch & method effect in a cohort of bulk RNAseq from patients. As you can see below on my PCA (generated using all genes), I have three groups of samples : green and blue for patient samples done at the same place with one technology, purple for patient samples done at an other place with another technology, and red for two types of normal cells to compare patient with. As you can see the batch and method effect is very important... When doing a pairplots, I see that this batch effect can be seen the first and second PCs.
enter image description here At first I wanted to correct for this effect by including the different cohort in the DESeq2 model with this code below, but the model is not full rank because some cases and mutations are only found in one cohort. Also control sample (red cohort) are all in another cohort. Here is my design table just below.

       Sample_id Disease Mutation_SPI1 Mutation_MYD Secondary_Mutation Complete_mutation                cohort cohort_for_RUVseq cohort_for_RUVseq_simpler
1            WM1      WM          SPI1          MYD               SPI1          MYD_SPI1            old_cohort        old_cohort                old_cohort
2           WM22      WM          none          MYD               none               MYD            old_cohort        old_cohort                old_cohort
3           WM23      WM          none          MYD               none               MYD            old_cohort        old_cohort                old_cohort
4           WM24      WM          none          MYD               none               MYD            old_cohort        old_cohort                old_cohort
5           WM26      WM          none          MYD              CXCR4         MYD_CXCR4            old_cohort        old_cohort                old_cohort
6           WM27      WM          none          MYD               none               MYD            old_cohort        old_cohort                old_cohort
7           WM29      WM          none          MYD       CXCR4_ARID1A  MYD_CXCR4_ARID1A            old_cohort        old_cohort                old_cohort
8           WM2a      WM          none          MYD               none               MYD            old_cohort        old_cohort                old_cohort
9           WM2b      WM          none          MYD               none               MYD            old_cohort        old_cohort                old_cohort
10          WM30      WM          none          MYD               none              none            old_cohort        old_cohort                old_cohort
11          WM31      WM          none          MYD              CXCR4         MYD_CXCR4            old_cohort        old_cohort                old_cohort
12          WM32      WM          none          MYD               none               MYD            old_cohort        old_cohort                old_cohort
13          WM33      WM          none          MYD              CXCR4         MYD_CXCR4            old_cohort        old_cohort                old_cohort
14          WM34      WM          none          MYD              CXCR4         MYD_CXCR4            old_cohort        old_cohort                old_cohort
15          WM36      WM          none          MYD               none               MYD            old_cohort        old_cohort                old_cohort
16          WM37      WM          none          MYD               none               MYD            old_cohort        old_cohort                old_cohort
17         WM41a      WM          none          MYD             ARID1A        MYD_ARID1A            old_cohort        old_cohort                old_cohort
18         WM41b      WM          none          MYD             ARID1A        MYD_ARID1A            old_cohort        old_cohort                old_cohort
19          WM42      WM          none          MYD               none               MYD            old_cohort        old_cohort                old_cohort
20          WM43      WM          none          MYD              CXCR4         MYD_CXCR4            old_cohort        old_cohort                old_cohort
21          WM44      WM          none          MYD               none              none            old_cohort        old_cohort                old_cohort
22          WM46      WM          none          MYD               none              none            old_cohort        old_cohort                old_cohort
23          WM47      WM          none          MYD               none               MYD            old_cohort        old_cohort                old_cohort
24          WM48      WM          none         none               none              none            old_cohort        old_cohort                old_cohort
25          WM49      WM          none         none               none               MYD            old_cohort        old_cohort                old_cohort
26          WM52      WM          SPI1          MYD         CXCR4_SPI1    MYD_CXCR4_SPI1            old_cohort        old_cohort                old_cohort
27          WM53      WM          none          MYD               none               MYD            old_cohort        old_cohort                old_cohort
28          WM54      WM          none          MYD               none              none            old_cohort        old_cohort                old_cohort
29          WM55      WM          none          MYD              CXCR4         MYD_CXCR4            old_cohort        old_cohort                old_cohort
30          WM62      WM          none          MYD               none               MYD            old_cohort        old_cohort                old_cohort
31           WM8      WM          none          MYD               none               MYD            old_cohort        old_cohort                old_cohort
32           WM9      WM          SPI1          MYD               SPI1          MYD_SPI1            old_cohort        old_cohort                old_cohort
33          WM91      WM       unknown      unknown            unknown           unknown          new_Novogene          Novogene                  Novogene
34          WM93      WM       unknown      unknown            unknown           unknown          new_Novogene          Novogene                  Novogene
35          WM94      WM       unknown      unknown            unknown           unknown          new_Novogene          Novogene                  Novogene
36          WM95      WM       unknown      unknown            unknown           unknown          new_Novogene          Novogene                  Novogene
37          WM96      WM       unknown      unknown            unknown           unknown          new_Novogene          Novogene                  Novogene
38          WM97      WM       unknown      unknown            unknown           unknown          new_Novogene          Novogene                  Novogene
39          WM84      WM       unknown      unknown            unknown           unknown intermediate_Novogene          Novogene                  Novogene
40          WM82      WM       unknown      unknown            unknown           unknown intermediate_Novogene          Novogene                  Novogene
41          WM86      WM       unknown      unknown            unknown           unknown intermediate_Novogene          Novogene                  Novogene
42          WM87      WM       unknown      unknown            unknown           unknown intermediate_Novogene          Novogene                  Novogene
43          WM90      WM       unknown      unknown            unknown           unknown intermediate_Novogene          Novogene                  Novogene
44          WM83      WM       unknown      unknown            unknown           unknown intermediate_Novogene          Novogene                  Novogene
45          WM89      WM       unknown      unknown            unknown           unknown intermediate_Novogene          Novogene                  Novogene
46          WM85      WM       unknown      unknown            unknown           unknown intermediate_Novogene          Novogene                  Novogene
47  switch_mem_1 Healthy    switch_mem   switch_mem         switch_mem        switch_mem             blueprint            memory                 blueprint
48  switch_mem_2 Healthy    switch_mem   switch_mem         switch_mem        switch_mem             blueprint            memory                 blueprint
49  switch_mem_3 Healthy    switch_mem   switch_mem         switch_mem        switch_mem             blueprint            memory                 blueprint
50  unswitch_mem Healthy  unswitch_mem unswitch_mem       unswitch_mem      unswitch_mem             blueprint            memory                 blueprint
51 plasma_cell_1 Healthy   plasma_cell  plasma_cell        plasma_cell       plasma_cell             blueprint       plasma_cell                 blueprint
52 plasma_cell_2 Healthy   plasma_cell  plasma_cell        plasma_cell       plasma_cell             blueprint       plasma_cell                 blueprint
53 plasma_cell_3 Healthy   plasma_cell  plasma_cell        plasma_cell       plasma_cell             blueprint       plasma_cell                 blueprint

So the idea we had was to use RUVseq, more precisely RUVg, to include correcting factor in the DESeq2 model, enabling us to correct the batch effect and compare freely samples between them. For RUVg, I tried different comparisons to determine a good dataset : purple vs only green and blue ; purple vs red vs green&blue ; purple vs green&blue vs red splitted in two (memory B cells and plasma cells). For the sake of clarity, I will only put below the code for the purple vs red vs green&blue. I tried different value of k for RUVg aswell. And then I know that I can include them like this in the DESeq2 model with "~ W1 + W2 + Complete_mutation". Here is my code :

dds <- DESeqDataSetFromTximport(txi.salmon,
                                design_file,
                                design = ~ cohort_for_RUVseq_simpler)

smallestGroupSize <- 3
keep <- rowSums(counts(dds) >= 10) >= smallestGroupSize
dds <- dds[keep,]

dds <- DESeq(dds)
vsd <- vst(dds, blind=FALSE)

res_oldcohort_novogene <- results(dds, contrast = c("cohort_for_RUVseq_simpler", "old_cohort", "Novogene"))
not_sig_oldcohort_novogene = rownames(res_oldcohort_novogene)[which(res_oldcohort_novogene$padj > 0.1)]

res_oldcohort_blueprint <- results(dds, contrast = c("cohort_for_RUVseq_simpler", "old_cohort", "blueprint"))
not_sig_oldcohort_blueprint = rownames(res_oldcohort_blueprint)[which(res_oldcohort_blueprint$padj > 0.1)]

res_Novogene_blueprint <- results(dds, contrast = c("cohort_for_RUVseq_simpler", "Novogene", "blueprint"))
not_sig_Novogene_blueprint = rownames(res_Novogene_blueprint)[which(res_Novogene_blueprint$padj > 0.1)]

not_sig_GLOBAL = Reduce(intersect,  list(not_sig_oldcohort_novogene = not_sig_oldcohort_novogene, 
                                         not_sig_oldcohort_blueprint = not_sig_oldcohort_blueprint, 
                                         not_sig_Novogene_blueprint = not_sig_Novogene_blueprint)) # 2653 genes

## Run RUVg sequencing

table_to_see = as.data.frame(counts(dds))

RUVg_test <- newSeqExpressionSet(counts(dds))
RUVg_test_upper_quantile_norm <- betweenLaneNormalization(RUVg_test, which="upper")

empirical <- rownames(RUVg_test_upper_quantile_norm)[ rownames(RUVg_test_upper_quantile_norm) %in% not_sig_GLOBAL ]

correction_RUVg <- RUVg(RUVg_test_upper_quantile_norm, empirical, k=2)
pdat <- pData(correction_RUVg)

My problem is the read-out to say whether or not the correction and the factors extracted from RUVg worked properly. I have generated PCA plots colorizing the points of the PCA with the first two factors from RUVg. For the PCA (ntop = 500 and all genes), I have this for W1 from RUVg. It looks better with only the top 500 most variable genes (second picture) but this is not what I want I think...

All genes enter image description here

I also extracted for the different comparisons the correlation between the W1 and W2 RUVg factors and the coordinate on PC1 and PC2 (PCA with all genes). This can be find on this table.

correlations         SIMPLER_comparison_simple
cor_W1_Dim1      0.172171
cor_W1_Dim2      0.6458019
cor_W2_Dim1      0.4765441
cor_W2_Dim2      -0.2840367

Finally, with this code here, I tried to remove this batch effect directly and see the PCA before (first picture) and after (second picture), which seems worst than without any correction.

empirical <- rownames(RUVg_test_upper_quantile_norm)[ rownames(RUVg_test_upper_quantile_norm) %in% not_sig_GLOBAL ]

correction_RUVg <- RUVg(RUVg_test_upper_quantile_norm, empirical, k=2)
pdat <- pData(correction_RUVg)

vsd <- vst(dds,blind=FALSE)
vsd_to_crush = vsd

mat <- assay(vsd)

mm <- model.matrix(~vsd$Complete_mutation, colData(vsd))

mat2 <- limma::removeBatchEffect(mat, covariates = pdat, design = mm)

assay(vsd_to_crush) = mat2

DESeq2::plotPCA(vsd, intgroup = "cohort", ntop = 100000)
DESeq2::plotPCA(vsd_to_crush, intgroup = "cohort", ntop = 100000)

PCA before removal batch effect PCA after removal batch effect

My questions are the following :

1) Was I right to dive directly into RUVseq and RUVg, and did I miss something that I could do that was simpler.

2) Which read-out for the RUVg method would you use ?

3) Did I make the limma removebatcheffect function work correctly or am I missing something ?

4) If I did what needed to be done to assess the RUVg method, what would you do ? Analyse the cohort separately ?

Thank you in advance for reading and for your precious help ! Alexandre

DESeq2 RUVSeq RUVseq BatchEffect DES • 1.1k views

ADD COMMENT • link updated 9 months ago by James W. MacDonald 67k • written 9 months ago by Alexandre • 0

0

Entering edit mode

My code for the sessionInfo() . If the pictures are too big, I will downsize them. THank you again for reading my post !

> sessionInfo( )
R version 4.2.2 (2022-10-31 ucrt)
Platform: x86_64-w64-mingw32/x64 (64-bit)
Running under: Windows 10 x64 (build 19045)

Matrix products: default

locale:
[1] LC_COLLATE=French_France.utf8  LC_CTYPE=French_France.utf8    LC_MONETARY=French_France.utf8 LC_NUMERIC=C                   LC_TIME=French_France.utf8    

attached base packages:
[1] stats4    stats     graphics  grDevices utils     datasets  methods   base     

other attached packages:
 [1] corrplot_0.92               RUVSeq_1.36.0               edgeR_4.0.16                limma_3.54.2                EDASeq_2.36.0               aroma.light_3.32.0         
 [7] ShortRead_1.60.0            GenomicAlignments_1.36.0    Rsamtools_2.16.0            Biostrings_2.66.0           XVector_0.38.0              BiocParallel_1.32.6        
[13] latticeExtra_0.6-30         lattice_0.22-5              RColorBrewer_1.1-3          lubridate_1.9.3             forcats_1.0.0               stringr_1.5.1              
[19] dplyr_1.1.4                 purrr_1.0.2                 tidyr_1.3.0                 tibble_3.2.1                tidyverse_2.0.0             GenomicFeatures_1.54.4     
[25] AnnotationDbi_1.64.1        gplots_3.1.3                ggfortify_0.4.17            gridExtra_2.3               readr_2.1.5                 pcaExplorer_2.24.0         
[31] tximportData_1.26.0         tximport_1.26.1             DESeq2_1.38.3               FactoMineR_2.11             factoextra_1.0.7            devtools_2.4.5             
[37] usethis_2.2.2               ggsignif_0.6.4              ggpubr_0.6.0                readxl_1.4.3                data.table_1.14.10          PCAtools_2.10.0            
[43] ggrepel_0.9.5               ggplot2_3.4.4               magrittr_2.0.3              airway_1.22.0               SummarizedExperiment_1.28.0 Biobase_2.58.0             
[49] GenomicRanges_1.50.2        GenomeInfoDb_1.38.8         IRanges_2.32.0              S4Vectors_0.36.2            BiocGenerics_0.44.0         MatrixGenerics_1.10.0      
[55] matrixStats_1.1.0          

loaded via a namespace (and not attached):
  [1] estimability_1.4.1        rappdirs_0.3.3            SparseM_1.81              rtracklayer_1.60.1        AnnotationForge_1.40.2    R.methodsS3_1.8.2        
  [7] coda_0.19-4               ragg_1.2.7                bit64_4.0.5               knitr_1.45                R.utils_2.12.3            irlba_2.3.5.1            
 [13] multcomp_1.4-25           DelayedArray_0.24.0       hwriter_1.3.2.1           KEGGREST_1.38.0           RCurl_1.98-1.14           doParallel_1.0.17        
 [19] generics_0.1.3            ScaledMatrix_1.6.0        cowplot_1.1.2             TH.data_1.1-2             RSQLite_2.3.4             bit_4.0.5                
 [25] tzdb_0.4.0                webshot_0.5.5             xml2_1.3.6                httpuv_1.6.13             assertthat_0.2.1          viridis_0.6.4            
 [31] xfun_0.41                 hms_1.1.3                 evaluate_0.23             promises_1.2.1            TSP_1.2-4                 fansi_1.0.6              
 [37] restfulr_0.0.15           progress_1.2.3            caTools_1.18.2            dendextend_1.17.1         dbplyr_2.4.0              Rgraphviz_2.42.0         
 [43] igraph_1.6.0              DBI_1.2.1                 geneplotter_1.76.0        htmlwidgets_1.6.4         ellipsis_0.3.2            crosstalk_1.2.1          
 [49] backports_1.4.1           annotate_1.76.0           gridBase_0.4-7            deldir_2.0-2              biomaRt_2.58.2            sparseMatrixStats_1.10.0 
 [55] vctrs_0.6.5               remotes_2.4.2.1           abind_1.4-5               cachem_1.0.8              withr_3.0.0               vroom_1.6.5              
 [61] emmeans_1.9.0             prettyunits_1.2.0         cluster_2.1.6             lazyeval_0.2.2            crayon_1.5.2              genefilter_1.80.3        
 [67] labeling_0.4.3            pkgconfig_2.0.3           pkgload_1.3.4             seriation_1.5.4           rlang_1.1.3               lifecycle_1.0.4          
 [73] miniUI_0.1.1.1            sandwich_3.1-0            registry_0.5-1            filelock_1.0.3            BiocFileCache_2.6.1       GOstats_2.64.0           
 [79] rsvd_1.0.5                cellranger_1.1.0          graph_1.76.0              rngtools_1.5.2            Matrix_1.6-5              carData_3.0-5            
 [85] zoo_1.8-12                base64enc_0.1-3           pheatmap_1.0.12           png_0.1-8                 viridisLite_0.4.2         rjson_0.2.21             
 [91] ca_0.71.1                 bitops_1.0-7              shinydashboard_0.7.2      R.oo_1.25.0               KernSmooth_2.23-22        blob_1.2.4               
 [97] DelayedMatrixStats_1.20.0 multcompView_0.1-9        jpeg_0.1-10               rstatix_0.7.2             shinyAce_0.4.2            beachmat_2.14.2          
[103] scales_1.3.0              leaps_3.1                 memoise_2.0.1             GSEABase_1.60.0           plyr_1.8.9                zlibbioc_1.44.0          
[109] threejs_0.3.3             compiler_4.2.2            dqrng_0.3.2               BiocIO_1.8.0              cli_3.6.2                 urlchecker_1.0.1         
[115] Category_2.64.0           MASS_7.3-60.0.1           tidyselect_1.2.0          stringi_1.8.3             textshaping_0.3.7         shinyBS_0.61.1           
[121] yaml_2.3.8                BiocSingular_1.14.0       locfit_1.5-9.8            grid_4.2.2                timechange_0.3.0          tools_4.2.2              
[127] parallel_4.2.2            rstudioapi_0.15.0         foreach_1.5.2             farver_2.1.1              scatterplot3d_0.3-44      digest_0.6.34            
[133] BiocManager_1.30.22       shiny_1.8.0               Rcpp_1.0.12               car_3.1-2                 broom_1.0.5               later_1.3.2              
[139] httr_1.4.7                colorspace_2.1-0          XML_3.99-0.16             fs_1.6.3                  topGO_2.50.0              splines_4.2.2            
[145] RBGL_1.74.0               systemfonts_1.0.5         plotly_4.10.4             sessioninfo_1.2.2         xtable_1.8-4              jsonlite_1.8.8           
[151] heatmaply_1.5.0           flashClust_1.01-2         R6_2.5.1                  profvis_0.3.8             pillar_1.9.0              htmltools_0.5.7          
[157] mime_0.12                 NMF_0.26                  glue_1.6.2                fastmap_1.1.1             DT_0.31                   codetools_0.2-19         
[163] pkgbuild_1.4.3            mvtnorm_1.2-4             utf8_1.2.4                curl_5.2.0                gtools_3.9.5              interp_1.1-5             
[169] GO.db_3.16.0              survival_3.5-7            rmarkdown_2.25            munsell_0.5.0             GenomeInfoDbData_1.2.9    iterators_1.0.14         
[175] reshape2_1.4.4            gtable_0.3.4

ADD REPLY • link 9 months ago Alexandre • 0

score 2 · Accepted Answer · 2024-04-25

2

Entering edit mode

James W. MacDonald 67k

@james-w-macdonald-5106

Last seen 5 hours ago

United States

You cannot correct for technical variability due to batch if samples are nested in batch. The technical and biological differences are completely confounded, and any reduction in the batch variability will necessarily reduce biological variability as well. You will never know if any differences are due to technical or biological differences, so analyzing these data is pretty much a lost cause.

ADD COMMENT • link 9 months ago James W. MacDonald 67k

0

Entering edit mode

Hello James,

Thank you very much for your answer and taking the time to read my post.

I understand your remark, but isn't the approach of RUVseq trying to get around this problem when correcting, to be free to compare what you want if ever you manage to validate the "transformation" of the data ? Let's say for instance a functionnal validation using GSEA before and after RUVseq for known pathways ? Your conclusion is pretty clear, it is just for me to be sure to understand.

Best regards, Alexandre

ADD REPLY • link 9 months ago Alexandre • 0

0

Entering edit mode

Both RUVseq and sva are meant to be used to remove technical variability. The problem is the identification of variability that is technical for either method to remove.

Here's a simple example. Let's say someone wants to test a diet, and they design the experiment so the people on the diet are all in Europe and the controls are all in the US. To determine if the diet works, they measure the weights of the subjects after 6 months. But the Europeans use a fancy new electronic scale, and the Americans just use a rusty old bathroom scale someone brought from home. Unbenownst to the researchers, the bathroom scale is off by -3 lbs.

The Europeans measure their subjects, and conclude that they weigh 150 lbs on average. The Americans do the same, and report an average of 150 as well. But their actual average is 153! Without a way to unambiguously determine the accuracy of the scale (which you don't have for RNA-Seq), there is no way to say if the results are due to technical differences (the inaccuracies of the bathroom scale) or that the diet really doesn't work.

If Gene X is higher in the red samples as compared to the blue samples, you don't know if that is because of technical differences between the platform or lab, or if it's due to actual changes in the gene expression. And there is no way to determine that without knowing what the actual underlying gene expression is, and if you already knew that, you wouldn't need to do the RNA-Seq in the first place.

ADD REPLY • link 9 months ago James W. MacDonald 67k

0

Entering edit mode

Hello James,

Thank you for your answer and precise exemple ! It is disappointing but makes sense.

Best,

Alexandre

ADD REPLY • link 9 months ago Alexandre • 0

0

Entering edit mode

Is useless the assumption that the majority of genes have no variation in expression (bearing in mind that they are distributed over the entire dynamic measurement range)?

ADD REPLY • link 9 months ago SamGG ▴ 360

0

Entering edit mode

That's an assumption used for normalization of a sample, not an assumption you can use for an individual model. We always normalize to attempt to remove technical variability, but if there are batches, one still incorporates the batch effect in the model. In this case, batch is confounded with the groups being compared, so cannot be added to the model.

If normalization would 'fix' things, nobody would ever consider adding a batch effect, right?

ADD REPLY • link 9 months ago James W. MacDonald 67k

0

Entering edit mode

I was thinking about removing unwanted variations at the sample level (ie a transformation shared across genes) not at the gene level. Correct, it sounds as a normalization, but I didn't say that adding a batch effect in individual model is useless. I will quietly explore RUV further.

ADD REPLY • link 9 months ago SamGG ▴ 360

0

Entering edit mode

Are you commenting on the OP's question, or adding a different question? If the latter, please start a new post rather than piggybacking on existing.

ADD REPLY • link 9 months ago James W. MacDonald 67k