Question

DESeq2 sizefactors counfounded with biological treatment

0

Entering edit mode

Anna • 0

@e8300812

Last seen 13 months ago

United States

Hello,

I have a situation where I have RNA-seq data from bacteria isolated from live animals. The treatment group of the animals is directly related to the amount of bacteria we were able to isolate, and this is highly correlated with sequencing depth per sample. As it is, I have samples that have size factors of >1 and others over 150.... (pathogen loads range from ~10 - 1*10^9).

Does anyone have any recommendations on how I should deal with this? The literature has not been helpful so far. One thing I tried and it improved the sizefactors dramatically (i.e., all near or around 1.0) was to manually adjust the sizefactors by scaled log10 values of the pathogen loads. While this "works" I don't know if it is justifiable. Either way, it seems that pathogen load would need to be accounted for somehow, but it doesn't make sense to include it in the model design because I don't expect it to effect gene expression per se, just the amount of data available.


## normalized by pathogen load 
dds <- DESeqDataSetFromMatrix(countData=txi, colData = metadata, design = ~ tx.group)

normalized_load <- metadata$log10_quantity / median(metadata$log10_quantity )  # Normalize to median

sizeFactors(dds) <- normalized_load #four digit number/letter combinations are the samples and numbers below are the sizefactors 

# 2705      2757      2758      2773      2779      2780      2781 
# 1.0000000 1.1128290 0.4173812 0.2624746 0.9337339 1.0642834 1.0287548 
#    2782      2806      2808      2827      2828      2839      2868 
# 0.1222737 0.5702929 1.0141249 0.9923003 0.9436932 0.8445065 1.0004902 
#     VA94     VA942     VA943 
# 1.8676390 1.8676390 1.8676390 

##without normalizing by pathogen load 

dds <- estimateSizeFactors(dds)

sizeFactors(dds) #four digit number/letter combinations are the samples and numbers below are the sizefactors 

# 2705         2757         2758         2773         2779         2780 
# 2.951241e+00 2.746878e+00 1.877879e-02 4.221383e-02 3.143724e+00 2.688167e+00 
#        2781         2782         2806         2808         2827         2828 
# 3.082969e+00 1.487099e-02 1.848918e-02 2.305663e+00 1.390208e+00 6.259598e-03 
#        2839         2868         VA94        VA942        VA943 
# 8.771875e-01 1.570124e+00 1.629753e+02 8.095667e+01 8.706616e+01 



sessionInfo( )

R version 4.3.1 (2023-06-16 ucrt)
Platform: x86_64-w64-mingw32/x64 (64-bit)
Running under: Windows 11 x64 (build 22631)

Matrix products: default


locale:
[1] LC_COLLATE=English_United States.utf8 
[2] LC_CTYPE=English_United States.utf8   
[3] LC_MONETARY=English_United States.utf8
[4] LC_NUMERIC=C                          
[5] LC_TIME=English_United States.utf8    

time zone: America/New_York
tzcode source: internal

attached base packages:
[1] grid      stats4    stats     graphics  grDevices utils     datasets 
[8] methods   base     

other attached packages:
 [1] car_3.1-2                   carData_3.0-5              
 [3] VennDiagram_1.7.3           futile.logger_1.4.3        
 [5] boot_1.3-28.1               AnnotationForge_1.44.0     
 [7] GOSim_1.40.0                GO.db_3.18.0               
 [9] gprofiler2_0.2.3            Gviz_1.46.1                
[11] clusterProfiler_4.10.1      pathview_1.42.0            
[13] DEGreport_1.38.5            apeglm_1.24.0              
[15] EnhancedVolcano_1.20.0      ggrepel_0.9.5              
[17] ComplexHeatmap_2.18.0       pheatmap_1.0.12            
[19] vsn_3.70.0                  circlize_0.4.16            
[21] biomaRt_2.58.2              tximportData_1.30.0        
[23] tximport_1.30.0             sva_3.50.0                 
[25] BiocParallel_1.36.0         genefilter_1.84.0          
[27] mgcv_1.8-42                 nlme_3.1-162               
[29] RColorBrewer_1.1-3          gplots_3.1.3.1             
[31] GenomicFeatures_1.54.4      annotate_1.80.0            
[33] XML_3.99-0.17               AnnotationDbi_1.64.1       
[35] lmerTest_3.1-3              lme4_1.1-35.5              
[37] Matrix_1.6-5                Rmisc_1.5.1                
[39] reshape2_1.4.4              lmtest_0.9-40              
[41] zoo_1.8-12                  plyr_1.8.9                 
[43] data.table_1.15.4           MASS_7.3-60                
[45] adegenet_2.1.10             ade4_1.7-22                
[47] rgl_1.3.1                   arrayQualityMetrics_3.58.0 
[49] GGally_2.2.1                vegan_2.6-6.1              
[51] lattice_0.21-8              permute_0.9-7              
[53] ape_5.8                     lubridate_1.9.3            
[55] forcats_1.0.0               stringr_1.5.1              
[57] dplyr_1.1.3                 purrr_1.0.2                
[59] readr_2.1.5                 tidyr_1.3.1                
[61] tibble_3.2.1                ggplot2_3.5.1              
[63] tidyverse_2.0.0             edgeR_4.0.16               
[65] limma_3.58.1                DESeq2_1.42.0              
[67] SummarizedExperiment_1.32.0 Biobase_2.62.0             
[69] MatrixGenerics_1.14.0       matrixStats_1.2.0          
[71] GenomicRanges_1.54.1        GenomeInfoDb_1.38.5        
[73] IRanges_2.36.0              S4Vectors_0.40.2           
[75] BiocGenerics_0.48.1        

loaded via a namespace (and not attached):
  [1] dichromat_2.0-0.1           vroom_1.6.5                
  [3] progress_1.2.3              nnet_7.3-19                
  [5] Biostrings_2.70.1           affyPLM_1.78.0             
  [7] vctrs_0.6.4                 corpcor_1.6.10             
  [9] digest_0.6.33               png_0.1-8                  
 [11] shape_1.4.6.1               deldir_2.0-4               
 [13] reshape_0.8.9               httpuv_1.6.12              
 [15] foreach_1.5.2               qvalue_2.34.0              
 [17] withr_3.0.0                 ggfun_0.1.5                
 [19] psych_2.4.6.26              xfun_0.46                  
 [21] survival_3.5-5              memoise_2.0.1              
 [23] hexbin_1.28.3               gson_0.1.0                 
 [25] systemfonts_1.1.0           tidytree_0.4.6             
 [27] GlobalOptions_0.1.2         gtools_3.9.5               
 [29] KEGGgraph_1.62.0            logging_0.10-108           
 [31] Formula_1.2-5               prettyunits_1.2.0          
 [33] KEGGREST_1.42.0             promises_1.2.1             
 [35] httr_1.4.7                  restfulr_0.0.15            
 [37] rstudioapi_0.16.0           generics_0.1.3             
 [39] DOSE_3.28.2                 base64enc_0.1-3            
 [41] curl_5.2.0                  zlibbioc_1.48.0            
 [43] ggraph_2.2.1                polyclip_1.10-7            
 [45] GenomeInfoDbData_1.2.11     SparseArray_1.2.3          
 [47] RBGL_1.78.0                 xtable_1.8-4               
 [49] doParallel_1.0.17           evaluate_0.24.0            
 [51] S4Arrays_1.2.0              BiocFileCache_2.10.2       
 [53] preprocessCore_1.64.0       hms_1.1.3                  
 [55] colorspace_2.1-0            filelock_1.0.3             
 [57] flexmix_2.3-19              magrittr_2.0.3             
 [59] Rgraphviz_2.46.0            modeltools_0.2-23          
 [61] ggtree_3.10.1               viridis_0.6.5              
 [63] later_1.3.1                 SparseM_1.84-2             
 [65] shadowtext_0.1.4            cowplot_1.1.3              
 [67] Hmisc_5.1-3                 pillar_1.9.0               
 [69] iterators_1.0.14            caTools_1.18.2             
 [71] compiler_4.3.1              stringi_1.8.4              
 [73] minqa_1.2.7                 GenomicAlignments_1.38.2   
 [75] crayon_1.5.3                abind_1.4-5                
 [77] BiocIO_1.12.0               ggdendro_0.2.0             
 [79] gridGraphics_0.5-1          gcrma_2.74.0               
 [81] emdbook_1.3.13              locfit_1.5-9.8             
 [83] graphlayouts_1.1.1          org.Hs.eg.db_3.18.0        
 [85] bit_4.0.5                   fastmatch_1.1-4            
 [87] codetools_0.2-19            openssl_2.2.0              
 [89] biovizBase_1.50.0           GetoptLong_1.0.5           
 [91] plotly_4.10.4               mime_0.12                  
 [93] splines_4.3.1               Rcpp_1.0.11                
 [95] dbplyr_2.5.0                HDO.db_0.99.1              
 [97] Rttf2pt1_1.3.12             interp_1.1-6               
 [99] knitr_1.48                  blob_1.2.4                 
[101] utf8_1.2.4                  clue_0.3-65                
[103] AnnotationFilter_1.26.0     fs_1.6.4                   
[105] checkmate_2.3.2             ggplotify_0.1.2            
[107] statmod_1.5.0               tzdb_0.4.0                 
[109] svglite_2.1.3               tweenr_2.0.3               
[111] pkgconfig_2.0.3             tools_4.3.1                
[113] cachem_1.0.8                RSQLite_2.3.4              
[115] viridisLite_0.4.2           DBI_1.2.3                  
[117] numDeriv_2016.8-1.1         fastmap_1.1.1              
[119] rmarkdown_2.27              scales_1.3.0               
[121] gridSVG_1.7-5               Rsamtools_2.18.0           
[123] broom_1.0.6                 patchwork_1.2.0            
[125] coda_0.19-4.1               BiocManager_1.30.23        
[127] ggstats_0.6.0               VariantAnnotation_1.48.1   
[129] graph_1.80.0                rpart_4.1.19               
[131] farver_2.1.2                scatterpie_0.2.3           
[133] tidygraph_1.3.1             yaml_2.3.7                 
[135] latticeExtra_0.6-30         foreign_0.8-84             
[137] rtracklayer_1.62.0          illuminaio_0.44.0          
[139] cli_3.6.1                   lifecycle_1.0.4            
[141] askpass_1.2.0               mvtnorm_1.2-5              
[143] lambda.r_1.2.4              backports_1.5.0            
[145] timechange_0.3.0            gtable_0.3.5               
[147] rjson_0.2.21                seqinr_4.2-36              
[149] parallel_4.3.1              jsonlite_1.8.8             
[151] bitops_1.0-7                bit64_4.0.5                
[153] yulab.utils_0.1.5           base64_2.0.1               
[155] futile.options_1.0.1        bdsmatrix_1.3-7            
[157] GOSemSim_2.28.1             lazyeval_0.2.2             
[159] setRNG_2024.2-1             shiny_1.8.1.1              
[161] ConsensusClusterPlus_1.66.0 htmltools_0.5.8.1          
[163] affy_1.80.0                 enrichplot_1.22.0          
[165] formatR_1.14                rappdirs_0.3.3             
[167] ensembldb_2.26.0            glue_1.6.2                 
[169] XVector_0.42.0              RCurl_1.98-1.14            
[171] treeio_1.26.0               BSgenome_1.70.2            
[173] mnormt_2.1.1                jpeg_0.1-10                
[175] gridExtra_2.3               igraph_2.0.3               
[177] extrafontdb_1.0             R6_2.5.1                   
[179] BeadDataPackR_1.54.0        labeling_0.4.3             
[181] cluster_2.1.4               bbmle_1.0.25.1             
[183] beadarray_2.52.0            aplot_0.2.3                
[185] nloptr_2.1.1                ProtGenerics_1.34.0        
[187] DelayedArray_0.28.0         tidyselect_1.2.1           
[189] htmlTable_2.4.3             ggforce_0.4.2              
[191] xml2_1.3.6                  munsell_0.5.1              
[193] KernSmooth_2.23-21          topGO_2.54.0               
[195] affyio_1.72.0               htmlwidgets_1.6.4          
[197] fgsea_1.28.0                hwriter_1.3.2.1            
[199] rlang_1.1.1                 extrafont_0.19             
[201] fansi_1.0.5

DESeq2 • 536 views

ADD COMMENT • link 14 months ago Anna • 0

score 0 · Answer 1 · 2024-11-27

Your question is off-topic for this site, which is primarily intended to help people with technical issues with Bioconductor packages rather than analysis help. A better place for your question is biostars.org.

That said, the size factors are meant to adjust for sequencing depth, so to a certain extent they should do that without any intervention on your part. However, it is not uncommon for an MDS plot of samples with highly disparate sequencing depth to indicate that the biggest difference is due to sequencing depth. I have seen this issue many times with miRNA-Seq data. The cure for this is not to mess with the size factors. Alternatives you might consider include down-sampling to consistent library sizes, or including sampling depth as a factor in your model, or using sva or RUVseq to generate surrogate variables that may control for the depth.