Hello,
I have a situation where I have RNA-seq data from bacteria isolated from live animals. The treatment group of the animals is directly related to the amount of bacteria we were able to isolate, and this is highly correlated with sequencing depth per sample. As it is, I have samples that have size factors of >1 and others over 150.... (pathogen loads range from ~10 - 1*10^9).
Does anyone have any recommendations on how I should deal with this? The literature has not been helpful so far. One thing I tried and it improved the sizefactors dramatically (i.e., all near or around 1.0) was to manually adjust the sizefactors by scaled log10 values of the pathogen loads. While this "works" I don't know if it is justifiable. Either way, it seems that pathogen load would need to be accounted for somehow, but it doesn't make sense to include it in the model design because I don't expect it to effect gene expression per se, just the amount of data available.
## normalized by pathogen load
dds <- DESeqDataSetFromMatrix(countData=txi, colData = metadata, design = ~ tx.group)
normalized_load <- metadata$log10_quantity / median(metadata$log10_quantity ) # Normalize to median
sizeFactors(dds) <- normalized_load #four digit number/letter combinations are the samples and numbers below are the sizefactors
# 2705 2757 2758 2773 2779 2780 2781
# 1.0000000 1.1128290 0.4173812 0.2624746 0.9337339 1.0642834 1.0287548
# 2782 2806 2808 2827 2828 2839 2868
# 0.1222737 0.5702929 1.0141249 0.9923003 0.9436932 0.8445065 1.0004902
# VA94 VA942 VA943
# 1.8676390 1.8676390 1.8676390
##without normalizing by pathogen load
dds <- estimateSizeFactors(dds)
sizeFactors(dds) #four digit number/letter combinations are the samples and numbers below are the sizefactors
# 2705 2757 2758 2773 2779 2780
# 2.951241e+00 2.746878e+00 1.877879e-02 4.221383e-02 3.143724e+00 2.688167e+00
# 2781 2782 2806 2808 2827 2828
# 3.082969e+00 1.487099e-02 1.848918e-02 2.305663e+00 1.390208e+00 6.259598e-03
# 2839 2868 VA94 VA942 VA943
# 8.771875e-01 1.570124e+00 1.629753e+02 8.095667e+01 8.706616e+01
sessionInfo( )
R version 4.3.1 (2023-06-16 ucrt)
Platform: x86_64-w64-mingw32/x64 (64-bit)
Running under: Windows 11 x64 (build 22631)
Matrix products: default
locale:
[1] LC_COLLATE=English_United States.utf8
[2] LC_CTYPE=English_United States.utf8
[3] LC_MONETARY=English_United States.utf8
[4] LC_NUMERIC=C
[5] LC_TIME=English_United States.utf8
time zone: America/New_York
tzcode source: internal
attached base packages:
[1] grid stats4 stats graphics grDevices utils datasets
[8] methods base
other attached packages:
[1] car_3.1-2 carData_3.0-5
[3] VennDiagram_1.7.3 futile.logger_1.4.3
[5] boot_1.3-28.1 AnnotationForge_1.44.0
[7] GOSim_1.40.0 GO.db_3.18.0
[9] gprofiler2_0.2.3 Gviz_1.46.1
[11] clusterProfiler_4.10.1 pathview_1.42.0
[13] DEGreport_1.38.5 apeglm_1.24.0
[15] EnhancedVolcano_1.20.0 ggrepel_0.9.5
[17] ComplexHeatmap_2.18.0 pheatmap_1.0.12
[19] vsn_3.70.0 circlize_0.4.16
[21] biomaRt_2.58.2 tximportData_1.30.0
[23] tximport_1.30.0 sva_3.50.0
[25] BiocParallel_1.36.0 genefilter_1.84.0
[27] mgcv_1.8-42 nlme_3.1-162
[29] RColorBrewer_1.1-3 gplots_3.1.3.1
[31] GenomicFeatures_1.54.4 annotate_1.80.0
[33] XML_3.99-0.17 AnnotationDbi_1.64.1
[35] lmerTest_3.1-3 lme4_1.1-35.5
[37] Matrix_1.6-5 Rmisc_1.5.1
[39] reshape2_1.4.4 lmtest_0.9-40
[41] zoo_1.8-12 plyr_1.8.9
[43] data.table_1.15.4 MASS_7.3-60
[45] adegenet_2.1.10 ade4_1.7-22
[47] rgl_1.3.1 arrayQualityMetrics_3.58.0
[49] GGally_2.2.1 vegan_2.6-6.1
[51] lattice_0.21-8 permute_0.9-7
[53] ape_5.8 lubridate_1.9.3
[55] forcats_1.0.0 stringr_1.5.1
[57] dplyr_1.1.3 purrr_1.0.2
[59] readr_2.1.5 tidyr_1.3.1
[61] tibble_3.2.1 ggplot2_3.5.1
[63] tidyverse_2.0.0 edgeR_4.0.16
[65] limma_3.58.1 DESeq2_1.42.0
[67] SummarizedExperiment_1.32.0 Biobase_2.62.0
[69] MatrixGenerics_1.14.0 matrixStats_1.2.0
[71] GenomicRanges_1.54.1 GenomeInfoDb_1.38.5
[73] IRanges_2.36.0 S4Vectors_0.40.2
[75] BiocGenerics_0.48.1
loaded via a namespace (and not attached):
[1] dichromat_2.0-0.1 vroom_1.6.5
[3] progress_1.2.3 nnet_7.3-19
[5] Biostrings_2.70.1 affyPLM_1.78.0
[7] vctrs_0.6.4 corpcor_1.6.10
[9] digest_0.6.33 png_0.1-8
[11] shape_1.4.6.1 deldir_2.0-4
[13] reshape_0.8.9 httpuv_1.6.12
[15] foreach_1.5.2 qvalue_2.34.0
[17] withr_3.0.0 ggfun_0.1.5
[19] psych_2.4.6.26 xfun_0.46
[21] survival_3.5-5 memoise_2.0.1
[23] hexbin_1.28.3 gson_0.1.0
[25] systemfonts_1.1.0 tidytree_0.4.6
[27] GlobalOptions_0.1.2 gtools_3.9.5
[29] KEGGgraph_1.62.0 logging_0.10-108
[31] Formula_1.2-5 prettyunits_1.2.0
[33] KEGGREST_1.42.0 promises_1.2.1
[35] httr_1.4.7 restfulr_0.0.15
[37] rstudioapi_0.16.0 generics_0.1.3
[39] DOSE_3.28.2 base64enc_0.1-3
[41] curl_5.2.0 zlibbioc_1.48.0
[43] ggraph_2.2.1 polyclip_1.10-7
[45] GenomeInfoDbData_1.2.11 SparseArray_1.2.3
[47] RBGL_1.78.0 xtable_1.8-4
[49] doParallel_1.0.17 evaluate_0.24.0
[51] S4Arrays_1.2.0 BiocFileCache_2.10.2
[53] preprocessCore_1.64.0 hms_1.1.3
[55] colorspace_2.1-0 filelock_1.0.3
[57] flexmix_2.3-19 magrittr_2.0.3
[59] Rgraphviz_2.46.0 modeltools_0.2-23
[61] ggtree_3.10.1 viridis_0.6.5
[63] later_1.3.1 SparseM_1.84-2
[65] shadowtext_0.1.4 cowplot_1.1.3
[67] Hmisc_5.1-3 pillar_1.9.0
[69] iterators_1.0.14 caTools_1.18.2
[71] compiler_4.3.1 stringi_1.8.4
[73] minqa_1.2.7 GenomicAlignments_1.38.2
[75] crayon_1.5.3 abind_1.4-5
[77] BiocIO_1.12.0 ggdendro_0.2.0
[79] gridGraphics_0.5-1 gcrma_2.74.0
[81] emdbook_1.3.13 locfit_1.5-9.8
[83] graphlayouts_1.1.1 org.Hs.eg.db_3.18.0
[85] bit_4.0.5 fastmatch_1.1-4
[87] codetools_0.2-19 openssl_2.2.0
[89] biovizBase_1.50.0 GetoptLong_1.0.5
[91] plotly_4.10.4 mime_0.12
[93] splines_4.3.1 Rcpp_1.0.11
[95] dbplyr_2.5.0 HDO.db_0.99.1
[97] Rttf2pt1_1.3.12 interp_1.1-6
[99] knitr_1.48 blob_1.2.4
[101] utf8_1.2.4 clue_0.3-65
[103] AnnotationFilter_1.26.0 fs_1.6.4
[105] checkmate_2.3.2 ggplotify_0.1.2
[107] statmod_1.5.0 tzdb_0.4.0
[109] svglite_2.1.3 tweenr_2.0.3
[111] pkgconfig_2.0.3 tools_4.3.1
[113] cachem_1.0.8 RSQLite_2.3.4
[115] viridisLite_0.4.2 DBI_1.2.3
[117] numDeriv_2016.8-1.1 fastmap_1.1.1
[119] rmarkdown_2.27 scales_1.3.0
[121] gridSVG_1.7-5 Rsamtools_2.18.0
[123] broom_1.0.6 patchwork_1.2.0
[125] coda_0.19-4.1 BiocManager_1.30.23
[127] ggstats_0.6.0 VariantAnnotation_1.48.1
[129] graph_1.80.0 rpart_4.1.19
[131] farver_2.1.2 scatterpie_0.2.3
[133] tidygraph_1.3.1 yaml_2.3.7
[135] latticeExtra_0.6-30 foreign_0.8-84
[137] rtracklayer_1.62.0 illuminaio_0.44.0
[139] cli_3.6.1 lifecycle_1.0.4
[141] askpass_1.2.0 mvtnorm_1.2-5
[143] lambda.r_1.2.4 backports_1.5.0
[145] timechange_0.3.0 gtable_0.3.5
[147] rjson_0.2.21 seqinr_4.2-36
[149] parallel_4.3.1 jsonlite_1.8.8
[151] bitops_1.0-7 bit64_4.0.5
[153] yulab.utils_0.1.5 base64_2.0.1
[155] futile.options_1.0.1 bdsmatrix_1.3-7
[157] GOSemSim_2.28.1 lazyeval_0.2.2
[159] setRNG_2024.2-1 shiny_1.8.1.1
[161] ConsensusClusterPlus_1.66.0 htmltools_0.5.8.1
[163] affy_1.80.0 enrichplot_1.22.0
[165] formatR_1.14 rappdirs_0.3.3
[167] ensembldb_2.26.0 glue_1.6.2
[169] XVector_0.42.0 RCurl_1.98-1.14
[171] treeio_1.26.0 BSgenome_1.70.2
[173] mnormt_2.1.1 jpeg_0.1-10
[175] gridExtra_2.3 igraph_2.0.3
[177] extrafontdb_1.0 R6_2.5.1
[179] BeadDataPackR_1.54.0 labeling_0.4.3
[181] cluster_2.1.4 bbmle_1.0.25.1
[183] beadarray_2.52.0 aplot_0.2.3
[185] nloptr_2.1.1 ProtGenerics_1.34.0
[187] DelayedArray_0.28.0 tidyselect_1.2.1
[189] htmlTable_2.4.3 ggforce_0.4.2
[191] xml2_1.3.6 munsell_0.5.1
[193] KernSmooth_2.23-21 topGO_2.54.0
[195] affyio_1.72.0 htmlwidgets_1.6.4
[197] fgsea_1.28.0 hwriter_1.3.2.1
[199] rlang_1.1.1 extrafont_0.19
[201] fansi_1.0.5