Hi
I'm comparing the outputs of the betweenLaneNormalization(which="upper") function and the calcNormFactors(method="upperquartile") function. Technically both functions are based on the same method from Bullard. But I get completely opposite library sizes after I perform the normalization. Which one is right? Am I doing something wrong?
library(RUVSeq)
library(edgeR)
library(zebrafishRNASeq)
data(zfGenes)
filter <- apply(zfGenes, 1, function(x) length(x[x>5])>=2)
filtered <- zfGenes[filter,]
genes <- rownames(filtered)[grep("^ENS", rownames(filtered))]
spikes <- rownames(filtered)[grep("^ERCC", rownames(filtered))]
x <- as.factor(rep(c("Ctl", "Trt"), each=3))
set <- newSeqExpressionSet(as.matrix(filtered),
phenoData = data.frame(x, row.names=colnames(filtered)),)
set <- betweenLaneNormalization(set, which="upper",offset = T)
![enter image description here][1]
libsizes1 <- colSums(set@assayData$normalizedCounts)
design <- model.matrix(~x, data=pData(set))
y <- DGEList(counts=counts(set), group=x)
y <- calcNormFactors(y, method="upperquartile")
libsizes2 <- y$samples$lib.size*y$samples$norm.factors
plot(libsizes1,libsizes2)
sessionInfo( )
R version 4.2.0 (2022-04-22 ucrt)
Platform: x86_64-w64-mingw32/x64 (64-bit)
Running under: Windows 10 x64 (build 19045)
Matrix products: default
locale:
[1] LC_COLLATE=English_United Kingdom.utf8
[2] LC_CTYPE=English_United Kingdom.utf8
[3] LC_MONETARY=English_United Kingdom.utf8
[4] LC_NUMERIC=C
[5] LC_TIME=English_United Kingdom.utf8
attached base packages:
[1] stats4 stats graphics grDevices utils datasets
[7] methods base
other attached packages:
[1] zebrafishRNASeq_1.18.0 RUVSeq_1.32.0
[3] EDASeq_2.32.0 ShortRead_1.56.1
[5] GenomicAlignments_1.34.1 SummarizedExperiment_1.28.0
[7] MatrixGenerics_1.10.0 matrixStats_1.2.0
[9] Rsamtools_2.14.0 GenomicRanges_1.50.2
[11] Biostrings_2.66.0 GenomeInfoDb_1.34.9
[13] XVector_0.38.0 IRanges_2.32.0
[15] S4Vectors_0.36.2 BiocParallel_1.32.6
[17] Biobase_2.58.0 BiocGenerics_0.44.0
[19] edgeR_3.40.2 limma_3.54.2
loaded via a namespace (and not attached):
[1] bitops_1.0-7 bit64_4.0.5
[3] filelock_1.0.3 RColorBrewer_1.1-3
[5] progress_1.2.3 httr_1.4.7
[7] tools_4.2.0 utf8_1.2.4
[9] R6_2.5.1 DBI_1.2.2
[11] tidyselect_1.2.0 prettyunits_1.2.0
[13] bit_4.0.5 curl_5.2.1
[15] compiler_4.2.0 cli_3.6.2
[17] xml2_1.3.6 DelayedArray_0.24.0
[19] rtracklayer_1.58.0 rappdirs_0.3.3
[21] stringr_1.5.1 digest_0.6.34
[23] rmarkdown_2.25 R.utils_2.12.3
[25] jpeg_0.1-10 pkgconfig_2.0.3
[27] htmltools_0.5.7 dbplyr_2.4.0
[29] fastmap_1.1.1 rlang_1.1.3
[31] rstudioapi_0.15.0 RSQLite_2.3.5
[33] BiocIO_1.8.0 generics_0.1.3
[35] hwriter_1.3.2.1 dplyr_1.1.4
[37] R.oo_1.26.0 RCurl_1.98-1.14
[39] magrittr_2.0.3 GenomeInfoDbData_1.2.9
[41] interp_1.1-6 Matrix_1.6-5
[43] Rcpp_1.0.12 fansi_1.0.6
[45] lifecycle_1.0.4 R.methodsS3_1.8.2
[47] stringi_1.8.3 yaml_2.3.8
[49] MASS_7.3-56 zlibbioc_1.44.0
[51] BiocFileCache_2.6.1 grid_4.2.0
[53] blob_1.2.4 parallel_4.2.0
[55] crayon_1.5.2 deldir_2.0-4
[57] lattice_0.20-45 GenomicFeatures_1.50.4
[59] hms_1.1.3 KEGGREST_1.38.0
[61] locfit_1.5-9.9 knitr_1.45
[63] pillar_1.9.0 rjson_0.2.21
[65] codetools_0.2-18 biomaRt_2.54.1
[67] XML_3.99-0.16.1 glue_1.7.0
[69] evaluate_0.23 latticeExtra_0.6-30
[71] png_0.1-8 vctrs_0.6.5
[73] cachem_1.0.8 xfun_0.42
[75] aroma.light_3.28.0 restfulr_0.0.15
[77] tibble_3.2.1 AnnotationDbi_1.60.2
[79] memoise_2.0.1

Gordon is correct. The column sums of the
normalizedCountsmatrix are meaningless and not intended to be used as size factors.You can use
offst(set)to extract a matrix of offsets to be added to downstream negative binomial models (e.g., inedgeR'sglmFit) as illustrated in the EDASeq vignette. Note that the offsets extracted from EDASeq need to be multiplied by -1 before using them in edgeR (see section 7.3 of the EDASeq vignette for details).