Question

between lane normalization upper quantile

0

Entering edit mode

elesofi • 0

@30f67af1

Last seen 9 months ago

Belgium

Hi

I'm comparing the outputs of the betweenLaneNormalization(which="upper") function and the calcNormFactors(method="upperquartile") function. Technically both functions are based on the same method from Bullard. But I get completely opposite library sizes after I perform the normalization. Which one is right? Am I doing something wrong?

library(RUVSeq)
library(edgeR)
library(zebrafishRNASeq)
data(zfGenes)

filter <- apply(zfGenes, 1, function(x) length(x[x>5])>=2)
filtered <- zfGenes[filter,]
genes <- rownames(filtered)[grep("^ENS", rownames(filtered))]
spikes <- rownames(filtered)[grep("^ERCC", rownames(filtered))]

x <- as.factor(rep(c("Ctl", "Trt"), each=3))
set <- newSeqExpressionSet(as.matrix(filtered),
                           phenoData = data.frame(x, row.names=colnames(filtered)),)

set <- betweenLaneNormalization(set, which="upper",offset = T)
![enter image description here][1]
libsizes1 <- colSums(set@assayData$normalizedCounts)

design <- model.matrix(~x, data=pData(set))
y <- DGEList(counts=counts(set), group=x)
y <- calcNormFactors(y, method="upperquartile")

libsizes2 <- y$samples$lib.size*y$samples$norm.factors

plot(libsizes1,libsizes2)

    sessionInfo( )
R version 4.2.0 (2022-04-22 ucrt)
Platform: x86_64-w64-mingw32/x64 (64-bit)
Running under: Windows 10 x64 (build 19045)

Matrix products: default

locale:
[1] LC_COLLATE=English_United Kingdom.utf8 
[2] LC_CTYPE=English_United Kingdom.utf8   
[3] LC_MONETARY=English_United Kingdom.utf8
[4] LC_NUMERIC=C                           
[5] LC_TIME=English_United Kingdom.utf8    

attached base packages:
[1] stats4    stats     graphics  grDevices utils     datasets 
[7] methods   base     

other attached packages:
 [1] zebrafishRNASeq_1.18.0      RUVSeq_1.32.0              
 [3] EDASeq_2.32.0               ShortRead_1.56.1           
 [5] GenomicAlignments_1.34.1    SummarizedExperiment_1.28.0
 [7] MatrixGenerics_1.10.0       matrixStats_1.2.0          
 [9] Rsamtools_2.14.0            GenomicRanges_1.50.2       
[11] Biostrings_2.66.0           GenomeInfoDb_1.34.9        
[13] XVector_0.38.0              IRanges_2.32.0             
[15] S4Vectors_0.36.2            BiocParallel_1.32.6        
[17] Biobase_2.58.0              BiocGenerics_0.44.0        
[19] edgeR_3.40.2                limma_3.54.2               

loaded via a namespace (and not attached):
 [1] bitops_1.0-7           bit64_4.0.5           
 [3] filelock_1.0.3         RColorBrewer_1.1-3    
 [5] progress_1.2.3         httr_1.4.7            
 [7] tools_4.2.0            utf8_1.2.4            
 [9] R6_2.5.1               DBI_1.2.2             
[11] tidyselect_1.2.0       prettyunits_1.2.0     
[13] bit_4.0.5              curl_5.2.1            
[15] compiler_4.2.0         cli_3.6.2             
[17] xml2_1.3.6             DelayedArray_0.24.0   
[19] rtracklayer_1.58.0     rappdirs_0.3.3        
[21] stringr_1.5.1          digest_0.6.34         
[23] rmarkdown_2.25         R.utils_2.12.3        
[25] jpeg_0.1-10            pkgconfig_2.0.3       
[27] htmltools_0.5.7        dbplyr_2.4.0          
[29] fastmap_1.1.1          rlang_1.1.3           
[31] rstudioapi_0.15.0      RSQLite_2.3.5         
[33] BiocIO_1.8.0           generics_0.1.3        
[35] hwriter_1.3.2.1        dplyr_1.1.4           
[37] R.oo_1.26.0            RCurl_1.98-1.14       
[39] magrittr_2.0.3         GenomeInfoDbData_1.2.9
[41] interp_1.1-6           Matrix_1.6-5          
[43] Rcpp_1.0.12            fansi_1.0.6           
[45] lifecycle_1.0.4        R.methodsS3_1.8.2     
[47] stringi_1.8.3          yaml_2.3.8            
[49] MASS_7.3-56            zlibbioc_1.44.0       
[51] BiocFileCache_2.6.1    grid_4.2.0            
[53] blob_1.2.4             parallel_4.2.0        
[55] crayon_1.5.2           deldir_2.0-4          
[57] lattice_0.20-45        GenomicFeatures_1.50.4
[59] hms_1.1.3              KEGGREST_1.38.0       
[61] locfit_1.5-9.9         knitr_1.45            
[63] pillar_1.9.0           rjson_0.2.21          
[65] codetools_0.2-18       biomaRt_2.54.1        
[67] XML_3.99-0.16.1        glue_1.7.0            
[69] evaluate_0.23          latticeExtra_0.6-30   
[71] png_0.1-8              vctrs_0.6.5           
[73] cachem_1.0.8           xfun_0.42             
[75] aroma.light_3.28.0     restfulr_0.0.15       
[77] tibble_3.2.1           AnnotationDbi_1.60.2  
[79] memoise_2.0.1

EDASeq edgeR • 596 views

ADD COMMENT • link 9 months ago elesofi • 0

score 1 · Answer 1 · 2024-03-14

1

Entering edit mode

Gordon Smyth 52k

@gordon-smyth

Last seen 16 hours ago

WEHI, Melbourne, Australia

You seem to be assuming that you can simply add up the normalizedCounts values from betweenLaneNormalization to get normalized library sizes. I am not familiar with EDASeq, but I suspect that assumption might not be correct. The normalizedCounts values might perhaps be counts-per-million, in which case the column sums wouldn't have any meaning. Unfortunately, the help page for betweenLaneNormalization doesn't give any information about the output object. According to Section 6.1 of the EDASeq vignette, you can extract log library sizes using the offset function.

ADD COMMENT • link 9 months ago Gordon Smyth 52k

1

Entering edit mode

Gordon is correct. The column sums of the normalizedCounts matrix are meaningless and not intended to be used as size factors.

You can use offst(set) to extract a matrix of offsets to be added to downstream negative binomial models (e.g., in edgeR's glmFit) as illustrated in the EDASeq vignette. Note that the offsets extracted from EDASeq need to be multiplied by -1 before using them in edgeR (see section 7.3 of the EDASeq vignette for details).

ADD REPLY • link 9 months ago davide risso ▴ 980