Tximport of quasi-alignment-based RNAseq quantification (Salmon) for use with edgeR
2
0
Entering edit mode
@martinweihrauch-13664
Last seen 3.6 years ago

I used Salmon for quasi-alignment-based quantification of my RNAseq data (mouse, mm10, raw fastq-files). I imported the generated quant.sf data files into R for use with edgeR, following the manual here: https://bioconductor.org/packages/devel/bioc/vignettes/tximport/inst/doc/tximport.html

Here is my code:

# Importing the files with tximport

txi.salmon <- tximport(files, type = "salmon", tx2gene = tx2gene)

head(txi.salmon$counts) #edgeR import cts <- txi.salmon$counts
rownames(cts) <- gsub(rownames(cts), pattern = "\\.[0-9]+$", replacement = "") # Remove transcript version .** normMat <- txi.salmon$length
normMat <- normMat/exp(rowMeans(log(normMat)))
o <- log(calcNormFactors(cts/normMat)) + log(colSums(cts/normMat))
y <- DGEList(cts)
y$offset <- t(t(log(normMat)) + o) # y is now ready for estimate dispersion functions see edgeR User's Guide My question is now whether I can proceed with the generated y object as I would usually do, namely in the following way: # Pre-filtering edgeR y$samples
keep <- rowSums(cpm(y) > 1) >= 2 # Outfilter low-count/unexpressed genes (more than 1 CPM in at least 2 libraries; smallest group size is 3)
summary(keep)
y <- y[keep, keep.lib.sizes=FALSE] # Re-calculate library sizes after count reductions
y <- calcNormFactors(y, method = "TMM") # Calculate the library normalization factors

y <- estimateDisp(y, design, robust = TRUE)
fit <- glmQLFit(y, design, robust = TRUE)

Is it correct to filter in this way and calculate the library normalization factors as I do here? Does edgeR "know" to use the y$offset information that was put in? I am unsure as calcNormFactors was already called in the tximport part before. sessionInfo() R version 3.4.2 (2017-09-28) Platform: x86_64-pc-linux-gnu (64-bit) Running under: CentOS Linux 7 (Core) Matrix products: default BLAS/LAPACK: /scicore/soft/apps/OpenBLAS/0.2.13-GCC-4.8.4-LAPACK-3.5.0/lib/libopenblas_prescottp-r0.2.13.so locale: [1] LC_CTYPE=en_US.UTF-8 LC_NUMERIC=C LC_TIME=en_US.UTF-8 [4] LC_COLLATE=en_US.UTF-8 LC_MONETARY=en_US.UTF-8 LC_MESSAGES=en_US.UTF-8 [7] LC_PAPER=en_US.UTF-8 LC_NAME=C LC_ADDRESS=C [10] LC_TELEPHONE=C LC_MEASUREMENT=en_US.UTF-8 LC_IDENTIFICATION=C attached base packages: [1] stats4 parallel stats graphics grDevices utils datasets methods base other attached packages: [1] RSQLite_2.0 STRINGdb_1.18.0 rafalib_1.0.0 edgeR_3.20.9 [5] limma_3.34.9 biomaRt_2.34.2 tximport_1.6.0 GenomicFeatures_1.30.0 [9] AnnotationDbi_1.40.0 Biobase_2.38.0 GenomicRanges_1.30.0 GenomeInfoDb_1.14.0 [13] IRanges_2.12.0 S4Vectors_0.16.0 BiocGenerics_0.24.0 tximportData_1.6.0 loaded via a namespace (and not attached): [1] Rcpp_0.12.13 locfit_1.5-9.1 lattice_0.20-35 [4] prettyunits_1.0.2 png_0.1-7 Rsamtools_1.30.0 [7] Biostrings_2.46.0 gtools_3.5.0 assertthat_0.2.0 [10] digest_0.6.12 chron_2.3-51 R6_2.2.2 [13] plyr_1.8.4 sqldf_0.4-11 httr_1.3.1 [16] gplots_3.0.1 zlibbioc_1.24.0 rlang_0.1.4 [19] progress_1.1.2 curl_3.0 gdata_2.18.0 [22] blob_1.1.0 Matrix_1.2-12 hash_2.2.6 [25] gsubfn_0.7 proto_1.0.0 splines_3.4.2 [28] RMySQL_0.10.13 BiocParallel_1.12.0 statmod_1.4.30 [31] readr_1.1.1 stringr_1.2.0 igraph_1.1.2 [34] RCurl_1.95-4.8 bit_1.1-12 DelayedArray_0.4.1 [37] compiler_3.4.2 rtracklayer_1.38.0 pkgconfig_2.0.1 [40] tcltk_3.4.2 SummarizedExperiment_1.8.0 tibble_1.3.4 [43] GenomeInfoDbData_0.99.1 matrixStats_0.52.2 XML_3.98-1.9 [46] GenomicAlignments_1.14.1 bitops_1.0-6 grid_3.4.2 [49] DBI_0.7 magrittr_1.5 KernSmooth_2.23-15 [52] stringi_1.1.6 XVector_0.18.0 rjson_0.2.15 [55] RColorBrewer_1.1-2 tools_3.4.2 bit64_0.9-7 [58] hms_0.4.2 plotrix_3.6-6 yaml_2.1.14 [61] caTools_1.17.1 memoise_1.1.0  salmon edger tximport rnaseq • 679 views ADD COMMENT 2 Entering edit mode Aaron Lun ★ 27k @alun Last seen 4 hours ago The city by the bay I don't use tximport so I can't speak for that particular section of your code, but otherwise, relevant edgeR functions will automatically detect and use whatever is present in the $offset field. However, we advise using scaleOffset when assigning an offset matrix to a DGEList. This simply adjusts the mean offset per gene so that the offsets are interpretable on the same scale as the log-library sizes, which ensures that the log-fold changes from glmQLFTest are not distorted.

It is also fine to re-compute the normalization factors here. If an offset matrix is available in the DGEList, the normalization factors will not be used to compute the offsets for GLM fitting, so there is no chance of "double normalization".

0
Entering edit mode

Thanks, Aaron. Would I use scaleOffset() like this:

y$offset <- scaleOffset(y = y, offset = t(t(log(normMat)) + o)) # y is now ready for estimate dispersion functions see edgeR User's Guide Or should I do this: offset <- t(t(log(normMat)) + o) y <- scaleOffset(y, offset) Are you sure scaleOffset() is necessary here, as I am unsure whether the calculations in the tximport bit are taking care of this already: normMat <- txi.salmon$length
normMat <- normMat/exp(rowMeans(log(normMat)))
o <- log(calcNormFactors(cts/normMat)) + log(colSums(cts/normMat))
y <- DGEList(cts)

0
Entering edit mode

The latter call scaleOffset(y, offset) would be sufficient.

As for your second question: possibly not strictly necessary, as it seems that the addition of the o serves to scale the offsets already. However, using scaleOffset does no harm and protects your code against future changes to the name of the offset field.

0
Entering edit mode

Perfect, thanks a lot.

0
Entering edit mode

Thanks Aaron, I can take a look and if it seems like we're doing the same thing outside of an edgeR function, I'll just swap it out for scaleOffset.

I think this function came into edgeR in Bioc 3.4 and the tximport vignette code is from Bioc 3.3, but I didn't realize that I could swap it out until now.

0
Entering edit mode

I don't think it does the same thing, as scaleOffset will add/subtract a constant value per-row so that the row-average of the offsets is equal to the average log-library size. The tximport-related code above will alter the offset for each library with + o, so it's not a like-for-like replacement for that step.

Nonetheless, I'd still recommend using scaleOffset over a direct assignment to $offset, if for no other reason than the fact that users don't have to remember whether to use $offset or $offsets. ADD REPLY 0 Entering edit mode Ok so then I’ll just update how we insert the thing we had already so it’s more edgeR API friendly. Thanks ADD REPLY 0 Entering edit mode @james-w-macdonald-5106 Last seen 7 days ago United States It shouldn't matter if you run calcNormFactors. There is a function in edgeR called getOffset: > getOffset function (y) { offset <- y$offset
lib.size <- y$samples$lib.size
norm.factors <- y$samples$norm.factors
if (!is.null(offset)) {
return(offset)
}
else {
if (!is.null(norm.factors))
lib.size <- lib.size * norm.factors
return(log(lib.size))
}
}
<bytecode: 0x00000000201ae308>
<environment: namespace:edgeR>

And as you can see, if there is an offset matrix already, that will be used. Otherwise an offset will be computed using the library size and normalization factors.

0
Entering edit mode

So that is how it works! Thanks for your help.