Question: Tximport of quasi-alignment-based RNAseq quantification (Salmon) for use with edgeR
0
gravatar for martin.weihrauch
11 months ago by
martin.weihrauch10 wrote:

I used Salmon for quasi-alignment-based quantification of my RNAseq data (mouse, mm10, raw fastq-files). I imported the generated quant.sf data files into R for use with edgeR, following the manual here: https://bioconductor.org/packages/devel/bioc/vignettes/tximport/inst/doc/tximport.html

Here is my code: 

# Importing the files with tximport

txi.salmon <- tximport(files, type = "salmon", tx2gene = tx2gene)

head(txi.salmon$counts)


#edgeR import
cts <- txi.salmon$counts
rownames(cts) <- gsub(rownames(cts), pattern = "\\.[0-9]+$", replacement = "") # Remove transcript version .**
normMat <- txi.salmon$length
normMat <- normMat/exp(rowMeans(log(normMat)))
o <- log(calcNormFactors(cts/normMat)) + log(colSums(cts/normMat))
y <- DGEList(cts)
y$offset <- t(t(log(normMat)) + o)
# y is now ready for estimate dispersion functions see edgeR User's Guide

 

My question is now whether I can proceed with the generated y object as I would usually do, namely in the following way:

 

# Pre-filtering edgeR

y$samples
keep <- rowSums(cpm(y) > 1) >= 2 # Outfilter low-count/unexpressed genes (more than 1 CPM in at least 2 libraries; smallest group size is 3)
summary(keep)
y <- y[keep, keep.lib.sizes=FALSE] # Re-calculate library sizes after count reductions
y <- calcNormFactors(y, method = "TMM") # Calculate the library normalization factors

y <- estimateDisp(y, design, robust = TRUE)
fit <- glmQLFit(y, design, robust = TRUE)

 

Is it correct to filter in this way and calculate the library normalization factors as I do here? Does edgeR "know" to use the y$offset information that was put in?

I am unsure as calcNormFactors was already called in the tximport part before. 

 

 

 

 

sessionInfo()
R version 3.4.2 (2017-09-28)
Platform: x86_64-pc-linux-gnu (64-bit)
Running under: CentOS Linux 7 (Core)

Matrix products: default
BLAS/LAPACK: /scicore/soft/apps/OpenBLAS/0.2.13-GCC-4.8.4-LAPACK-3.5.0/lib/libopenblas_prescottp-r0.2.13.so

locale:
 [1] LC_CTYPE=en_US.UTF-8       LC_NUMERIC=C               LC_TIME=en_US.UTF-8       
 [4] LC_COLLATE=en_US.UTF-8     LC_MONETARY=en_US.UTF-8    LC_MESSAGES=en_US.UTF-8   
 [7] LC_PAPER=en_US.UTF-8       LC_NAME=C                  LC_ADDRESS=C              
[10] LC_TELEPHONE=C             LC_MEASUREMENT=en_US.UTF-8 LC_IDENTIFICATION=C       

attached base packages:
[1] stats4    parallel  stats     graphics  grDevices utils     datasets  methods   base     

other attached packages:
 [1] RSQLite_2.0            STRINGdb_1.18.0        rafalib_1.0.0          edgeR_3.20.9          
 [5] limma_3.34.9           biomaRt_2.34.2         tximport_1.6.0         GenomicFeatures_1.30.0
 [9] AnnotationDbi_1.40.0   Biobase_2.38.0         GenomicRanges_1.30.0   GenomeInfoDb_1.14.0   
[13] IRanges_2.12.0         S4Vectors_0.16.0       BiocGenerics_0.24.0    tximportData_1.6.0    

loaded via a namespace (and not attached):
 [1] Rcpp_0.12.13               locfit_1.5-9.1             lattice_0.20-35           
 [4] prettyunits_1.0.2          png_0.1-7                  Rsamtools_1.30.0          
 [7] Biostrings_2.46.0          gtools_3.5.0               assertthat_0.2.0          
[10] digest_0.6.12              chron_2.3-51               R6_2.2.2                  
[13] plyr_1.8.4                 sqldf_0.4-11               httr_1.3.1                
[16] gplots_3.0.1               zlibbioc_1.24.0            rlang_0.1.4               
[19] progress_1.1.2             curl_3.0                   gdata_2.18.0              
[22] blob_1.1.0                 Matrix_1.2-12              hash_2.2.6                
[25] gsubfn_0.7                 proto_1.0.0                splines_3.4.2             
[28] RMySQL_0.10.13             BiocParallel_1.12.0        statmod_1.4.30            
[31] readr_1.1.1                stringr_1.2.0              igraph_1.1.2              
[34] RCurl_1.95-4.8             bit_1.1-12                 DelayedArray_0.4.1        
[37] compiler_3.4.2             rtracklayer_1.38.0         pkgconfig_2.0.1           
[40] tcltk_3.4.2                SummarizedExperiment_1.8.0 tibble_1.3.4              
[43] GenomeInfoDbData_0.99.1    matrixStats_0.52.2         XML_3.98-1.9              
[46] GenomicAlignments_1.14.1   bitops_1.0-6               grid_3.4.2                
[49] DBI_0.7                    magrittr_1.5               KernSmooth_2.23-15        
[52] stringi_1.1.6              XVector_0.18.0             rjson_0.2.15              
[55] RColorBrewer_1.1-2         tools_3.4.2                bit64_0.9-7               
[58] hms_0.4.2                  plotrix_3.6-6              yaml_2.1.14               
[61] caTools_1.17.1             memoise_1.1.0  
rnaseq edger salmon tximport • 329 views
ADD COMMENTlink modified 11 months ago by Aaron Lun23k • written 11 months ago by martin.weihrauch10
Answer: Tximport of quasi-alignment-based RNAseq quantification (Salmon) for use with ed
2
gravatar for Aaron Lun
11 months ago by
Aaron Lun23k
Cambridge, United Kingdom
Aaron Lun23k wrote:

I don't use tximport so I can't speak for that particular section of your code, but otherwise, relevant edgeR functions will automatically detect and use whatever is present in the $offset field. However, we advise using scaleOffset when assigning an offset matrix to a DGEList. This simply adjusts the mean offset per gene so that the offsets are interpretable on the same scale as the log-library sizes, which ensures that the log-fold changes from glmQLFTest are not distorted.

It is also fine to re-compute the normalization factors here. If an offset matrix is available in the DGEList, the normalization factors will not be used to compute the offsets for GLM fitting, so there is no chance of "double normalization".

ADD COMMENTlink written 11 months ago by Aaron Lun23k

Thanks, Aaron. Would I use scaleOffset() like this:

y$offset <- scaleOffset(y = y, offset = t(t(log(normMat)) + o))
# y is now ready for estimate dispersion functions see edgeR User's Guide

 

Or should I do this:

offset <- t(t(log(normMat)) + o)
y <- scaleOffset(y, offset)

Are you sure scaleOffset() is necessary here, as I am unsure whether the calculations in the tximport bit are taking care of this already:

normMat <- txi.salmon$length
normMat <- normMat/exp(rowMeans(log(normMat)))
o <- log(calcNormFactors(cts/normMat)) + log(colSums(cts/normMat))
y <- DGEList(cts)

 

 

ADD REPLYlink modified 11 months ago • written 11 months ago by martin.weihrauch10

The latter call scaleOffset(y, offset) would be sufficient.

As for your second question: possibly not strictly necessary, as it seems that the addition of the o serves to scale the offsets already. However, using scaleOffset does no harm and protects your code against future changes to the name of the offset field.

ADD REPLYlink written 11 months ago by Aaron Lun23k

Perfect, thanks a lot.

ADD REPLYlink written 11 months ago by martin.weihrauch10

Thanks Aaron, I can take a look and if it seems like we're doing the same thing outside of an edgeR function, I'll just swap it out for scaleOffset.

I think this function came into edgeR in Bioc 3.4 and the tximport vignette code is from Bioc 3.3, but I didn't realize that I could swap it out until now.

ADD REPLYlink written 11 months ago by Michael Love23k

I don't think it does the same thing, as scaleOffset will add/subtract a constant value per-row so that the row-average of the offsets is equal to the average log-library size. The tximport-related code above will alter the offset for each library with + o, so it's not a like-for-like replacement for that step.

Nonetheless, I'd still recommend using scaleOffset over a direct assignment to $offset, if for no other reason than the fact that users don't have to remember whether to use $offset or $offsets.

ADD REPLYlink written 11 months ago by Aaron Lun23k

Ok so then I’ll just update how we insert the thing we had already so it’s more edgeR API friendly. Thanks

ADD REPLYlink written 11 months ago by Michael Love23k
Answer: Tximport of quasi-alignment-based RNAseq quantification (Salmon) for use with ed
0
gravatar for James W. MacDonald
11 months ago by
United States
James W. MacDonald50k wrote:

It shouldn't matter if you run calcNormFactors. There is a function in edgeR called getOffset:

> getOffset
function (y)
{
    offset <- y$offset
    lib.size <- y$samples$lib.size
    norm.factors <- y$samples$norm.factors
    if (!is.null(offset)) {
        return(offset)
    }
    else {
        if (!is.null(norm.factors))
            lib.size <- lib.size * norm.factors
        return(log(lib.size))
    }
}
<bytecode: 0x00000000201ae308>
<environment: namespace:edgeR>

And as you can see, if there is an offset matrix already, that will be used. Otherwise an offset will be computed using the library size and normalization factors.

ADD COMMENTlink written 11 months ago by James W. MacDonald50k

So that is how it works! Thanks for your help.

ADD REPLYlink written 11 months ago by martin.weihrauch10
Please log in to add an answer.

Help
Access

Use of this site constitutes acceptance of our User Agreement and Privacy Policy.
Powered by Biostar version 16.09
Traffic: 117 users visited in the last hour