#### The support.bioconductor.org editor has been updated to markdown! Please see more info at: Tutorial: Updated Support Site Editor

Question: processing of (salmon) gene counts for eqtl analysis
0
10 months ago by
VSM120
United Kingdom
VSM120 wrote:

Hi there

I have a question about the appropriate salmon-generated gene counts, which take into account gene length, for the purpose of running eqtl analysis. I would much appreciate your suggestions on this.

I used tximport to obtain gene-level counts/abundance estimates for our RNAseq data (estimated using default countsFromAbundance=no). The downstream steps involve running TMM normalisation (to obtain normalised counts), rank transformation and batch correction, before running eqtl analysis.

With this approach, I am concerned that the counts do not take into account the length of genes. Your document here :
https://bioconductor.org/packages/release/bioc/vignettes/tximport/inst/doc/tximport.html
section : Use with downstream Bioconductor differential expression packages

outlines two ways in which one can get the counts ready for DE analysis, by taking into account gene/transcript length. I have a couple of questions:

Do the two methods work in a similar way, and can either be applied to edgeR/DEseq2/limma-voom?

If yes, would counts generated with tximport's countsFromAbundance = "lengthScaledTPM" be the appropriate ones to use for my eqtl pipeline?

If no, would the block of code under edgeR be appropriate to apply for my eqtl processing pipeline too ? I use the output of calcNormFactors to modify my counts. I don't know if/where I should use the offset, or if running calcNormFactors(cts/normMat) would do what I need?

modified 9 months ago by Michael Love22k • written 10 months ago by VSM120
Answer: processing of (salmon) gene counts for eqtl analysis
1
9 months ago by
Michael Love22k
United States
Michael Love22k wrote:

"Do the two methods work in a similar way, and can either be applied to edgeR/DESeq2/limma-voom?"

This is covered in that section of the tximport vignette, but I can review:

You can use edgeR and DESeq2 with the counts plus an offset, OR you can use the countsFromAbundance technique and pass the "counts" matrix to the methods as if they were the original counts. You cannot use the offset with limma-voom because it doesn't accept offsets (or it didn't when we last checked). So for limma-voom you have to use the countsFromAbundance technique.

For passing normalized counts to an eQTL pipeline, if you use DESeqDataSetFromTximport, it will also correct for gene length when you ask for normalized counts. I believe edgeR would as well, but I'm less familiar with the code to get this.

E.g. for DESeq2:

txi <- tximport(files, type="salmon", tx2gene=tx2gene)
coldata <- data.frame(x=seq_along(files)) # dummy coldata...
dds <- DESeqDataSetFromTximport(txi, coldata, ~1)
dds <- estimateSizeFactors(dds)
ncts <- counts(dds, normalized=TRUE)

The second to last line will say "using 'avgTxLength' from assays(dds), correcting for library size", letting you know that the gene lengths were taken into account when estimating library size correction and both are then used to create the normalized counts.

The meaning of elements of this final matrix are: the count that a given sample would have had if it had been sequenced at a similar depth to the "middle" sample in terms of sequencing depth, and if the gene length for that sample was similar to the "middle" sample in terms of gene length.

Thanks Michael. I still have a few issues ...

1. If I re-import with DESeqDataSetFromTximport, can I supply TMM normalisation factors to your code snippet? (I know they are similar but I guess not identical (?), and I want to stay consistent with previous analyses I run). I don't know what (if any) between sample normalisation counts(.., normalized=TRUE) performs ..

tmm_factors <- calcNormFactors(cts, method="TMM")
...
dds <- estimateSizeFactors(dds) # supply TMM-derived size factors here?
...

2. If I want to estimate the offset, and apply it to modify my counts (with TMM normalisation factors and gene length), would this code estimate offset by taking into account TMM normalisation factors and gene length?

# tximport with default params, ie countsFromAbundance = "no"
cts <- txi$counts normMat <- txi$length
normMat <- normMat/exp(rowMeans(log(normMat)))

library(edgeR)
o <- log(calcNormFactors(cts/normMat, method="TMM")) + log(colSums(cts/normMat))
d <- DGEList(cts)
d$offset <- t(t(log(normMat)) + o) • if yes, how do I actually apply this offset to modify my counts? Previously, I ran the following to get TMM-modified counts, but didn't take info account gene length: d<-DGEList(cts) d<-calcNormFactors(d,method =c("TMM")) cts_norm <- cpm(d, normalized.lib.sizes=TRUE) # this command doesn't use d$offset ..

Sure, in general it's not that hard to use TMM here. To calculate something like a DESeq2 "size factor", you just multiply edgeR's output of calcNormFactors by the column sum, and then scale it to be centered around 1:

sf <- calcNormFactors(x) * colSums(x)
sf <- sf / median(sf)


Now you want normalized counts with TMM normalization and gene length divided out. So you want to obtain the size factor after dividing the gene length matrix out, i.e. when x = cts/normMat.

cts <- txi$counts normMat <- txi$length
normMat <- normMat/exp(rowMeans(log(normMat)))
x <- cts/normMat
sf <- calcNormFactors(x) * colSums(x)
sf <- sf / median(sf)


Finally, rather than stuff these pieces into a DESeqDataSet, you can skip that and just divide the size factor out of the gene-length-normalized matrix:

ncts <- t(t(x) / sf)