Question

processing of (salmon) gene counts for eqtl analysis

0

Entering edit mode

VSM ▴ 120

@vsm-6019

Last seen 7.7 years ago

United Kingdom

Hi there

I have a question about the appropriate salmon-generated gene counts, which take into account gene length, for the purpose of running eqtl analysis. I would much appreciate your suggestions on this.

I used tximport to obtain gene-level counts/abundance estimates for our RNAseq data (estimated using default countsFromAbundance=no). The downstream steps involve running TMM normalisation (to obtain normalised counts), rank transformation and batch correction, before running eqtl analysis.

With this approach, I am concerned that the counts do not take into account the length of genes. Your document here :
https://bioconductor.org/packages/release/bioc/vignettes/tximport/inst/doc/tximport.html
section : Use with downstream Bioconductor differential expression packages

outlines two ways in which one can get the counts ready for DE analysis, by taking into account gene/transcript length. I have a couple of questions:

Do the two methods work in a similar way, and can either be applied to edgeR/DEseq2/limma-voom?

If yes, would counts generated with tximport's countsFromAbundance = "lengthScaledTPM" be the appropriate ones to use for my eqtl pipeline?

If no, would the block of code under edgeR be appropriate to apply for my eqtl processing pipeline too ? I use the output of calcNormFactors to modify my counts. I don't know if/where I should use the offset, or if running calcNormFactors(cts/normMat) would do what I need?

Thanks for your help!

salmon tximport gene length eqtl TMM • 2.9k views

ADD COMMENT • link updated 7.7 years ago by Michael Love 43k • written 7.7 years ago by VSM ▴ 120

Michael Love · Accepted Answer · 2018-04-25

1

Entering edit mode

Michael Love 43k

@mikelove

Last seen 6 days ago

United States

"Do the two methods work in a similar way, and can either be applied to edgeR/DESeq2/limma-voom?"

This is covered in that section of the tximport vignette, but I can review:

You can use edgeR and DESeq2 with the counts plus an offset, OR you can use the countsFromAbundance technique and pass the "counts" matrix to the methods as if they were the original counts. You cannot use the offset with limma-voom because it doesn't accept offsets (or it didn't when we last checked). So for limma-voom you have to use the countsFromAbundance technique.

For passing normalized counts to an eQTL pipeline, if you use DESeqDataSetFromTximport, it will also correct for gene length when you ask for normalized counts. I believe edgeR would as well, but I'm less familiar with the code to get this.

E.g. for DESeq2:

txi <- tximport(files, type="salmon", tx2gene=tx2gene)
coldata <- data.frame(x=seq_along(files)) # dummy coldata...
dds <- DESeqDataSetFromTximport(txi, coldata, ~1)
dds <- estimateSizeFactors(dds)
ncts <- counts(dds, normalized=TRUE)

The second to last line will say "using 'avgTxLength' from assays(dds), correcting for library size", letting you know that the gene lengths were taken into account when estimating library size correction and both are then used to create the normalized counts.

The meaning of elements of this final matrix are: the count that a given sample would have had if it had been sequenced at a similar depth to the "middle" sample in terms of sequencing depth, and if the gene length for that sample was similar to the "middle" sample in terms of gene length.

ADD COMMENT • link 7.7 years ago Michael Love 43k

0

Entering edit mode

Thanks Michael. I still have a few issues ...

1. If I re-import with DESeqDataSetFromTximport, can I supply TMM normalisation factors to your code snippet? (I know they are similar but I guess not identical (?), and I want to stay consistent with previous analyses I run). I don't know what (if any) between sample normalisation counts(.., normalized=TRUE) performs ..

tmm_factors <- calcNormFactors(cts, method="TMM")
...
dds <- estimateSizeFactors(dds) # supply TMM-derived size factors here?
...

2. If I want to estimate the offset, and apply it to modify my counts (with TMM normalisation factors and gene length), would this code estimate offset by taking into account TMM normalisation factors and gene length?

# tximport with default params, ie countsFromAbundance = "no"
cts <- txi$counts
normMat <- txi$length
normMat <- normMat/exp(rowMeans(log(normMat)))

library(edgeR)
o <- log(calcNormFactors(cts/normMat, method="TMM")) + log(colSums(cts/normMat))
d <- DGEList(cts)
d$offset <- t(t(log(normMat)) + o)

if yes, how do I actually apply this offset to modify my counts? Previously, I ran the following to get TMM-modified counts, but didn't take info account gene length:

d<-DGEList(cts)
d<-calcNormFactors(d,method =c("TMM"))
cts_norm <- cpm(d, normalized.lib.sizes=TRUE) # this command doesn't use d$offset ..

ADD REPLY • link updated 7.7 years ago by Michael Love 43k • written 7.7 years ago by VSM ▴ 120

0

Entering edit mode

Sure, in general it's not that hard to use TMM here. To calculate something like a DESeq2 "size factor", you just multiply edgeR's output of calcNormFactors by the column sum, and then scale it to be centered around 1:

sf <- calcNormFactors(x) * colSums(x)
sf <- sf / median(sf)

Now you want normalized counts with TMM normalization and gene length divided out. So you want to obtain the size factor after dividing the gene length matrix out, i.e. when x = cts/normMat.

cts <- txi$counts
normMat <- txi$length
normMat <- normMat/exp(rowMeans(log(normMat)))
x <- cts/normMat
sf <- calcNormFactors(x) * colSums(x)
sf <- sf / median(sf)

Finally, rather than stuff these pieces into a DESeqDataSet, you can skip that and just divide the size factor out of the gene-length-normalized matrix:

ncts <- t(t(x) / sf)