Question: processing of (salmon) gene counts for eqtl analysis
0
gravatar for VSM
12 months ago by
VSM120
United Kingdom
VSM120 wrote:

Hi there

I have a question about the appropriate salmon-generated gene counts, which take into account gene length, for the purpose of running eqtl analysis. I would much appreciate your suggestions on this.

I used tximport to obtain gene-level counts/abundance estimates for our RNAseq data (estimated using default countsFromAbundance=no). The downstream steps involve running TMM normalisation (to obtain normalised counts), rank transformation and batch correction, before running eqtl analysis.

With this approach, I am concerned that the counts do not take into account the length of genes. Your document here :
https://bioconductor.org/packages/release/bioc/vignettes/tximport/inst/doc/tximport.html
section : Use with downstream Bioconductor differential expression packages

outlines two ways in which one can get the counts ready for DE analysis, by taking into account gene/transcript length. I have a couple of questions:

Do the two methods work in a similar way, and can either be applied to edgeR/DEseq2/limma-voom?

If yes, would counts generated with tximport's countsFromAbundance = "lengthScaledTPM" be the appropriate ones to use for my eqtl pipeline?

If no, would the block of code under edgeR be appropriate to apply for my eqtl processing pipeline too ? I use the output of calcNormFactors to modify my counts. I don't know if/where I should use the offset, or if running calcNormFactors(cts/normMat) would do what I need?

Thanks for your help!

ADD COMMENTlink modified 12 months ago by Michael Love23k • written 12 months ago by VSM120
Answer: processing of (salmon) gene counts for eqtl analysis
1
gravatar for Michael Love
12 months ago by
Michael Love23k
United States
Michael Love23k wrote:

"Do the two methods work in a similar way, and can either be applied to edgeR/DESeq2/limma-voom?"

This is covered in that section of the tximport vignette, but I can review:

You can use edgeR and DESeq2 with the counts plus an offset, OR you can use the countsFromAbundance technique and pass the "counts" matrix to the methods as if they were the original counts. You cannot use the offset with limma-voom because it doesn't accept offsets (or it didn't when we last checked). So for limma-voom you have to use the countsFromAbundance technique.

For passing normalized counts to an eQTL pipeline, if you use DESeqDataSetFromTximport, it will also correct for gene length when you ask for normalized counts. I believe edgeR would as well, but I'm less familiar with the code to get this.

E.g. for DESeq2:

txi <- tximport(files, type="salmon", tx2gene=tx2gene)
coldata <- data.frame(x=seq_along(files)) # dummy coldata...
dds <- DESeqDataSetFromTximport(txi, coldata, ~1)
dds <- estimateSizeFactors(dds)
ncts <- counts(dds, normalized=TRUE)

The second to last line will say "using 'avgTxLength' from assays(dds), correcting for library size", letting you know that the gene lengths were taken into account when estimating library size correction and both are then used to create the normalized counts.

The meaning of elements of this final matrix are: the count that a given sample would have had if it had been sequenced at a similar depth to the "middle" sample in terms of sequencing depth, and if the gene length for that sample was similar to the "middle" sample in terms of gene length.

ADD COMMENTlink written 12 months ago by Michael Love23k

Thanks Michael. I still have a few issues ...

1. If I re-import with DESeqDataSetFromTximport, can I supply TMM normalisation factors to your code snippet? (I know they are similar but I guess not identical (?), and I want to stay consistent with previous analyses I run). I don't know what (if any) between sample normalisation counts(.., normalized=TRUE) performs ..

tmm_factors <- calcNormFactors(cts, method="TMM")
...
dds <- estimateSizeFactors(dds) # supply TMM-derived size factors here?
...

2. If I want to estimate the offset, and apply it to modify my counts (with TMM normalisation factors and gene length), would this code estimate offset by taking into account TMM normalisation factors and gene length?

# tximport with default params, ie countsFromAbundance = "no"
cts <- txi$counts
normMat <- txi$length
normMat <- normMat/exp(rowMeans(log(normMat)))

library(edgeR)
o <- log(calcNormFactors(cts/normMat, method="TMM")) + log(colSums(cts/normMat))
d <- DGEList(cts)
d$offset <- t(t(log(normMat)) + o)
  • if yes, how do I actually apply this offset to modify my counts? Previously, I ran the following to get TMM-modified counts, but didn't take info account gene length:
d<-DGEList(cts)
d<-calcNormFactors(d,method =c("TMM"))
cts_norm <- cpm(d, normalized.lib.sizes=TRUE) # this command doesn't use d$offset ..
ADD REPLYlink modified 11 months ago by Michael Love23k • written 11 months ago by VSM120

Sure, in general it's not that hard to use TMM here. To calculate something like a DESeq2 "size factor", you just multiply edgeR's output of calcNormFactors by the column sum, and then scale it to be centered around 1:

sf <- calcNormFactors(x) * colSums(x)
sf <- sf / median(sf)

Now you want normalized counts with TMM normalization and gene length divided out. So you want to obtain the size factor after dividing the gene length matrix out, i.e. when x = cts/normMat.

cts <- txi$counts
normMat <- txi$length
normMat <- normMat/exp(rowMeans(log(normMat)))
x <- cts/normMat
sf <- calcNormFactors(x) * colSums(x)
sf <- sf / median(sf)

Finally, rather than stuff these pieces into a DESeqDataSet, you can skip that and just divide the size factor out of the gene-length-normalized matrix:

ncts <- t(t(x) / sf)
ADD REPLYlink written 11 months ago by Michael Love23k

Michael, that is super helpful. Thank you!!

ADD REPLYlink written 11 months ago by VSM120
Please log in to add an answer.

Help
Access

Use of this site constitutes acceptance of our User Agreement and Privacy Policy.
Powered by Biostar version 16.09
Traffic: 162 users visited in the last hour