Question: Is there any point to TMM-normalize TPM prior to limma-voom?
johnmcma wrote (2.1 years ago):

The current tximport vignette recommends the following workflow for limma-voom input:

library(tximport)
library(readr)   # for read_tsv
library(edgeR)   # DGEList, calcNormFactors
library(limma)   # voom

files <- file.path(dir, "kallisto", samples$run, "abundance.tsv")
names(files) <- paste0("sample", 1:6)
txi <- tximport(files, type = "kallisto", tx2gene = tx2gene, reader = read_tsv,
    countsFromAbundance = "lengthScaledTPM")
y <- DGEList(txi$counts)
y <- calcNormFactors(y)
design <- model.matrix(~condition, data = sampleTable)
v <- voom(y, design)

I'm a bit confused by the logic here. I have read the tximport source code and noticed that the "lengthScaledTPM" mode pretty much does what its name says: it returns length-scaled TPM (or FPKM, in some cases) as if they were read counts. My understanding is that TMM normalization (or DESeq2's normalization, for that matter) is designed for raw read counts, rather than length-normalized measures such as TPM. So what is the point of running calcNormFactors here?


Tags: edger, voom, tximport, limma-voom
modified 2.1 years ago by Ryan C. Thompson • written 2.1 years ago by johnmcma
Answer: Is there any point to TMM-normalize TPM prior to limma-voom?
Ryan C. Thompson (The Scripps Research Institute, La Jolla, CA) wrote (2.1 years ago):

tximport should always give you a matrix of values on a raw count scale. The only difference is whether it takes the estimated counts reported directly by kallisto, or whether it starts from the TPM values and transforms them back to counts. Either way you have a count matrix, not a TPM matrix. That said, I see no reason why either TMM or DESeq's normalization wouldn't work on TPM values. It is still preferable to run them on counts, but my guess is that the difference would be minimal in practice.

By the way, neither normalization method is terribly complicated, so I encourage you to read the papers and understand the methods; then you will be able to decide for yourself when they are applicable.

modified 2.1 years ago • written 2.1 years ago by Ryan C. Thompson

FYI, the precision weights in TMM are derived from a binomial distribution and won't make much sense for non-count data. It probably doesn't do much harm either, but the weighting can be turned off with doWeighting=FALSE in calcNormFactors.
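To make that concrete: with doWeighting=TRUE, TMM weights each gene's log-ratio between a sample and the reference by the inverse of its delta-method (asymptotic binomial) variance, so low-count genes get low weight. A toy base-R sketch with hypothetical counts (this mimics the formula from the TMM paper, not edgeR's actual code):

```r
# Hypothetical counts for 3 genes in a test sample and a reference sample
y     <- c(100, 50, 10)   # test-sample counts
y.ref <- c( 80, 60, 20)   # reference-sample counts
N     <- sum(y)           # library size of test sample
N.ref <- sum(y.ref)       # library size of reference sample

# Delta-method variance of the log-ratio M_g = log2((y/N) / (y.ref/N.ref))
v <- (N - y) / (N * y) + (N.ref - y.ref) / (N.ref * y.ref)

# TMM-style precision weights: the higher the counts, the higher the weight
w <- 1 / v
```

On TPM-scale values the entries no longer follow a binomial sampling model, so these weights lose their interpretation, which is why passing doWeighting=FALSE can make sense there.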

written 2.1 years ago by Aaron Lun

Well, by setting the countsFromAbundance argument, the counts slot of the tximport output is derived from the abundance measure named in the call (by default, TPM for the "fishes" — kallisto/salmon/sailfish — and FPKM for RSEM). I'm citing the code in question below:

makeCountsFromAbundance <- function(countsMat, abundanceMat, lengthMat,
                                    countsFromAbundance) {
  countsSum <- colSums(countsMat)
  if (countsFromAbundance == "lengthScaledTPM") {
    newCounts <- abundanceMat * rowMeans(lengthMat)
  } else if (countsFromAbundance == "scaledTPM") {
    newCounts <- abundanceMat
  } else {
    stop("expecting 'lengthScaledTPM' or 'scaledTPM'")
  }
  newSum <- colSums(newCounts)
  countsMat <- t(t(newCounts) * (countsSum / newSum))
  countsMat
}
written 2.1 years ago by johnmcma

In a TPM matrix, the columns sum to 1e6. The matrices tximport returns in the "counts" slot have column sums equal to the number of fragments aligned to transcripts by the quantifiers. That's what Ryan means by "you have a count matrix, not a TPM matrix".

You can think of the countsFromAbundance counts matrix as "regressing out" differences in counts due to changes in gene length across samples (from differential transcript usage), instead of using an offset to account for this.
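As a toy numeric sketch of that column rescaling (hypothetical numbers; this mirrors the "scaledTPM" branch of the code johnmcma quoted):

```r
# A TPM matrix: each column sums to 1e6 by construction
tpm <- cbind(s1 = c(2e5, 3e5, 5e5),
             s2 = c(1e5, 4e5, 5e5))

# Total fragments aligned per sample (hypothetical library sizes)
libSize <- c(s1 = 2.5e6, s2 = 3.1e6)

# Rescale each column so its sum equals that sample's fragment total,
# as tximport does for countsFromAbundance = "scaledTPM"
scaledTPM <- t(t(tpm) * (libSize / colSums(tpm)))
colSums(scaledTPM)  # equals libSize: the result is on the count scale
```

The relative abundances within each sample are untouched; only the column totals change, which is what puts the matrix back on a count scale.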

written 2.1 years ago by Michael Love
Powered by Biostar version 16.09