Question: Is there any point to TMM-normalize TPM prior to limma-voom?
gravatar for johnmcma
2.3 years ago by
johnmcma10 wrote:

The current tximport vignette recommends the following workflow for limma-voom input:

files <- file.path(dir, "kallisto", samples$run, "abundance.tsv")

names(files) <- paste0("sample", 1:6)
txi <- tximport(files, type = "kallisto", tx2gene = tx2gene, reader = read_tsv,
    countsFromAbundance = "lengthScaledTPM")
y <- DGEList(txi$counts)
y <- calcNormFactors(y)
design <- model.matrix(~condition, data = sampleTable)
v <- voom(y, design)

I'm a bit confused with the logic here. I have read the tximport source code, and noticed the "lengthScaledTPM" mode pretty much does what it is--giving out the length-scaled TPM (or FPKM in some cases) as if it was read counts. I understand TMM normalization (or DESeq2's normalization for that matter) is designed for raw reads, rather than length-normalized metrics such as TPM. So what's the point in running calcNormFactors?


edger voom tximport limma-voom • 2.2k views
ADD COMMENTlink modified 2.3 years ago by Ryan C. Thompson7.3k • written 2.3 years ago by johnmcma10
Answer: Is there any point to TMM-normalize TPM prior to limma-voom?
gravatar for Ryan C. Thompson
2.3 years ago by
The Scripps Research Institute, La Jolla, CA
Ryan C. Thompson7.3k wrote:

tximport should always giving you a matrix of values on a raw count scale. The only difference is whether it takes the estimated counts reported directly by Kallisto or whether it starts from the TPM values and transforms them back to counts. So either way you have a count matrix, not a TPM matrix. Regardless, I see no reason that either TMM or DESeq's normalization shouldn't work on TPM values. It's still preferable to run it on the counts, but my guess is that the difference would be minimal in practice.

By the way, neither normalization methods is terrible complicated, so I encourage you to read the papers and understand the methods so you will be able to decide for yourself when they are applicable.

ADD COMMENTlink modified 2.3 years ago • written 2.3 years ago by Ryan C. Thompson7.3k

FYI, the precision weights from TMM are calculated from a binomial distribution, and won't make much sense for non-count data. It probably doesn't do much harm, either, but it can be turned off with doWeighting=FALSE.

ADD REPLYlink written 2.3 years ago by Aaron Lun24k

Well, by triggering the argument countsFromAbundance, the counts slot of the tximport output is derived using the abundance measure named in the command (by default, TPM for the fishes and FPKM for RSEM). I'm citing the code in question below:

makeCountsFromAbundance <- function(countsMat, abundanceMat, lengthMat, countsFromAbundance) {
  countsSum <- colSums(countsMat)
  if (countsFromAbundance == "lengthScaledTPM") {
    newCounts <- abundanceMat * rowMeans(lengthMat)
  } else if (countsFromAbundance == "scaledTPM") {
    newCounts <- abundanceMat
  } else {
    stop("expecting 'lengthScaledTPM' or 'scaledTPM'")
  newSum <- colSums(newCounts)
  countsMat <- t(t(newCounts) * (countsSum/newSum))
ADD REPLYlink written 2.3 years ago by johnmcma10

So, the TPM matrix the columns sum to 1e6. The matrices tximport returns in the "counts" slot have column sum equal to the number of fragments aligned to transcripts by the quantifiers. That's what Ryan means by "you have a count matrix, not a TPM matrix". 

You can think of the countsFromAbundance counts matrix as "regressing out" differences in counts due to changes in gene length across samples (from DTU), instead of using an offset to account for this.

ADD REPLYlink written 2.3 years ago by Michael Love24k
Please log in to add an answer.


Use of this site constitutes acceptance of our User Agreement and Privacy Policy.
Powered by Biostar version 16.09
Traffic: 182 users visited in the last hour