Question

TPM normalization starting with read counts

0

Entering edit mode

JAcky • 0

@d6b8183e

Last seen 18 months ago

Israel

Hello everyone

I have multiple bulk RNA-seq datasets that I need to apply the same pipe line on. I want to normalize them from counts data to TPM. In all datasets, I have the genes as rows, and samples as columns.

Unfortunately, I don't have the fastq files, all I have is the counts data and from there I need to get to TPM, so I cant use salmon, kallisto or any pckage that uses fastq files for that matter. To normalize, I need the length of the genes.

This is my code to get the lengths of the genes:

exons = exonsBy(EnsDb.Hsapiens.v86, by="gene")
exons = reduce(exons)
len = sum(width(exons))
insect = intersect(rownames(counts_df),names(len))
geneLengths = len[insect]
counts_df = counts_df[insect,]

And then I use the geneLengths in the normalization process:

rpkm <- apply(X = subset(counts_df),
               MARGIN = 2,
               FUN = function(x) {
                 10^9 * x / geneLengths / sum(as.numeric(x))
               })

TPM <- apply(rpkm, 2, function(x) x / sum(as.numeric(x)) * 10^6) %>% as.data.frame()

However, this gives me a long list of warnings like this:

Warning messages:
1: In 10^9 * x/genelength :
  longer object length is not a multiple of shorter object length
2: In 10^9 * x/genelength :
  longer object length is not a multiple of shorter object length
3: In 10^9 * x/genelength :
  longer object length is not a multiple of shorter object length
4: In 10^9 * x/genelength :
  longer object length is not a multiple of shorter object length
5: In 10^9 * x/genelength :
  longer object length is not a multiple of shorter object length
6: In 10^9 * x/genelength :
  longer object length is not a multiple of shorter object length

I have been told that edgeR has the ability to normalize, so I tried this also:

y <- DGEList(counts=counts_df , genes=data.frame(Length=geneLengths))
y <- calcNormFactors(y)
RPKM <- rpkm(y)

But this crashed pretty quickly, might be due to the big number of genes in a dataframe (more than 20,000 genes).

I have also been told the Deseq2 can normalize the data, but as far as I know Deseq2 normalizes in it's own way. Can Deseq2 nromalize to TPM data?

If not, can you guys please guide me in the normalization process and the code for that in r? there seems to be a bit of mess because without fastq files I can't know which isoform of each gene is being calculated.

edgeR DESeq2 • 4.4k views

ADD COMMENT • link updated 20 months ago by Gordon Smyth 50k • written 20 months ago by JAcky • 0

0

Entering edit mode

Neither edgeR nor DESeq2 output TPM directly. TPM is easy to calculate though if counts and appropriate length info is available (DESeq2: Is it possible to convert read counts to expression values via TPM and return these values?). Gene length is not so trivial to get though (how to calculate gene length to be used in rpkm() in edgeR) but people told you about this already over at biostars (https://www.biostars.org/p/9534546/).

ADD REPLY • link 20 months ago ATpoint ★ 4.0k

score 1 · Answer 1 · 2022-08-15

1

Entering edit mode

Gordon Smyth 50k

@gordon-smyth

Last seen 2 hours ago

WEHI, Melbourne, Australia

RPKM <- rpkm(y) But this crashed pretty quickly, might be due to the big number of genes in a dataframe (more than 20,000 genes).

rpkm() is extremely fast on 20,000 genes. It will take a fraction of a second even on millions on genes. The problem would seem from the warnings to be that you have input the wrong number of gene lengths.

Elementary checking would include examining dim(counts_df) and length(geneLengths).

However, in my opinion, unless you know the gene annotation that was used to create the read counts in the first place, it is not possible to compute rpkm or tpm in a meaningful fashion.

ADD COMMENT • link 20 months ago Gordon Smyth 50k

0

Entering edit mode

Gordon Smyth So you think my method in TPM normalization that I introduced above is not good? Is there any way you suggest to improve the normalization process?

By the way, the edgeR::rpkm() function worked, and it yields the exact same results as my method. So it didn't help much. And regarding the gene lengths, I googled the results of some of the gene lengths, they are correct. For example this gene: https://www.genecards.org/cgi-bin/carddisp.pl?gene=MIR3975

It has length of 70, I also got 70.

ADD REPLY • link 20 months ago JAcky • 0

0

Entering edit mode

So you think my method in TPM normalization that I introduced above is not good?

Your computation of RPKM using edgeR is fine, provided that the gene lengths from EnsDb.Hsapiens.v86 match the gene annotation originally used to generate the gene counts.

I don't want to give advice about TPM. edgeR is not designed to produce TPMs and I don't recommend TPMs for downstream analyses. Rescaling the RPKM matrix so that every column has the same sum, which is what you have done to get approximate TPM, will undo the library size normalization done by calcNormFactors. That seems a backward step to me. It's hard to imagine what downstream analysis you could be doing that would require you to do that. Why not use the RPKM directly?

the edgeR::rpkm() function worked

Why then do you post a question claiming that it crashed?

it yields the exact same results as my method

The edgeR code shown above will not give the same RPKM as your code.

I googled the results of some of the gene lengths, they are correct.

There is no unique definition of gene length. It depends on what exon annotation you are using and how the genewise counts were generated (which you have not told us). The ambiguity arises for multi-transcript genes rather than for a microRNAs like MIR3975 that only have one exon. ATpoint has already given you a link to my advice about gene lengths.

ADD REPLY • link 20 months ago Gordon Smyth 50k