DESeq2: Is it possible to convert read counts to expression values via TPM and return these values?
1
4
Entering edit mode
@tombernardryan-12173
Last seen 5.7 years ago

 

Dear all, I am so confused, I would really appreciate help. 

I have a table of read counts from RNASeq data (i.e. just a table, where each column is a sample, and each row is a gene, and the cells are read counts that range from 0 to say 10,000). I want to convert these to TPM values and output a matrix/table of TPM gene expression values (each row still a gene name, each column still a sample name).

I was advised that DESeq2 could do this. I was reading the manual: https://bioc.ism.ac.jp/packages/2.14/bioc/vignettes/DESeq2/inst/doc/beginner.pdf. I thought section 2.4.3 "Starting from count tables" would have all the information I needed. However, the example seems to jump straight from reading in count data, to producing differentially expressed genes. I've found this to be the case with a few places I've read, and I can't seem to find a simple answer anywhere.

Is it possible to input just a table of read counts, and output a table of TMP gene expression values. 

The code would be like:

library("DESeq2")

count_data <-read.table("MatrixOfReadCount",header=T)

And then I was trying to do something like:

countData <-assay(read_table) (but I was getting errors: Unable to find an inherited method for function ‘assay’ for signature ‘"data.frame", "missing"’)

or:

ddsFullCountTable <-DESeqDataSetFromMatrix(countData=read_table)

(I also have other examples of what I tried, but there were all pointless).

I'm just so confused. I've also read other places and forums, and I cannot find how to simply "read in a matrix of read counts (row names = genes, column names = sample), and output a matrix of TPM gene expression values (row names = genes, column names = sample), without doing any differential expression analysis".

If someone could help in any way, I would appreciate it. Can this be done?

deseq2 r • 27k views
ADD COMMENT
14
Entering edit mode
@mikelove
Last seen 4 days ago
United States

If you had access to the FASTQ files, the best way to estimate gene-level TPM in my opinion would be to use fast, lightweight transcript abundance quantifiers like Salmon, Sailfish or kallisto (or RSEM, which is also accurate but not as fast as these other three). These are my preferred way to generate gene-level count matrices, using the Bioconductor tximport package to collate counts and effective lengths into R, and then to load the tximport data into DESeq2 with DESeqDataSetFromTximport().

If you only have counts, you can generate something roughly like TPMs, but you need to know the lengths of the genes. It will be only a rough estimate, because you don't know which isoform for each gene was expressed (or if it was a combination and at what percent). So you'll need to calculate, for each gene (row of your count matrix), what is some number for the length of the gene. We don't have any general utilities in DESeq2 for this for an arbitrary count matrix.

You can create a TPM matrix by dividing each column of the counts matrix by some estimate of the gene length (again this is not ideal for the reasons stated above).

x <- counts.mat / gene.length

Then with this matrix x, you do the following:

tpm.mat <- t( t(x) * 1e6 / colSums(x) )

Such that the columns sum to 1 million.

ADD COMMENT
2
Entering edit mode

Hi Michael,

To get more likely TPM values from counts, could I use Combat-seq to remove batch effect from tximport counts then use batch corrected tximport counts and tximport length(effective gene length) to calculate TPM values? Will that TPM values still far away from a real tximport abundance data?

I know tximport abundance is not calculated from tximport counts. However, combat-seq won't take tximport abundance data. I have read your previous comment - Convert Normalized RNA-Seq counts to TPM after Batch Correction. You have said "if you want batch corrected TPM, I think I would just stay on the abundance scale the whole time (estimating mean shifts and removing them on the log2 scale, and working with log2 TPM for plotting or whatever else you need)." Does it mean working on tximport abundance data directly and remove its batch effects by removing estimating mean shifts on log2 scale? Could you elaborate on it since this method seems quite different from Combat-seq's batch correction method.

Thanks a lot!

Hongwei

ADD REPLY
3
Entering edit mode

This seems pretty reasonable (your first parapraph). I would describe those as batch-effect corrected TPM. If you are using the per-sample effective gene lengths, and then scaling so each column sums to 1e6, that sounds like a good approach.

ADD REPLY
0
Entering edit mode

Thanks a lot for your explanation and help!

ADD REPLY
0
Entering edit mode

Dear Michael Love and all, My objective is to compare gene expression levels within the same group (biological condition) of samples and for that I want to use TPM values. I have used tximport with

 library(tximport) 
 txi <- tximport(files, type = "salmon", tx2gene = tx2gene)
 names(txi)

[1] "abundance" "counts" "length"
[4] "countsFromAbundance"

Will "abundance" will be good for this purpose? Many Thanks

ADD REPLY
1
Entering edit mode

Yes, you this gene-level abundance (TPM) can be compared across genes or samples.

ADD REPLY

Login before adding your answer.

Traffic: 344 users visited in the last hour
Help About
FAQ
Access RSS
API
Stats

Use of this site constitutes acceptance of our User Agreement and Privacy Policy.

Powered by the version 2.3.6