Differential expression analysis starting from TPM data
2
4
Entering edit mode
cahidora ▴ 50
@cahidora-13654
Last seen 4.2 years ago

Hello,

I am new in this kind of analysis and I have a .csv file containing RNA-Seq data from different cell lines (with at least 3 replicates) normalised to TPM already, unfortunately I cannot access to the raw counts files.

     Symbol       ID              C1    C2     C3      D1      D2            D3          D4 1     TSPAN6 ENSG00000000003.13 133.95 132.07 64.47   54.85   53.65        47.87        56.37 2       TNMD  ENSG00000000005.5  10.39   3.47  1.11    0.58    1.74         0.36         1.68 3       DPM1 ENSG00000000419.11  67.67 124.98 33.02    8.35   12.95        12.31        13.33 4      SCYL3 ENSG00000000457.12   2.59   1.40  2.61    5.03    4.70         2.98         3.71 5   C1orf112 ENSG00000000460.15  12.32  46.18 16.49   19.54   19.20        11.72         8.55 6        FGR ENSG00000000938.11   0.00   0.00  0.04    0.36    0.08         0.00         0.00

So my question is: Is there a way I can follow to obtain the p-values, t-values and padj starting from this .csv file in order to perform a differential expression analysis? I read about DESeq, DESeq2, EdgeR, limma and it looks like if all the R packages would ask for the raw counts. I would like to perform a Differential Expression Analysis. And I tried to follow Differential expression of RNA-seq data using limma and voom() but it is not working.

Any suggestions about how to start? Any help is very appreciated.

R TPM • 22k views
0
Entering edit mode

You will need to be more clear about "not working", the recommendations in that link are the way to go.

0
Entering edit mode

Here Differential expression of RNA-seq data using limma and voom() I read that Gordon Smyth does not recommend to  use normalised values in DESeq, DESeq2 and edgeR. So I calculated the average of every group (C and D) and then I calculated the log2FC. With those log2FC values, I tried to follow the limma-trend pipeline described in the limma documentation but I always obtain this error  "row dimension of design doesn't match column dimension of data object". The syntax I am using is the following:

f <- factor(colnames(mysample), levels=unique(colnames(mysample)))
design <- model.matrix(~0+f)
rownames(design) <- levels(f)
colSums(design)
log2fc <- as.matrix(log2fc)
fit <- lmFit(log2fc, design)

Any suggestion?

16
Entering edit mode
@gordon-smyth
Last seen 16 minutes ago
WEHI, Melbourne, Australia

In my opinion, there is no good way to do a DE analysis of RNA-seq data starting from the TPM values. TPMs just throw away too much information about the original count sizes. Sorry, but I'm not willing to make any recommendations, except to dissuade people from thinking that TPMs are an adequate summary of an RNA-seq experiment.

Note that it is not possible to create a DGEList object or CPM values from TPMs, so trying to use code designed for these sort of objects will be counter-productive.

I see that some people in the literature have done limma analyses of the log(TPM+1) values and, horrible though that is, I can't actually think of anything better, given TPMs and existing software. One could make this a little better by using eBayes with trend=TRUE and by using arrayWeights() to try to partially recover the library sizes. Please do not take that as a recommendation though!

1
Entering edit mode

There is no one better than you to answer this question (for good or bad). I appreciate very much your recommendations.

0
Entering edit mode

Hi Gordon,

I have used hisat2, stringtie, stringtie merge tools for Transcript-level expression analysis of RNA-seq experiment. Stringtie tool estimates transcript abundances and create table counts for "ballgown" for differential analysis. I see both FPKM and TPM values. As you said above that TPM are most preferred for differential analysis comapred to FPKM, raw counts.

I would like to know whether "limma analyses of the log(TPM+1)" is better or "ballgown" is better for differential analysis?

0
Entering edit mode

As you said above that TPM are most preferred for differential analysis comapred to FPKM, raw counts.

Did you read Gordon's post correctly? Raw counts are the best option for DE analyses, not TPMs or FPKMs. It seems you can get this information from stringtie, which you could then use in voom-limma, edgeR, etc.:

https://ccb.jhu.edu/software/stringtie/index.shtml?t=manual#deseq

... though it is not clear exactly how the transcript/gene-level read counts are recovered.

0
Entering edit mode

Hi Aaron,

Thanks for the information. I have a basic question. After hisat the outputs are bam files. So, we can use featurecounts and htseq to get counts data for each sample and I can use edgeR and deseq2 for differential analysis. But then What will be the use of Stringtie in the analysis?

1
Entering edit mode

For a gene-level DE analysis? None.

The more relevant question is whether you want to do a gene-level or transcript-level analysis. If the latter, the link above suggests that you could get some counts out of stringtie to use in edgeR and co.

0
Entering edit mode

Dear Gordon and Everybody,

Our sequencing core recently switched from STAR to Kallisto so that I either have to work from their TPM values or align the fastq files to the genome myself (I would use Rsubread). Is it recommended to recover the counts from the Kallisto TPMs with tximport? As I understand it such counts will be non-integral. Soneson, Love, and Robinson have presented evidence that the methods they simplesum_avextl is as good or better for differential expression that alignment+featureCounts. In a commentary to that paper, Lior Pachter advocates simple adding the reads mapped to each transcript to get the reads for a gene. I would greatly appreciate Gordon's or someone from his groups input as to whether there is a proper way to get counts from TPMs for input to edgeR or limma-voom.

Thanks and best wishes, Rich Richard Friedman, Columbia University “My father justifies mice. You need to make sure that you have enough mice for an experiment and that you do not have too many. He makes sure that no mouse dies in vain. You are not allowed to use chimps, so you have to use mice”- Rose Friedman, age 22

0
Entering edit mode

And do note that your understanding of kallisto and tximport is incorrect. Kallisto reports estimated counts, which is by default the value used by tximport, not the TPM values.

0
Entering edit mode

Jim,

Thank you for the correction with respect to how to ask my question. One of the files I received from the sequencing core was labeled tpmvaluesgenes_kallisto.txt, which gave me the impression that tpm was teh primary quantity.

Best wishes, Rich

0
Entering edit mode

Jim,

Thank you for the correction with respect to how to ask my question. One of the files I received from the sequencing core was labeled tpmvaluesgenes_kallisto.txt, which gave me the impression that tpm was teh primary quantity.

Best wishes, Rich

0
Entering edit mode

Jim,

From the original Kallisto paper,Bray, et al., Nature Biotech 34, p.525, online methods: "The transcript abundances are output by Kalllisto in transcripts per million (TPM) units". I will rephrase my question as a separate query, incorporating your point about estimated counts. Thank you as always for your help.

Best wishes, Rich

0
Entering edit mode
Aaron Lun ★ 27k
@alun
Last seen 16 hours ago
The city by the bay

It doesn't make any sense to fit a linear model to the log-fold changes between groups. Obviously a design matrix constructed from the samples will not have the same dimensions as the matrix of log-fold changes between groups, hence the error. If you want to use limma-trend, you should fit the model to the log-CPMs.

0
Entering edit mode

Gordon suggests this:

logCPM <- cpm(dge, log=TRUE, prior.count=3)
fit <- lmFit(logCPM, design)
fit <- eBayes(fit, trend=TRUE)
topTable(fit, coef=ncol(design)) 

that is why I was trying to create the variable "design". In "dge" i am using the log2FC values, is that right?

0
Entering edit mode

No, dge should contain a count matrix or a DGEList object. Read ?cpm. If you already have a matrix of log-CPMs (columns = samples, rows = genes), then there is no need to run cpm.