Question: Differential expression analysis starting from TPM data
1
gravatar for cahidora
23 months ago by
cahidora20
cahidora20 wrote:

Hello,

I am new in this kind of analysis and I have a .csv file containing RNA-Seq data from different cell lines (with at least 3 replicates) normalised to TPM already, unfortunately I cannot access to the raw counts files.

     Symbol       ID              C1    C2     C3      D1      D2            D3          D4
1     TSPAN6 ENSG00000000003.13 133.95 132.07 64.47   54.85   53.65        47.87        56.37
2       TNMD  ENSG00000000005.5  10.39   3.47  1.11    0.58    1.74         0.36         1.68
3       DPM1 ENSG00000000419.11  67.67 124.98 33.02    8.35   12.95        12.31        13.33
4      SCYL3 ENSG00000000457.12   2.59   1.40  2.61    5.03    4.70         2.98         3.71
5   C1orf112 ENSG00000000460.15  12.32  46.18 16.49   19.54   19.20        11.72         8.55
6        FGR ENSG00000000938.11   0.00   0.00  0.04    0.36    0.08         0.00         0.00

So my question is: Is there a way I can follow to obtain the p-values, t-values and padj starting from this .csv file in order to perform a differential expression analysis? I read about DESeq, DESeq2, EdgeR, limma and it looks like if all the R packages would ask for the raw counts. I would like to perform a Differential Expression Analysis. And I tried to follow Differential expression of RNA-seq data using limma and voom() but it is not working.

Any suggestions about how to start? Any help is very appreciated.

R tpm • 8.4k views
ADD COMMENTlink modified 23 months ago by Gordon Smyth37k • written 23 months ago by cahidora20

You will need to be more clear about "not working", the recommendations in that link are the way to go.

ADD REPLYlink written 23 months ago by Aaron Lun24k

Here Differential expression of RNA-seq data using limma and voom() I read that Gordon Smyth does not recommend to  use normalised values in DESeq, DESeq2 and edgeR. So I calculated the average of every group (C and D) and then I calculated the log2FC. With those log2FC values, I tried to follow the limma-trend pipeline described in the limma documentation but I always obtain this error  "row dimension of design doesn't match column dimension of data object". The syntax I am using is the following:

f <- factor(colnames(mysample), levels=unique(colnames(mysample)))
design <- model.matrix(~0+f)
rownames(design) <- levels(f)
colSums(design)
log2fc <- as.matrix(log2fc)
fit <- lmFit(log2fc, design)

Any suggestion?

ADD REPLYlink written 23 months ago by cahidora20
Answer: Differential expression analysis starting from TPM data
9
gravatar for Gordon Smyth
23 months ago by
Gordon Smyth37k
Walter and Eliza Hall Institute of Medical Research, Melbourne, Australia
Gordon Smyth37k wrote:

In my opinion, there is no good way to do a DE analysis of RNA-seq data starting from the TPM values. TPMs just throw away too much information about the original count sizes. Sorry, but I'm not willing to make any recommendations, except to dissuade people from thinking that TPMs are an adequate summary of an RNA-seq experiment.

Note that it is not possible to create a DGEList object or CPM values from TPMs, so trying to use code designed for these sort of objects will be counter-productive.

I see that some people in the literature have done limma analyses of the log(TPM+1) values and, horrible though that is, I can't actually think of anything better, given TPMs and existing software. One could make this a little better by using eBayes with trend=TRUE and by using arrayWeights() to try to partially recover the library sizes. Please do not take that as a recommendation though!

ADD COMMENTlink modified 23 months ago • written 23 months ago by Gordon Smyth37k
1

There is no one better than you to answer this question (for good or bad). I appreciate very much your recommendations.

ADD REPLYlink written 23 months ago by cahidora20

Hi Gordon,

I have used hisat2, stringtie, stringtie merge tools for Transcript-level expression analysis of RNA-seq experiment. Stringtie tool estimates transcript abundances and create table counts for "ballgown" for differential analysis. I see both FPKM and TPM values. As you said above that TPM are most preferred for differential analysis comapred to FPKM, raw counts. 

I would like to know whether "limma analyses of the log(TPM+1)" is better or "ballgown" is better for differential analysis?

  

ADD REPLYlink written 15 months ago by Biologist70

As you said above that TPM are most preferred for differential analysis comapred to FPKM, raw counts.

Did you read Gordon's post correctly? Raw counts are the best option for DE analyses, not TPMs or FPKMs. It seems you can get this information from stringtie, which you could then use in voom-limma, edgeR, etc.:

https://ccb.jhu.edu/software/stringtie/index.shtml?t=manual#deseq

... though it is not clear exactly how the transcript/gene-level read counts are recovered.

ADD REPLYlink modified 15 months ago • written 15 months ago by Aaron Lun24k

Hi Aaron,

Thanks for the information. I have a basic question. After hisat the outputs are bam files. So, we can use featurecounts and htseq to get counts data for each sample and I can use edgeR and deseq2 for differential analysis. But then What will be the use of Stringtie in the analysis?

ADD REPLYlink written 15 months ago by Biologist70
1

For a gene-level DE analysis? None.

The more relevant question is whether you want to do a gene-level or transcript-level analysis. If the latter, the link above suggests that you could get some counts out of stringtie to use in edgeR and co.

ADD REPLYlink written 15 months ago by Aaron Lun24k

Dear Gordon and Everybody,

Our sequencing core recently switched from STAR to Kallisto so that I either have to work from their TPM values or align the fastq files to the genome myself (I would use Rsubread). Is it recommended to recover the counts from the Kallisto TPMs with tximport? As I understand it such counts will be non-integral. Soneson, Love, and Robinson have presented evidence that the methods they simplesum_avextl is as good or better for differential expression that alignment+featureCounts. In a commentary to that paper, Lior Pachter advocates simple adding the reads mapped to each transcript to get the reads for a gene. I would greatly appreciate Gordon's or someone from his groups input as to whether there is a proper way to get counts from TPMs for input to edgeR or limma-voom.

Thanks and best wishes, Rich Richard Friedman, Columbia University “My father justifies mice. You need to make sure that you have enough mice for an experiment and that you do not have too many. He makes sure that no mouse dies in vain. You are not allowed to use chimps, so you have to use mice”- Rose Friedman, age 22

ADD REPLYlink written 5 months ago by raf420

Please don't just add comments to old posts. If you want to ask a new question (particularly if you want to ask a question that isn't already answered in the existing thread).

And do note that your understanding of kallisto and tximport is incorrect. Kallisto reports estimated counts, which is by default the value used by tximport, not the TPM values.

ADD REPLYlink written 5 months ago by James W. MacDonald50k

Jim,

Thank you for the correction with respect to how to ask my question. One of the files I received from the sequencing core was labeled tpmvaluesgenes_kallisto.txt, which gave me the impression that tpm was teh primary quantity.

Best wishes, Rich

ADD REPLYlink written 5 months ago by raf420

Jim,

Thank you for the correction with respect to how to ask my question. One of the files I received from the sequencing core was labeled tpmvaluesgenes_kallisto.txt, which gave me the impression that tpm was teh primary quantity.

Best wishes, Rich

ADD REPLYlink written 5 months ago by raf420

Jim,

From the original Kallisto paper,Bray, et al., Nature Biotech 34, p.525, online methods: "The transcript abundances are output by Kalllisto in transcripts per million (TPM) units". I will rephrase my question as a separate query, incorporating your point about estimated counts. Thank you as always for your help.

Best wishes, Rich

ADD REPLYlink written 5 months ago by raf420
Answer: Differential expression analysis starting from TPM data
0
gravatar for Aaron Lun
23 months ago by
Aaron Lun24k
Cambridge, United Kingdom
Aaron Lun24k wrote:

It doesn't make any sense to fit a linear model to the log-fold changes between groups. Obviously a design matrix constructed from the samples will not have the same dimensions as the matrix of log-fold changes between groups, hence the error. If you want to use limma-trend, you should fit the model to the log-CPMs.

ADD COMMENTlink modified 23 months ago • written 23 months ago by Aaron Lun24k

Gordon suggests this:

logCPM <- cpm(dge, log=TRUE, prior.count=3)
fit <- lmFit(logCPM, design)
fit <- eBayes(fit, trend=TRUE)
topTable(fit, coef=ncol(design)) 

that is why I was trying to create the variable "design". In "dge" i am using the log2FC values, is that right?

ADD REPLYlink written 23 months ago by cahidora20

No, dge should contain a count matrix or a DGEList object. Read ?cpm. If you already have a matrix of log-CPMs (columns = samples, rows = genes), then there is no need to run cpm.

ADD REPLYlink modified 23 months ago • written 23 months ago by Aaron Lun24k
Please log in to add an answer.

Help
Access

Use of this site constitutes acceptance of our User Agreement and Privacy Policy.
Powered by Biostar version 16.09
Traffic: 338 users visited in the last hour