Question: Gene Expression Matrix (CPM)
2.7 years ago, kkadzi25 wrote:

Dear forum members :)

I'm a newb at R and bioinformatics so I am finding myself a little flustered by what to do now. I have obtained a gene expression matrix CPM normalized, with close to 15,000 genes and 16 total samples -> 4 samples in 4 groups. I have been trying to follow the pipelines directed in various packages but I always seem to get lost. So just a couple of questions for all of the great people out here:

1) I would like to get p and q values for the comparison groups as well as their fold changes. To get the fold change, can I just average their expression across 4 samples and apply log2(average(treatment)/average(control))? Can I just apply a t-test to get the p values between groups (when I do it, the p values seem way too high)?

2) What kinds of tests can I do to check that my data is not skewed in any way i.e. overall check for integrity of the data?

3) I keep seeing TMM term being thrown around on the forum but I still can't see clearly what it means...


Thank you all so much. Any help is greatly appreciated :)

modified 2.7 years ago • written 2.7 years ago by kkadzi25

What do you mean by a "CPM normalized" matrix? Do you mean counts-per-million (from the edgeR package perhaps) or something else? Please explain how this matrix was created or where you got this data from so we know what it is. I assume you have RNA-seq counts, although you haven't said that.

You say that you've tried to follow documented pipelines in various packages. The best way forward would be to explain what you've been trying to follow and where you got lost. That would be more constructive than just throwing Bioconductor out entirely and trying to do a very naive analysis by yourself (e.g., by t-test). The most popular packages for RNA-seq analysis are limma, edgeR and DESeq2. These are large sophisticated packages but they are also all very well documented.

TMM is a normalization method in the edgeR package, which is also often used by limma. It will be very easy to apply if you use the limma or edgeR pipelines. You don't need to worry about it separately.
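As a minimal sketch of how that looks when raw counts are available, TMM is a single call in edgeR. The count matrix below is simulated and the group labels are made up purely for illustration; they stand in for whatever the sequencing facility would supply.

```r
library(edgeR)

# Simulated raw counts (an assumption, not real data): 500 genes x 16 samples.
set.seed(1)
counts <- matrix(rnbinom(500 * 16, mu = 50, size = 5), nrow = 500,
                 dimnames = list(paste0("gene", 1:500), paste0("S", 1:16)))

# Four groups of four samples; the group names are hypothetical.
group <- factor(rep(c("control", "trtA", "trtB", "trtC"), each = 4))

# Build the edgeR data object and compute TMM normalization factors.
dge <- DGEList(counts = counts, group = group)
dge <- calcNormFactors(dge, method = "TMM")

# One scaling factor per sample; downstream edgeR/limma steps pick these up.
dge$samples$norm.factors

# CPM values computed using the TMM-adjusted (effective) library sizes.
tmm_cpm <- cpm(dge)
```

The factors are stored in the `DGEList` object, so nothing else has to be done by hand once `calcNormFactors` has been run.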

modified 2.7 years ago • written 2.7 years ago by Gordon Smyth

Thank you Gordon for such a quick reply. Yes, by CPM I mean counts-per-million from edgeR. We submitted the samples to the RNA-Seq facility and what we received was an Excel spreadsheet titled "Gene Expression Matrix (CPM)" with thousands of genes across 16 samples in 4 groups. I have tried to follow the DESeq2 and edgeR user guides, but the pipelines start with raw reads, TMM normalized data or RPKM -> almost never with a CPM matrix.

Would you know of any books/tutorials that could guide me starting from this point, perhaps all the way to the visualization etc.?

As an example, I followed but I really doubt that I can put in my Gene Expression Matrix and start at Step 7...


modified 2.7 years ago • written 2.7 years ago by kkadzi25

This makes a lot more sense now, thank you very much Dr. Smyth. I understand that it'd be much better to obtain the raw read counts and follow the pipeline written in the article, but would you say that using and starting with the Data pre-processing step with the CPM Matrix is a good place to start? Would it yield results that are acceptable to publish?

Thank you.

modified 2.7 years ago • written 2.7 years ago by kkadzi25

A few of the steps in that article (e.g., voom, calcNormFactors, eBayes without trend) are not appropriate if you start with CPM values. It might be a guide though.

Also, please don't post an "Answer" to your own question. If you want to ask a follow-up question, please click on "ADD REPLY" or "ADD COMMENT".

modified 2.7 years ago • written 2.7 years ago by Gordon Smyth
Answer: Gene Expression Matrix (CPM)
2.7 years ago, Gordon Smyth (Walter and Eliza Hall Institute of Medical Research, Melbourne, Australia) wrote:

There are lots of very complete tutorials on how to analyze RNA-seq, for example here is one that I and colleagues published on edgeR a few months ago:

However, none of the DE pipelines start with CPM values, because that isn't a useful place to start. You need to go back to your RNA-seq facility and ask them for more complete data. At the very least, they need to give you the read counts from which they computed the CPM values. Apparently they've already read the data into edgeR, so they've already started the edgeR pipeline. Perhaps there has been some miscommunication. Perhaps the bioinformatician at the RNA-seq facility was expecting that they will do the DE analysis for you?
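To illustrate why the counts matter, here is a small base R sketch of what a plain counts-per-million calculation does. The toy matrix is made up, and edgeR's own `cpm()` can additionally use normalization factors, so treat this as the basic idea only.

```r
# Toy raw counts (an assumption): 3 genes x 2 samples with very different depths.
counts <- matrix(c(10,  0,  90,
                   200, 300, 500),
                 nrow = 3,
                 dimnames = list(c("geneA", "geneB", "geneC"), c("S1", "S2")))

# Library size = total reads per sample.
lib_size <- colSums(counts)

# CPM = count / library size * 1e6, computed column by column.
cpm_manual <- t(t(counts) / lib_size) * 1e6

# Every column now sums to one million, whatever the original depth was.
colSums(cpm_manual)
```

After this scaling the library sizes are gone from the matrix, so the original counts (and the precision that comes with deeper sequencing) cannot be recovered from CPM values alone.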

The RNA-seq facility also needs to give you a complete description of how they have pre-processed the RNA-seq data. Without that, you will not be able to publish it in a reputable journal.

If you really were stuck with nothing but CPM values, then the best approach would be to transform to log2 values:

y <- log2(CPM + 0.1)

and then analyse in the limma package as if it was microarray data, using a limma-trend type analysis. Getting the actual counts (and gene lengths) would be a bit better however.
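A minimal sketch of what such a limma-trend analysis could look like is below. The CPM matrix is simulated, and the group names and the single contrast are assumptions chosen for illustration; a real analysis would use the facility's matrix and the comparisons of interest.

```r
library(limma)

# Simulated stand-in for the CPM spreadsheet (an assumption): 200 genes x 16 samples.
set.seed(1)
CPM <- matrix(rgamma(200 * 16, shape = 2, rate = 0.05), nrow = 200,
              dimnames = list(paste0("gene", 1:200), paste0("S", 1:16)))

# Four groups of four samples; the group names are hypothetical.
group <- factor(rep(c("control", "trtA", "trtB", "trtC"), each = 4))

# Log-transform with a small offset to avoid taking log2 of zero.
y <- log2(CPM + 0.1)

# One coefficient per group, so contrasts can be written by group name.
design <- model.matrix(~ 0 + group)
colnames(design) <- levels(group)

# Fit gene-wise linear models, then extract one comparison.
fit <- lmFit(y, design)
contr <- makeContrasts(trtA - control, levels = design)
fit2 <- contrasts.fit(fit, contr)

# trend = TRUE lets the prior variance depend on average log-expression (limma-trend).
fit2 <- eBayes(fit2, trend = TRUE)

# logFC, moderated t, P.Value and adj.P.Val for every gene.
res <- topTable(fit2, number = Inf, sort.by = "P")
head(res)
```

In the resulting table, `logFC` is the log2 fold change, `P.Value` is the p-value from the moderated t-test, and `adj.P.Val` is the Benjamini-Hochberg adjusted value, which is what most people report as the q-value or FDR.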

modified 2.7 years ago • written 2.7 years ago by Gordon Smyth

Thank you very much for your assistance.

written 2.7 years ago by kkadzi25