Dear forum members :)
I'm a newb at R and bioinformatics, so I am finding myself a little flustered about what to do now. I have obtained a CPM-normalized gene expression matrix with close to 15,000 genes and 16 samples in total (4 groups of 4 samples). I have been trying to follow the pipelines described in various packages, but I always seem to get lost. So just a couple of questions for all of the great people out here:
1) I would like to get p and q values for the comparison groups as well as their fold changes. To get the fold change, can I just average the expression across the 4 samples in each group and apply log2(average(treatment)/average(control))? Can I just apply a t-test to get the p values between groups (when I do it, the p values seem way too high)?
2) What kinds of tests can I do to check that my data is not skewed in any way, i.e. an overall check of the integrity of the data?
3) I keep seeing the term TMM being thrown around on the forum, but I still can't see clearly what it means...
Thank you all so much. Any help is greatly appreciated :)
What do you mean by a "CPM normalized" matrix? Do you mean counts-per-million (from the edgeR package perhaps) or something else? Please explain how this matrix was created or where you got this data from so we know what it is. I assume you have RNA-seq counts, although you haven't said that.
You say that you've tried to follow documented pipelines in various packages. The best way forward would be to explain what you've been trying to follow and where you got lost. That would be more constructive than just throwing Bioconductor out entirely and trying to do a very naive analysis by yourself (e.g., by t-test). The most popular packages for RNA-seq analysis are limma, edgeR and DESeq2. These are large, sophisticated packages, but they are also all very well documented.
TMM is a normalization method in the edgeR package, which is also often used by limma. It will be very easy to apply if you use the limma or edgeR pipelines. You don't need to worry about it separately.
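To make the two terms concrete, here is a minimal base R sketch of what counts-per-million means and where a TMM-style normalization factor enters the calculation. The counts and the `norm.factors` values below are made up purely for illustration; in a real analysis edgeR's `calcNormFactors()` estimates the factors and `cpm()` applies them.

```r
# Toy count matrix: 3 genes x 2 samples (made-up numbers).
# sample2 was simply sequenced three times deeper than sample1.
counts <- matrix(c(10, 20, 70,
                   30, 60, 210),
                 nrow = 3,
                 dimnames = list(c("geneA", "geneB", "geneC"),
                                 c("sample1", "sample2")))

# Library size = total number of counts in each sample.
lib.size <- colSums(counts)

# Plain CPM: each count divided by its sample's library size, times one million.
cpm.plain <- t(t(counts) / lib.size) * 1e6

# Hypothetical normalization factors (illustrative values, not estimated here).
# They rescale each library size to an "effective" library size.
norm.factors <- c(sample1 = 0.8, sample2 = 1.25)
eff.lib.size <- lib.size * norm.factors

# Normalized CPM uses the effective library size instead of the raw one.
cpm.norm <- t(t(counts) / eff.lib.size) * 1e6
```

In `cpm.plain` the two columns are identical, because CPM removes the difference in sequencing depth. The normalization factors then adjust for differences in composition between samples that library size alone does not capture.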
Thank you Gordon for such a quick reply. Yes, by CPM I mean counts-per-million from edgeR. We submitted the samples to the RNA-Seq facility and what we received was an Excel spreadsheet titled "Gene Expression Matrix (CPM)" with thousands of genes across 16 samples in 4 groups. I have tried to follow the DESeq2 and edgeR user guides, but the pipelines start with raw reads, TMM-normalized data or RPKM -> almost never with a CPM matrix.
Would you know of any books/tutorials that could guide me starting from this point, perhaps all the way to the visualization etc.?
As an example, I followed http://www-huber.embl.de/users/klaus/Teaching/DESeq2Predoc2014.html#differential-expression-analysis but I really doubt that I can plug in my Gene Expression Matrix and start at Step 7...
This makes a lot more sense now, thank you very much Dr. Smyth. I understand that it'd be much better to obtain the raw read counts and follow the pipeline written in the article, but would you say that using https://www.bioconductor.org/help/workflows/RNAseq123/ and starting with the Data pre-processing step with the CPM Matrix is a good place to start? Would it yield results that are acceptable to publish?
Thank you.
A few of the steps in that article (e.g., voom, calcNormFactors, eBayes without trend) are not appropriate if you start with CPM values. It might be a guide though.
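One possible way to start from a CPM matrix while avoiding those steps is to fit limma's linear models to log2-CPM values and call `eBayes()` with `trend = TRUE`. The following is only a sketch under several assumptions: the matrix `cpm` is simulated here as a stand-in for the facility's spreadsheet, the group labels `A` to `D` and the offset of 1 added before taking logs are arbitrary illustrative choices, and the columns are assumed to be ordered by group.

```r
library(limma)

# Simulated stand-in for the real CPM matrix (genes x samples).
# Replace this with the matrix read from the spreadsheet.
set.seed(1)
cpm <- matrix(rgamma(1000 * 16, shape = 2, scale = 50), nrow = 1000,
              dimnames = list(paste0("gene", 1:1000), paste0("s", 1:16)))

# Log-transform; the offset of 1 is an illustrative choice to avoid log2(0).
logcpm <- log2(cpm + 1)

# Four groups of four samples (hypothetical labels, columns ordered by group).
group <- factor(rep(c("A", "B", "C", "D"), each = 4))
design <- model.matrix(~ 0 + group)
colnames(design) <- levels(group)

# Fit a linear model to each gene, then extract one comparison of interest.
fit <- lmFit(logcpm, design)
contr <- makeContrasts(BvsA = B - A, levels = design)
fit2 <- contrasts.fit(fit, contr)

# trend = TRUE lets the prior variance depend on average log-expression.
fit2 <- eBayes(fit2, trend = TRUE)

# Table of logFC, P.Value and adj.P.Val (Benjamini-Hochberg) for every gene.
res <- topTable(fit2, coef = "BvsA", number = Inf)
head(res)
```

In this table `logFC` is the difference between the group means of the log2 values, which is not the same as log2 of the ratio of the mean CPMs, and `adj.P.Val` is the Benjamini-Hochberg adjusted p-value that is usually reported where a q-value is wanted.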
Also, please don't post an "Answer" to your own question. If you want to ask a follow-up question, please click on "ADD REPLY" or "ADD COMMENT".