Question: About differentially expressed genes that found by edgeR
0
3.2 years ago by
Sara0
Sara0 wrote:

Hi all experts,

After doing DE analysis by edgeR, we can find the number of up and down-expressed genes by something like, "summary(de <- decideTestsDGE(lrt5))". I would  highly appreciate if you could please help me out to figure out the following issues, which I cannot completely understand with reading and browsing net.

1) Please let me know how fold change (FC) is calculated during GLM method, it's based on cpm? say we compare  A and B conditions, fold change of DE genes are calculated as cpmA / cpmB or there is another story? if your response is positive, why the software doesn't use RPKM(FPKM) that take into account the gene length? is there any priority for using CPM in relative to FPKM?

2) Please kindly tell me how we can export DE genes with all related information, like logFC, logCPM, P-value and FDR from edgeR software?

3) As I found on net, we can convert cpm to rpkm via this formula: RPKM = 2^(logCPM-log2(geneLength)). Please let me know if the same formula can be ued for FPKM?

4) If I correctly understood from edgeR manual, drawing heatmap by logCPM is more desirable than logFPKM? if it's right, please kindly tell me why there is such preference?

edger cpm de analysis fold change • 1.0k views
modified 3.2 years ago by Ryan C. Thompson7.4k • written 3.2 years ago by Sara0
1
3.2 years ago by
The Scripps Research Institute, La Jolla, CA
Ryan C. Thompson7.4k wrote:

Fold changes are not calculated as simple ratios. Assuming you are using the GLM pipeline, FC is calculated by fitting a negative binomial generalized linear model to your design matrix. The fitted coefficients, or differences between them, are the log fold changes. You can read the edgeR papers for more information.

The differences between CPM, RPKM, FPKM, etc. have been discussed repeatedly and at length both on this site and elsewhere. Here's a good primer: https://haroldpimentel.wordpress.com/2014/05/08/what-the-fpkm-a-review-rna-seq-expression-units/ edgeR and other similar tools use raw counts because their statistical model assumes that the input data is raw counts (or at least on the same scale as raw counts).

For exporting a table of results, look at topTags followed by as.data.frame.

I rarely use heatmaps since I've never found heatmaps to be particularly informative, so I can't advise you on that issue.

1

1. Have a look at ?predFC. This function adds a prior count to avoid undefined or unstable log-fold changes due to zeroes and low counts respectively, and then fits a GLM to these modified counts. The coefficients corresponding to differences between conditions are then directly used as the log-fold changes. The reason that we don't correct for gene length is because it would just cancel out when you compute the (log-)fold change for any given gene.
2. If you just access the \$table member of the topTags output, you don't even need to coerce it to a data.frame.
3. The gene length needs to be in terms of number of kilobases. If you must compute these values, use rpkm instead. This function can also be used for FPKM, which is basically the same as RPKM (see Ryan's link).
4. There's no point using log-RPKM for heatmaps because the gene length will cancel out when you perform mean-centering of the rows. See the relevant section of http://f1000research.com/articles/5-1438 for details on how to mean-center.

Thank you very much for all responses. They were really helpful. However, I'm still confused about logFC. I did the similar analysis using edgeR within Trinity package, it uses just exact test for DE analysis; here, when two samples,A and B (A-B), compared, genes with higher FPKM in sample B than A get the positive logFC and vice versa. I was wondering if it is correct when DE analysis is done by edgeR in R (out of Trinity package)? my mean is can we judge the up and down-regulated genes among DE genes resulted from edgeR analysis by looking at their FPKM values?

For exporting DE genes from edgeR, I used the following commands

> result <- topTags(lrt5, p=0.05, adjust.method="BH", sort.by="logFC")
> write.table(result, "DEexport.txt" )

But, the "DEexport.txt" contained just 10 rows, while the number of DE is about 1500 based on the output of "summary(de <- decideTestsDGE(lrt5)) ". Could you please let me know how I can get all 1500 DE genes?

Regarding the log-fold change, if you don't post any code, it's impossible to tell what's happening. But why are you bothering with FPKMs? If you want to know up- or down-regulation, just look at the sign of the log-FC.

For exporting, have a look at ?topTags and the n argument. Setting n=Inf will get you all genes.

Thanks for the comment. Regarding the logFC and FPKM, here is the code, this script https://github.com/trinityrnaseq/trinityrnaseq/blob/master/Analysis/DifferentialExpression/run_DE_analysis.pl, within Trinity package made a DE analysis by edgeR (or other software) easy for the beginners, like me. You're right with looking at the sign of the logFC for identification of up and down-expressed genes. However, during the transcript analysis, I face with several transcripts of a single gene that have different FPKM values, as I read, those transcripts with the higher FPKM are more reliable than others while those with low FPKM may be artifacts. It is useful for our biologists when we try to experimentally study expression analysis in the lab, say by qRT-PCR, ... That's why I'm looking for FPKM values in addition to the sign of logFC.

Regarding topTags, yes, I tried n=Inf, but as you mentioned it returned all genes not just DE genes. Please kindly tell me how to modify the above code to get just DE genes at the FDR cutoff of 0.05?

Thanks

No response, yet. Please any suggestions!

This is the point at which you are usually admonished to read the help page. Actually, Aaron already pointed you to the help page when he said to look at ?topTags.

From ?topTags:

Usage:

Arguments:

object: a 'DGEExact' object (output from 'exactTest') or a 'DGELRT'
object (output from 'glmLRT'), containing the (at least) the
elements 'table': a data frame containing the
log-concentration (i.e. expression level), the log-fold
change in expression between the two groups/conditions and
the p-value for differential expression, for each tag. If it
is a 'DGEExact' object, then 'topTags' will also use the
'comparison' element, which is a vector giving the two
experimental groups/conditions being compared. The object may
contain other elements that are not used by 'topTags'.

n: scalar, number of tags to display/return

p-values for multiple testing, passed on to 'p.adjust'

sort.by: character string, should the top tags be sorted by p-value
('"PValue"'), by absolute log-fold change ('"logFC"'), or not
sorted ('"none"').

p.value: cutoff value for adjusted p-values. Only tags with lower
p-values are listed.

Learning how to figure out the answer yourself is an invaluable skill, if you expect to be doing anything with R/Bioconductor (or any Open Source software, for that matter). There is limited support, and you will find that in the vast majority of cases it is faster and way more useful for you to figure out the answer yourself.