Search
Question: Why the log fold change of some genes returned by edgeR is not compatible with their related FPKM value?
0
2.2 years ago by
Sara0
Sara0 wrote:

Hi all experts,

I have used Trinity software for doing de novo transcriptome assembly. Then, I used bowtie and RSEM within Trinity package for read mapping and counting. Finally, I got the raw count and FPKM value. I used raw count for doing differential expression analysis by edgeR software, say, I compared A-B library. I found that among DEG genes reported by edgeR software, some few genes with positive log fold change, meaning that the higher expression in A library in relative to B, have the lower FPKM value in A as compared with its FPKM value in B library. For instance, please consider the below example

logFC     logCPM                LR           PValue FDR        FPKM (B)             FPKM(A)

c50354_g2_i5     2.825768038       5.659882391       41.70680491       1.06039E-10        6.64373E-07        118.187   18.093

As you could see the FPKM value of c50354_g2_i5 transcript are in 18.093 and 118.187 in the library B and A, respectively. But during the comparison A-B library, the log fold change is positive. Could you please explain to me what happened for such genes?

modified 2.2 years ago by Gordon Smyth35k • written 2.2 years ago by Sara0

The first thing that comes to my mind is that you should probably include the code you used to obtain these values. And the output of sessionInfo().

If you plot edgeR's log fold change esimates vs. your manual FPKM log ratios, are they for the most part concordant with a few outliers, or most of them not concordant, at all?

2
2.2 years ago by
Gordon Smyth35k
Walter and Eliza Hall Institute of Medical Research, Melbourne, Australia
Gordon Smyth35k wrote:

If you use edgeR correctly, and compute FPKM correctly, then the logFC and the FPKM will agree.

If you want more help, please give the code by which you computed the logFC and the FPKM. We can't tell you what happened without that information. Perhaps for example you may have simply computed the logFC for B vs A instead of A vs B, and that would agree perfectly with the FPKM values you give.

Later:

It turns out from your later post that the logFC you give is not a simple logFC between TT6 and TS6 but has been adjusted for baseline control values. Furthermore the controls are different for TT and TS. When you looked at the FPKM, you have forgotten to look at and correct for the control values (CT and CS) for this gene.

Thank you very much for the comment. Actually, I have done DE analysis with your great help, Gordon, here. I have 5 libraries (unfortunately without replicate) as below:

Control-Sensitive cell line (CS)

Treatment-Sensitive cell line after 6 hours (TS6)

Control-Tolerant cell line (CT)

Treatment-Tolerant cell line after 6 hours (TT6)

Treatment-Tolerant cell line after 12 hours (TT12)

The following codes were used for DE analysis:

targets <- read.delim("targets3.txt", stringsAsFactors=FALSE)
targets
Label Treatment  line time
1    CS   control sensitive    0
2   TS6      drug sensitive    6
3    CT   control  tolerant    0
4   TT6      drug  tolerant    6
5  TT12      drug  tolerant   12

Group <- factor(c("CS","TS6","CT","TT6","TT12"))
design <- model.matrix(~0+Group)
colnames(design) <- levels(Group)
Group2 <- factor(c(1,2,1,2,3))
design2 <- model.matrix(~Group2)

colnames(expressioncount) <- substring(colnames(expressioncount),1,6)
dim(expressioncount)
[1] 144103     6

length CS TS6 CT TT6 TT12
c51581_g1_i1    462  0   2  1   1    0
c19380_g1_i1    439  3   0  2   0    2
c4832_g1_i1     665  0   2  0   7   23
c25894_g1_i1    402  0   1  1   1    3
c37752_g4_i1    736  2   4  2   3    2
c557_g1_i1      434  7   0  4   0    0

y <- DGEList(expressioncount[,2:6], group=Group)
> keep <- rowSums(cpm(y) > 1) >= 1
> y <- calcNormFactors(y)
> y$samples group lib.size norm.factors CS 1 20146232 1.1727230 TS6 1 19308652 0.9188417 CT 1 19792940 1.2271948 TT6 1 19051509 0.9463686 TT12 1 17752012 0.7990795 table(keep) keep FALSE TRUE 68016 76087 y <- estimateGLMCommonDisp(y, design2, verbose=TRUE) Disp = 0.05004 , BCV = 0.2237 dispersion=bcv^2 y <- estimateGLMTrendedDisp(y, design2) fit <- glmFit(y,design) cm <- makeContrasts(TS6 = TS6-CS, TT6 = TT6-CT, TT12 = TT12-CT, TT6.TS6 = (TT6-CT) - (TS6-CS), levels=design) lrt3 <- glmLRT(fit, contrast=cm[,"TT6.TS6"]) is.de <- decideTestsDGE(lrt3) summary(is.de) [,1] -1 60 0 143843 1 200 I have computed FPKM value with the below code: RPKM <- rpkm(y, gene.length=expressioncount$length)

When I checked DE genes (TT6.TS6) the logFC and FPKM value agreed together (positive logFC was equal to higher FPKM value in TT6 than TS6 and vice versa for negative logFC), but some interesting genes with positive logFC had the higher FPKM value in TS6 than TT6!! Please consider the below example:

geneid    logFC    logCPM    LR    PValue    FDR    FPKM_TS6    FPKM_TT6
c41775_g1_i2    2.68363364    5.682588148    25.68892732    4.01122E-07    0.000704913    81.928    33.113
c50354_g2_i5    2.825768038    5.659882391    41.70680491    1.06039E-10    6.64373E-07    118.187    18.093

Please kindly explain to me what happened for such genes. I'm very concerned about the rest results

3

The contrast you are calling TT6.TS6 isn't a comparison between those two samples. It's the interaction term between the tolerant and sensitive samples, so you can't figure out if the logFC makes sense just by looking at the TT6 and TS6 samples! Note that the contrasts are just algebra, and mean exactly what you would think. So when you do

TT6.TS6 = (TT6-CT) - (TS6-CS)

It literally means that you want to make that comparison. So if you want to compare the FPKM changes to the logFC from the model, you have to do the same thing with the FPKM values (e.g., compute TT6 - CT - TS6 + CS using the FPKM samples) and just looking at TT6 - TS6 isn't the same thing.