Question

(not) Understanding EdgeR coefficients

0

Entering edit mode

cpcantalapiedra • 0

@cpcantalapiedra-7532

Last seen 9.0 years ago

Spain

This is a head of my dataset, with expected counts from RSEM, with 3 biological replicates of 2 samples (control and treatment), as can be seen in the group variable below.

> head(rsem, 10)
                   X4_7 X7_9 X8_13 X7_10 X4_27 X5_25
comp0_c0_seq1         0    0     0     0     0     0
comp100001_c0_seq1    0    1     3     2     1     1
comp100006_c0_seq1    2    0     0     0     0     2
comp100009_c0_seq1    0    1     0     0     0     0
comp100010_c0_seq1    0    0     0     0     0     0
comp100012_c0_seq1    0    3     0     3     1     0
comp100013_c0_seq1    0    0     0     3     3     6
comp100017_c0_seq1    0    0     0     0     0     0
comp100018_c0_seq1    2    0     2     3     5     4
comp100019_c0_seq1    0    0     0     3     0     0

> group
[1] CS CS CS TS TS TS
Levels: CS TS

Contrasts:

      Contrasts
Levels treatment
    CS        -1
    TS         1

After filtering and running edgeR I obtain the results table:

> head(glm$glm_lrt$table, 10)
                      logFC    logCPM        LR       PValue
comp0_c0_seq1      0.000000 -4.010505  0.000000 1.0000000000
comp100010_c0_seq1 0.000000 -4.010505  0.000000 1.0000000000
comp100013_c0_seq1 4.891693 -3.025308 12.954572 0.0003191411
comp100017_c0_seq1 0.000000 -4.010505  0.000000 1.0000000000
comp100025_c0_seq1 0.000000 -4.010505  0.000000 1.0000000000
comp100035_c0_seq1 0.000000 -4.010505  0.000000 1.0000000000
comp100037_c0_seq1 0.000000 -4.010505  0.000000 1.0000000000
comp100038_c0_seq1 0.000000 -4.010505  0.000000 1.0000000000
comp100039_c0_seq1 0.000000 -4.010505  0.000000 1.0000000000
comp10003_c0_seq1  1.186745 -1.847143  2.686438 0.1012058293

These are the respective coefficients (I guess):

> head(glm$glm_lrt$coefficients, 10)
                          CS        TS
comp0_c0_seq1      -19.36797 -19.36797
comp100010_c0_seq1 -19.36797 -19.36797
comp100013_c0_seq1 -19.36797 -15.97731
comp100017_c0_seq1 -19.36797 -19.36797
comp100025_c0_seq1 -19.36797 -19.36797
comp100035_c0_seq1 -19.36797 -19.36797
comp100037_c0_seq1 -19.36797 -19.36797
comp100038_c0_seq1 -19.36797 -19.36797
comp100039_c0_seq1 -19.36797 -19.36797
comp10003_c0_seq1  -15.84720 -15.02462

> hist(glm$glm_lrt$coefficients)

coefficients histogram

Is it normal to have only negative values?

I guess these are the b(g) coefficients from:

"Here xi is a vector of covariates that specifies the treatment conditions applied to RNA sample i, and βg is a vector of regression coefficients by which the covariate effects are mediated for gene g." from http://www.ncbi.nlm.nih.gov/pmc/articles/PMC3378882/

In the edgeR code (glmfit.R):

logFC <- drop((glmfit$coefficients %*% contrast)/log(2))

Where I guess the glmfit$coefficients are the beta(g) above and the "contrast" variable is the x(i) vector.

Is this the same formula? If not, what is the difference? If so, why is log(mu(gi)) = logFC?

From http://www.ncbi.nlm.nih.gov/pmc/articles/PMC3378882/ again:

"We denote the total number of mapped reads in library i by Ni and the number that map to the g-th gene by ygi."

Therefore, I would understand that

logFC = log(mu(gi) - mu(gj)) ~ x(i)*beta(i)+x(j)*beta(j)

for samples i, j. Is that correct?

Thank you in advance

edger • 4.1k views

ADD COMMENT • link updated 9.1 years ago by Gordon Smyth 50k • written 9.1 years ago by cpcantalapiedra • 0

0

Entering edit mode

I can probably guess, but how did you parametrize your design matrix? Also, what's your call to glmLRT?

ADD REPLY • link 9.1 years ago Aaron Lun ★ 28k

0

Entering edit mode

design <- model.matrix(~0+group)

colnames(design) <- levels(glm$samples$group)

my.constrasts <- makeContrasts(treatment = TS-CS, levels=design)

glm$glm_fit <- function(glm, design){

glm <- estimateGLMCommonDisp(glm, design)

glm <- estimateGLMTrendedDisp(glm, design)
glm <- estimateGLMTagwiseDisp(glm, design)

return(glm)

}

glm$glm_lrt <- glmLRT(glm$glm_fit, contrast=my.constrasts[,"treatment"])

ADD REPLY • link 9.1 years ago cpcantalapiedra • 0

score 2 · Answer 1 · 2015-03-30

2

Entering edit mode

Aaron Lun ★ 28k

@alun

Last seen 1 minute ago

The city by the bay

For your first question: the negative values make sense. Remember that your library sizes are going to be huge relative to the counts for each gene. This results in a big offset (i.e., log(N(i)) term in the first equation of your original post). In your design, each coefficient seems to represent a group, so the expression of each sample is equal to the appropriate coefficient plus the offset. If the offset is large and positive, then the coefficient must be large and negative to get a sensible log-fitted value for that sample.

For your second question: I don't know how you concluded that log(mu(gi)) = logFC from the code in glmfit.R. If you have a contrast of t(c(-1, 1)), then the log-fold change should be equal to the differences between the TS and CS coefficients. For example, if you have coefficients of -10 and -11 for CS and TS respectively, then the log-fold change for that gene should be -10*-1 + -11*1 = -1.

For your third question: no, that's not right. The log-fold change is the difference in the estimated coefficients for your two groups, not the log-transformed difference in means. The latter doesn't give a fold change, nor does it account for differences in library size between the samples in the two groups.

More generally, I'd advise filtering to get rid of very low counts - or, at least, the genes that have all-zero rows. These provide no information for dispersion estimation and DE testing, and are contributing to the dominance of zero log-fold changes in your result table.

---

Edit: the log-FC is reported as base 2, not base e, so the example should actually have a log-FC of -1/log(2).

ADD COMMENT • link 9.1 years ago Aaron Lun ★ 28k

0

Entering edit mode

1st question. It is nice to know! 2nd question. I concluded that only by comparison of the equation in the paper and the equation in the code, with the right side of equations looking like they are the same but having in the left side log(mu(gi)) in the paper and logFC in the code. In fact, what you say "then the log-fold change for that gene should be -10*-1 + -11*1 = -1" is what I understood and what the code says. But in the paper says log(mu(gi)) = x(i) * b(g) + logNi, and that is what I dont understand. 3rd question: my mistake it is the attempt to reconcile the equations from question 2. Could I interpret the coefficients as an estimate of the mean in a NB distribution correcting for dispersion and for library size? I guess the problem is my lack of understanding of statistics, but I would have liked a definition of FC in your paper maybe.

About filtering, I am about to post another question, because we have applied a different kind of filter based on what qRT-PCR data seemed to tell us in comparison to RNAseq FCs, and I would love to discuss them and to have help interpreting them.

I am glad you are answering so nice and fast. Thank you very much.

edit: the post about filters Different filters with edgeR

ADD REPLY • link 9.1 years ago cpcantalapiedra • 0

0

Entering edit mode

Ah sorry, another question about the 1st one :) Is it wrong if I add +20 to all the coefficients to have them in positive scale. And, if I would like to use the coefficients for comparisons with samples in other analyses a) are they comparable among different analyses? b) should I transform then again to natural scale to avoid log differences? (e.g.: a gene changes from 16 to 18 in 2 samples, different to another one with coefficients 6 and 8 in the same samples).

ADD REPLY • link 9.1 years ago cpcantalapiedra • 0

1

Entering edit mode

I would be very cautious about comparing coefficients - and more generally, gene expression - between samples from two independent data sets. Any comparison will probably be confounded by batch effects (e.g., different cell culture methods, sequencing protocols) such that the differences are unlikely to have any biological meaning.

A safer approach would be to compare DE log-fold changes between analyses. The idea is that any batch effects should cancel out for a DE comparison between samples in one data set, such that the log-fold changes can be meaningfully compared between data sets.

ADD REPLY • link 9.1 years ago Aaron Lun ★ 28k

score 0 · Answer 2 · 2015-03-30

0

Entering edit mode

Gordon Smyth 50k

@gordon-smyth

Last seen 8 hours ago

WEHI, Melbourne, Australia

edgeR gives you a nice topTable summary of results from your analysis. All the columns in the topTable are easily explained and easily understood. I don't quite understand why you aren't using the topTable and are instead trying to recompute the logFC for yourself. There is no need to understand the code to use the package, nor is there any reason for you to extract coefficients or other quantities from the fit object yourself.

The formulas that you quote from the McCarthy et al (2012) paper explain how the generalized linear model is defined and fitted. All the quantities in the formulas translate directly in the corresponding quantities in edgeR. The beta are the coefficients, the x_i are the rows of the design matrix, the N are the library sizes and the mu are the fitted.values.

The mathematical formulas describe how the linear model is fit, and the edgeR function which fits the glms is glmFit(). But the code you quote is instead from glmLRT, a different function, so the code and the mathematical formula are not referring to the same thing. In fact the contrast is not a quantity that appears in any of the formulas you quote.

You have misquoted the first formula from the paper in your first reply to Aaron. You seem to have overlooked the fact that x_i and beta_j are vectors in the formula.

If you really want to understand the coefficients in linear models, then I suggest that you trying redoing you analysis using the design matrix approach of Section 3.2.5 of the edgeR User's Guide. You won't need any contrast in that approach. I think you might find it interesting see how the coefficients change in this approach but the logFC do not.

ADD COMMENT • link 9.1 years ago Gordon Smyth 50k

0

Entering edit mode

"edgeR gives you a nice topTable summary of results from your analysis. All the columns in the topTable are easily explained and easily understood. I don't quite understand why you aren't using the topTable and are instead trying to recompute the logFC for yourself."

Apart from the FDR column I do not see any difference with the glm_lrt$table. I am not trying to recalculate logFC.

"There is no need to understand the code to use the package, nor is there any reason for you to extract coefficients or other quantities from the fit object yourself."

We want to cluster genes by logFC. But we have different experiments so we thought that using the coefficients instead of the logFC would yield better results. For example:

GeneA_ExperimentA_logFC = 2

GeneA_ExperimentB_logFC = 4

We have 2 values to compare with other genes for the clustering. However, we thought that by using coefficients we have 4 values (in the example, if we had 2 conditions), and moreover, the relative value of the coefficient within an experiment would give a different kind of information. Lets say:

GeneA_ExperimentA_logFC = 2 --> coefficients -20, -18

GeneA_ExperimentB_logFC = 4 --> coefficients -20, -16

GeneB_ExperimentA_logFC = 2 --> coefficients -18, -16

GeneB_ExperimentB_logFC = 4 --> coefficients -18, -14

We thought that maybe is better to try to cluster {-20, -18, -20, -16} with {-18, -16, -18, -14} than just {2, 4} with {2, 4}. But of course we might be completely wrong and that is what I am trying to know.

edit: note that the control samples are the same for both contrasts but the coefficients can slightly change due to the separate processing of experiments, including filtering. I have edited the coefficients values accordingly.

"But the code you quote is instead from glmLRT, a different function, so the code and the mathematical formula are not referring to the same thing."

UUmm ok. It is obvious that I must be mixing up things. I should study more statistics I am afraid...

"In fact the contrast is not a quantity that appears in any of the formula you quote."

What do you mean? I have quoted the next formula in which a variable "contrast" appears.

logFC <- drop((glmfit$coefficients %*% contrast)/log(2))

"You have misquoted the first formula from the paper in your first reply to Aaron. You seem to have overlooked the fact that x_i and beta_j are vectors in the formula."

Do you mean this?
log(mu(gi)) = x(i) * b(g) + logNi

Because I was just trying to reproduce:

which is in the paper, "GLMs" section.

I understand that they are vectors. Both from the paper:

"Here xi is a vector of covariates that specifies the treatment conditions applied to RNA sample i, and βg is a vector of regression coefficients by which the covariate effects are mediated for gene g."

from Aaron nice examples:

"then the log-fold change for that gene should be -10*-1 + -11*1 = -1"

And from the code:

glmfit$coefficients %*% contrast

https://stat.ethz.ch/R-manual/R-devel/library/base/html/matmult.html

In fact I tried above to decompose the vector multiplication to try to understand the difference between the log(mu) and the logFC:

logFC = log(mu(gi) - mu(gj)) ~ x(i)*beta(i)+x(j)*beta(j)

which Aaron already explained is a mistake.

But maybe, as you think I have overlooked that fact, you are trying to give me some cue about what (of many things) I do not understand, which is not exactly that (I guess). Maybe it is just that I use "*" instead of the R operator "%*%" in the quote.

"If you really want to understand the coefficients in linear models, then I suggest that you trying redoing you analysis using the design matrix approach of Section 3.2.5 of the edgeR User's Guide. You won't need any contrast in that approach. I think you might find it interesting see how the coefficients change in this approach but the logFC do not."

Yes, thank you:

CS as intercept:

                          CS         TS
comp100001_c0_seq1 -19.17876  0.0000000
comp100006_c0_seq1 -16.08734 -0.1054512
comp100009_c0_seq1 -19.17876  3.0064653
comp100010_c0_seq1 -19.17876  0.0000000
comp100013_c0_seq1 -19.17876  3.1984217
comp100017_c0_seq1 -14.92306 -2.0847657
comp100020_c0_seq1 -19.17876  3.5584521

E = 1*TS + CS = -16.19279

E = 0*TS + CS = -16.08734

CS as factor:

                          CS        TS
comp100001_c0_seq1 -19.17876 -19.17876
comp100006_c0_seq1 -16.08734 -16.19279
comp100009_c0_seq1 -19.17876 -16.17229
comp100010_c0_seq1 -19.17876 -19.17876
comp100013_c0_seq1 -19.17876 -15.98034
comp100017_c0_seq1 -14.92306 -17.00782
comp100020_c0_seq1 -19.17876 -15.62031

E = 1*TS + 0*CS = -16.19279

E = 0*TS + 1*CS = -16.08734

By the way, the examples in page 26 should be changed: glmLRT(fit, coef=2) instead of glmLRT(y, coef=2) (page 26 of http://www.bioconductor.org/packages/release/bioc/vignettes/edgeR/inst/doc/edgeRUsersGuide.pdf). Or maybe y <- glmFit(y, design) instead of fit <- glmFit(y, design)?

> fit <- glmFit(y, design)
> lrt <- glmLRT(y, coef=2)

The same with the next line:

> lrt <- glmLRT(y, coef=3)

I think is OK in page 27.

ADD REPLY • link 9.1 years ago cpcantalapiedra • 0