Question

How do I use Limma to produce expression for each group, their P value and adjusted P value, together with affy probe_id and gene symbol

1

Entering edit mode

colonppg ▴ 30

@colonppg-7771

Last seen 8.9 years ago

United States

Dear All:

I am not sure if anyone has asked similar questions before, however, my reading of the limma package documents and internet discussions have not been fruitful, essentially

If I have an Affymetrix dataset, with 3 categories: A, B, C

what I did was:

f<-factor(disease, levels=c("A", "B", "C"))

design <- model.matrix(~0 + f)

colnames(design) <- c("A", "B", "C")

contrast.matrix<-makeContrasts(c.vs.a=C-A, b.vs.a=B-A, c.vs.b=C-B, levels=design)
fit2<-contrasts.fit(fit, contrast.matrix)
fit2.eBayes<-eBayes(fit2)

Let me know if I am doing the right thing here...I understand I can either use write.fit or topTable after that...

however, it seems the exact final format of result just evades me......

what I need, is for group C compared to group A, with the following information for all genes:

gene_id, average_groupC, average_groupA, fold_change, P value, adjusted_pvalue

The same goes for C vs B, B vs A comparison... I really could not care less about coefficient......guess I am not a statistician :) -- do let me know what coefficient means when I compare two groups...

--not sure if this capacity is already built in... or that I can extract bits, pieces of info and combine them, however, I do not see such thing in either topTable or write.fit output... the AveExpr is for across all data which to me is useless...

This format of result is very handy when I need to submit it to IPA or GeneGO analysis...... and I believe it is should be the default output for the package for convenience of the many biologists

Thanks!

LIMMA • 5.4k views

ADD COMMENT • link updated 10.7 years ago by James W. MacDonald 68k • written 10.7 years ago by colonppg ▴ 30

score 2 · Answer 1 · 2015-05-19

You're almost there. To get statistics for the comparison between, say, C and A, all you have to do is:

topTable(fit2.eBayes, coef="c.vs.a", n=Inf, sort.by="none")

This will give you the log-fold change, the p-value and the FDR-adjusted counterpart for each gene. Same goes for the comparisons between C and B or B and A, using coef="c.vs.b" or coef="b.vs.a", respectively.

If you want the average of each group, then that is easily extracted for each gene as fit$coefficients. Each coefficient of your design matrix represents the average expression of a corresponding group, so you can directly interpret the coefficients like that.

Another equivalent approach is to apply the avearrays function on the expression values. Let y be whatever you put into lmFit. To get the average for each disease factor, you can do:

avearrays(y, ID=f)

One of the reasons that such information is not routinely reported is because there is no guarantee that the design is formulated in a one-way layout. For example, the presence of blocking factors may render the concept of a "group" meaningless in some studies.

score 1 · Answer 2 · 2015-05-20

I didn't say average value was irrelevant. I said it is uninterpretable. It is only interpretable in comparison to the expression value of the same gene in a different sample, which is the fold change. In other words, the fold change is interpretable, but you cannot say the same for the average expression value, because it is an unknown combination of starting mRNA concentration, GC content, and probably a myriad of other factors.

You can also argue that some value X is 'expressed' and some value Y is not expressed, except I have presented you with data showing that given the GC content of the probes, you can get values from 1 all the way up to 7 when you know for a fact that there is no expression at all! So yes, you could argue that in general a value of 5 usually means it isn't expressed. And I would say, yeah, probably, unless the probes have high AT content, in which case 5 may be highly expressed, or if the probes have high GC content, in which case there may not be any expression at all.... So you are arguing that the expression values 'mean something' and I am saying that it is a bit more complicated than that.

As for your question, you didn't build two models. You built one model and then computed two sets of contrasts. The first set of contrasts just computes the various possible comparisons (e.g., you compute the fold changes between all possible pairs). In the second set of contrasts you do the same as the first, but you also compute three things that are not technically contrasts (A, B, C), which are just returning the mean expression for each group.

When you compute a contrast, you are in essence computing a t-statistic where the numerator is some value (usually the difference in means between two groups, but sometimes a slope), and the denominator is some measure of variability, used like a yardstick to say if the numerator is 'big' or not. The goal is to see if the numerator is arguably different from zero. In the case of the first set of contrasts, this makes sense. If the numerator is distinguishable from zero, then you have evidence that there is a difference in expression between the two sample groups.

But for the (A, B, C) contrasts, the numerator is the mean expression for that sample type, which in ALL cases will be much larger than zero. So when you compute the t-statistic, the resulting p-value will be exceedingly small for all genes, indicating that you have good evidence that the average expression value is different from zero. Which is a trivial, and uninteresting thing to test.

score 0 · Answer 3 · 2015-05-20

hi, Aaron:

Thanks for your inputs, truly helpful... I tried to get coefficient by

coef <- fit2.eBayes$coefficients

            Contrasts
                    s.vs.ns      c.vs.ns       c.vs.s
  1007_s_at    -0.112430484 -0.025126351  0.087304133
  1053_at       0.041222381  0.108865910  0.067643529
  117_at       -0.255047744 -0.005920597  0.249127147
  121_at       -0.117165262 -0.151033793 -0.033868531
  1255_g_at     0.024507816 -0.052362706 -0.076870522

so it is not the average expression value I intended to get... as an alternative... from posts I read before, can I construct contrast as:

contrast.matrix<-makeContrasts(A, B, C, c.vs.a=C-A, b.vs.a=B-A, c.vs.b=C-B, levels=design)

I guess that will trick the program to generate coefficient for A, B, C alone... I tested it and it is the same as result I got using rowMeans

I did check for results, using two different model, for c.vs.a,

I got the identical results...... but only for the logFC, the P value are all way off reality in the model with A, B, C built in

Any one can chime in for this observation and behavior of LIMMA?

Thanks

score 0 · Answer 4 · 2015-05-20

I think you may have misunderstood what Aaron was telling you. If you fit a model the way you have, there are three steps:

fit <- lmFit(eset, design)
fit2 <- contrasts.fit(fit, contrast)
fit2 <- eBayes(fit2)

In the first step you are computing the means of each group (given the way you specify your design matrix). The next two steps compute the contrasts and then make comparisons using the empirical Bayes adjusted statistics. So when Aaron was saying the means are in fit$coefficients, he was talking about the fit object, not the fit2 object.

I should also caution you about the usefulness of the mean expression values. These values are not really useful by themselves, as they are some unknown combination of mRNA abundance and other technical artifacts. The only interpretable quantity here is the fold change between sample groups. In other words, you have to consider where these data come from. The expression values are highly processed data that started out as fluorescence intensities from cDNA that hybridized to 25-mers on a silicon wafer. While some component of that signal is correlated to the starting amount of mRNA, some unknown (and depending on lots of factors, possibly large) component of the signal has nothing whatsoever to do with the starting amount of mRNA.

As an example, here is a plot of the expression levels for the antigenomic probes from the Mouse 2.0 ST array. Note that these are probes that are complementary to sequences that do not exist in mice, so any binding is due to cross-hybridization of non-complementary cDNA fragments. These probes range from all AT to pure GC, so the probes to the left have only A and T nucleotides, and those on the right are all G and C nucleotides, and those in the middle have some combination of the four.

So given the GC content of a particular probe, some non-negligible amount of binding is due to cross-hybridization with unrelated sequences. And since the expression values are based on a set of 25-mers that are likely to have varying amounts of GC content, it would be very difficult to say how much of the expression value you get is signal and how much is noise.

score 0 · Answer 5 · 2015-05-20

Thanks, James... I think I still need to get deeper understanding of limma and how I should harness its power in the right way... so far I am impressed with its speed and convenience, I can see once I master it, it will make my life a lot easier... kudos to the builders of limma!

But I have to argue against your claim average value for a group is irrelevant, given we will be comparing the SAME probesets across different biological conditions, so even if GC content does affect the APPARENT gene expression value, it does enable us to compare across different conditions for the same gene -- just as you said, what matters is the P value and its fold change...

My experience with RMA normalized microarry data, with many of them publicly available on GEO... generally speaking, anything under than 5 are mostly noise, and the platform is saturated when value reaches around 11 or 12... based on know biology and comparing to NGS data... and this is why I want to know the average expression of a probeset in a particular biological condition or group

By the way, could you explain why the two models I build:

contrast.matrix<-makeContrasts(c.vs.a=C-A, b.vs.a=B-A, c.vs.b=C-B, levels=design)

or

contrast.matrix<-makeContrasts(A, B, C, c.vs.a=C-A, b.vs.a=B-A, c.vs.b=C-B, levels=design)

generate the same fold change, but the first one seem to have the right p value but the second one has crazy p values ? (all gene are taken as extremely significant in the second model)

And could limma be changed to generate average expression value by default for different categories?