Question

Correct input data for avereps function (for averaging multiple probes for same gene)

0

Entering edit mode

Naren C S • 0

@naren-c-s-23517

Last seen 4.4 years ago

One of the ways to deal with the presence of multiple probes for the same gene in a microarray is to take the average of the expression value of the different probes. This would provide a single representative expression value for that gene.

The avereps function in limma can be used for this purpose. The Note section for the avereps function from the limma reference manual states the following (emphasis added by me):

This function should only be applied to normalized log-expression values, and not to raw unlogged expression values. It will generate an error message if applied to RGList or EListRaw objects.

Would it not be more accurate to take the mean of normalized base10-expression values (and then convert to log2) rather than take the mean of the normalized log2-expression values as suggested by the note?

limma • 784 views

ADD COMMENT • link updated 4.4 years ago by James W. MacDonald 67k • written 4.4 years ago by Naren C S • 0

score 1 · Answer 1 · 2020-05-13

1

Entering edit mode

James W. MacDonald 67k

@james-w-macdonald-5106

Last seen 1 hour ago

United States

As the more mathy types tend to say, logs are the same, up to a multiplicative constant. So it doesn't matter, except for interpretation. As an example:

## just get some data that needs to be log converted
> library(Biobase)
> data(sample.ExpressionSet)
> exprs(sample.ExpressionSet)[1,]
       A        B        C        D        E        F        G        H 
192.7420  85.7533 176.7570 135.5750  64.4939  76.3569 160.5050  65.9631 
       I        J        K        L        M        N        O        P 
 56.9039 135.6080  63.4432  78.2126  83.0943  89.3372  91.0615  95.9377 
       Q        R        S        T        U        V        W        X 
179.8450 152.4670 180.8340  85.4146 157.9890 146.8000  93.8829 103.8550 
       Y        Z 
 64.4340 175.6150 
> mean(log2(exprs(sample.ExpressionSet)[1,]))
[1] 6.739028
> mean(log10(exprs(sample.ExpressionSet)[1,]))/log10(2)
[1] 6.739028

The convention of using log base 2 is because a two-fold difference was originally (in the dark ages of like 2001 or whatever) considered to be a 'biologically meaningful' difference, so a change of 1 or -1 is easily interpretable in that context. But otherwise I am not sure it matters at all what base you use.

ADD COMMENT • link 4.4 years ago James W. MacDonald 67k

0

Entering edit mode

I was asking more about the difference between using normalized non-log values vs normalized log values. In the original post, by base10, I meant non-log (sorry for the confusion about that).

To illustrate with the example that you have shown, say we have two matrices e1 and e2, the first of which is non-log and the latter which is in log2

library(Biobase)
library(GEOquery) # This is used to get the GPL file to map probe ids to entrez ids
library(limma)
data(sample.ExpressionSet)
e1 <- exprs(sample.ExpressionSet)
e2 <- log2(exprs(sample.ExpressionSet))

# This block gets the entrez ids that the probes map to
map_df   <- Table(getGEO("GPL8300")) # GPL8300 is the hgu95av2 array
gene_ids <- map_df$ENTREZ_GENE_ID[match(rownames(e1), map_df$ID)]

a1 <- avereps(e1, gene_ids)
a2 <- avereps(e2, gene_ids)

# Taking an example of the gene '6772', which has multiple probesets
head(log2(a1["6772",]))
       A        B        C        D        E        F 
8.037998 9.729506 8.071669 7.575657 8.356090 7.883768 

# a2 is already in log2 form
head(a2["6772",])
       A        B        C        D        E        F 
7.805825 9.432523 7.787864 7.268751 7.816254 7.504350

According to the note in the limma reference value, it expects e2 as the input because it states:

This function should only be applied to normalized log-expression values, and not to raw unlogged expression values. It will generate an error message if applied to RGList or EListRaw objects.

But wouldn't it make more sense to give e1 as the input and then log2 it after the averaging?

ADD REPLY • link 4.4 years ago Naren C S • 0

2

Entering edit mode

No, it doesn't make sense to do that. Raw microarray intensity values have a right skew, and since the mean is not robust to outliers you want to take logs first to minimize that.

Anyway, for Affy data the convention is to take logs using the probe values, normalize, then use medianpolish to get the probeset expression values (that's the RMA algorithm). So you should normally be starting with normalized, log base 2 values. You would have to do something inadvisable to have unlogged expression values in the first place (the sample.ExpressionSet data are, IIRC MAS5.0 summarized data, and nobody has used MAS5.0 since maybe 2002 when Rafa showed it was bunk).