edgeR: AveLogCPM after estimateGLMCommonDisp
2
0
Entering edit mode
David R ▴ 90
@david-rengel-6321
Last seen 4 months ago
European Union

Hi,

I understand that, under edgeR, the dge$AveLogCPM after estimateGLMCommonDisp gives the same result as doing aveLogCPM(data, lib.size = dge$samples$lib.size*dge$samples$norm.factors). Hence my question would be: if the aveLogCPM function implements in this case the effective library size, as opposed to the default colSums(data), why then it is not implemented the dge$common.dispersion as well, instead of the 0.05 default value?

Thanks in advance for your reply,

David Rengel

rnaseq edger • 2.1k views
ADD COMMENT
0
Entering edit mode

Hi Aaron,

Thanks for your reply, but I am not sure thiat is the case cause if I run dge$AveLogCPM, it gives me exactly the same result as aveLogCPM(data, lib.size = dge$samples$lib.size*dge$samples$norm.factors), but different from AveLogCPM(data, lib.size = dge$samples$lib.size*dge$samples$norm.factors, dispersion = dge$common.dispersion).

Am I missing something here?

Thanks again.

ADD REPLY
0
Entering edit mode

Technicalities aside (you can't run dge$AveLogCPM, it's not a function), the DGEList method for estimateGLMCommonDisp first calculates the average abundances with the default dispersion of 0.05 because it doesn't yet have the common dispersion to use in the calculations. These average abundances are only used for subsetting to improve efficiency, and then they get re-calculated with the common dispersion for output. I'm not sure how you managed to get average abundances from estimateGLMCommonDisp that don't use the common dispersion estimate.

Edit: After some thought, it seems that you're probably using an old version of edgeR, in which the S3 method for estimateGLMCommonDisp hadn't been updated to re-calculate the average log-CPM with the common dispersion. So if you update to the latest version of edgeR, all these discrepancies should be resolved.

ADD REPLY
1
Entering edit mode
@gordon-smyth
Last seen 1 hour ago
WEHI, Melbourne, Australia

What you're missing with aveLogCPM() is that you're not passing the correct data object to it.

If you compute

A <- aveLogCPM(dge)

and 'dge' is a DGEList object, then the computation will use both the effective library sizes and the common dispersion. In other words, the function does exactly what you expect.

If however you compute

A <- aveLogCPM(data)

and 'data' is just a matrix of counts, how would you expect the function to know what the common dispersion is?

Your original question though relates to the behaviour of the edgeR dispersion estimation functions such as estimateGLMCommonDisp(). None of the edgeR dispersion estimation functions update the dge$aveLogCPM vector as the dispersions are updated. This is a deliberate decision. Instead we compute the dge$aveLogCPM vector right at the beginning of the pipelines, before the common dispersion has been estimated, and then keep it fixed. It would be just too confusing for this vector to keep changing.

The dge$aveLogCPM vector is used for computing trends, for MD plots and so on. By using the default dispersion=0.05, edgeR is already finessing this measure more than other packages do. In principle we could recompute the aveLogCPM component of the DGEList object every time new dispersions are estimated. We could do that for trended and tagwise dispersions as well as when a new common dispersion is estimated. But the potential gain that might be had from updating the aveLogCPM component to match each new estimated dispersion is too small to justify the extra complication and possible confusion that would arise from continually changing it.

ADD COMMENT
0
Entering edit mode

Hi Gordon, 

Thanks for your reply. I understand what you say about not recalculating AveLogcpm with every subsequent dispersion, and I am happy with it.

However, as for my original question, I still do have my doubts. I agree that running AveLogcpm on the original data cannot use the common dispersion cause it has obviously not been calculated yet, and that is my whole point:

A <- dge.$AveLogCPM)

We agree that in order to have A, I should expect that effective library sizes and common dispersions values would be used.

However, it is seemingly not the case since the very same value A is obtained if I run the line below, in which no common dispersion value is given.

aveLogCPM(data, lib.size = dge$samples$lib.size*dge$samples$norm.factors)

In fact, a different value from A, let’s say  B, is obtained if I implement the common dispersion value. of dge. That is:

aveLogCPM(data, lib.size = dge$samples$lib.size*dge$samples$norm.factors, dispersion = dge$common.dispersion)

Moreover, and I did this just in order to verify that this was the case, the same value B is obtained if I use dge instead of data on the line above.

I thank you in advance for your comments on this. Apart form my daily work,  I would like to provide some training on RNASeq differential analysis using edgeR and I would like to understand as much as possible about  all possible details of it.

Best

David

ADD REPLY
0
Entering edit mode
Aaron Lun ★ 28k
@alun
Last seen 7 hours ago
The city by the bay

Actually, if a common dispersion is available in the DGEList object, then that value will be used by aveLogCPM.DGEList - check out the source code. Admittedly, the documentation could have described this better, but it seemed to be a level of technical detail that most users wouldn't care about.

ADD COMMENT
0
Entering edit mode

Hi,

I am sorry to come back to this. I have had a look to the source codes and I see what you meant, when I go through the .DEGlist source codes, it looks OK to me. Nevertheless, I still see an issue in how the $AveLogCPM values are implemented into the dge object. I've tried to be more specific with the lines here below. I'd be very grateful if you could provide me with any help.

Best,

David

dge <- DGEList(counts=mydata, group=conds)
dge <- calcNormFactors(dge)
design=model.matrix(~0+conds)
dge <- estimateGLMCommonDisp(dge,design)

head(dge$AveLogCPM)
[1]  4.2177176  4.1126002  1.7762333  6.2465661 -0.8546385  6.1809900
head(aveLogCPM(mydata, lib.size = dge$samples$lib.size*dge$samples$norm.factors))
[1]  4.2177176  4.1126002  1.7762333  6.2465661 -0.8546385  6.1809900

head(aveLogCPM(mydata, lib.size = dge$samples$lib.size*dge$samples$norm.factors, dispersion = dge$common.dispersion))
[1]  4.1910219  4.1143934  1.7938440  6.2459912 -0.8398819  6.1842129
head(aveLogCPM(dge, lib.size = dge$samples$lib.size*dge$samples$norm.factors))
[1]  4.1910219  4.1143934  1.7938440  6.2459912 -0.8398819  6.1842129
head(aveLogCPM(dge, lib.size = dge$samples$lib.size*dge$samples$norm.factors, dispersion = dge$common.dispersion))
[1]  4.1910219  4.1143934  1.7938440  6.2459912 -0.8398819  6.1842129
ADD REPLY
0
Entering edit mode

Are you sure you're running the latest version of edgeR? For me, if I mock up something like this:

mydata <- matrix(rnbinom(1000, mu=100, size=10), ncol=10)
conds <- rep(LETTERS[1:2], each=5)

... and then run your code:

library(edgeR)
dge <- DGEList(counts=mydata, group=conds)
dge <- calcNormFactors(dge)
design=model.matrix(~0+conds)
dge <- estimateGLMCommonDisp(dge,design)

... and then have a look at the aveLogCPM output:

head(dge$AveLogCPM)
# [1] 13.35735 13.24709 13.16178 13.34324 13.38505 13.44232
head(aveLogCPM(mydata, lib.size = dge$samples$lib.size*dge$samples$norm.factors))
# [1] 13.35707 13.24721 13.16139 13.34278 13.38515 13.44159
head(aveLogCPM(mydata, lib.size = dge$samples$lib.size*dge$samples$norm.factors,
               dispersion = dge$common.dispersion))
# [1] 13.35735 13.24709 13.16178 13.34324 13.38505 13.44232

So you can see that dge$AveLogCPM is, in fact, calculated using the common dispersion. Anyway, session info:

R version 3.3.1 Patched (2016-08-05 r71041)
Platform: x86_64-pc-linux-gnu (64-bit)
Running under: Ubuntu 14.04.5 LTS

locale:
 [1] LC_CTYPE=en_GB.UTF-8       LC_NUMERIC=C              
 [3] LC_TIME=en_GB.UTF-8        LC_COLLATE=en_GB.UTF-8    
 [5] LC_MONETARY=en_GB.UTF-8    LC_MESSAGES=en_GB.UTF-8   
 [7] LC_PAPER=en_GB.UTF-8       LC_NAME=C                 
 [9] LC_ADDRESS=C               LC_TELEPHONE=C            
[11] LC_MEASUREMENT=en_GB.UTF-8 LC_IDENTIFICATION=C       

attached base packages:
[1] stats     graphics  grDevices utils     datasets  methods   base     

other attached packages:
[1] edgeR_3.14.0  limma_3.28.21
ADD REPLY
0
Entering edit mode

Hi Aaron

Thanks for your reply. I run update.packages last Thursday, hence it should be fine.. I'll have a look when I get back to my desk and my R session and I'll let you know. Thanks again.

David

ADD REPLY
0
Entering edit mode

biocLite() should be used for version management, see http://bioconductor.org/install/#why-biocLite.

ADD REPLY
0
Entering edit mode

Hi Aaron,

You were so right. I had a configuration problem with my new computer. Inherited packages from my old laptop, including edgeR, were in one place but updates were being installed somewhere else. Hence, even if I was updating packages, RStudio kept loading the old versions. I've sorted it out and now everything looks fine. I thank you and I apologize of rthe unnecessary hassle.

Best,

David

ADD REPLY

Login before adding your answer.

Traffic: 510 users visited in the last hour
Help About
FAQ
Access RSS
API
Stats

Use of this site constitutes acceptance of our User Agreement and Privacy Policy.

Powered by the version 2.3.6