Question

rounding issues using the VSN package

0

Entering edit mode

aurelien ▴ 10

@aurelien-8289

Last seen 8.3 years ago

Luxembourg

Dear all,

I am using the R package VSN for normalising proteomics data (abundances) that then goes in limma for contrast analysis. I know that VSN was not designed to work with proteomics data but actually vsn2 is doing a great job and clearly stabilised the variance, see below

The issue I have, if that the results are slightly different depending on which platform I am running the R code.

Example MacOS

        logFC  AveExpr         t      P.Value   adj.P.Val           B

1   1.5389102 12.82511  7.730600 2.590139e-06 0.002297454  4.74513335
2   1.7838522 17.67135  5.358662 1.149218e-04 0.047924259  1.44324118
3   1.0633257 13.62943  5.164849 1.620888e-04 0.047924259  1.13231059
4   1.5435723 12.16677  4.682596 3.896566e-04 0.086406353  0.33353763
5  -1.0275168 12.40099 -4.508445 5.386750e-04 0.095560937  0.03691621

Normalized abundances

     control_24_3 control_24_4 control_24_5 control_24_6 control_24_7

[1,]     13.07592     12.64651     12.24371     12.45800     12.00618
[2,]     13.44991     14.09624     14.22835     13.85055     14.05933
[3,]     16.24021     16.82461     16.19956     16.47579     17.68974
[4,]     10.97001     10.35783     11.89993     10.56033     11.50507
[5,]     14.25346     14.69884     14.11004     14.68608     13.98822

Debian 7

        logFC  AveExpr         t      P.Value   adj.P.Val           B

1   1.5389434 12.82507  7.730738 2.589577e-06 0.002296955  4.74535815
2   1.7838631 17.67135  5.358692 1.149147e-04 0.047919273  1.44330565
3   1.0633463 13.62941  5.164902 1.620719e-04 0.047919273  1.13241108
4   1.5435961 12.16673  4.682718 3.895658e-04 0.086386209  0.33375116
5  -1.0275537 12.40094 -4.508525 5.385908e-04 0.095546006  0.03705773

the normalised abundances look slightly different

     control_24_3 control_24_4 control_24_5 control_24_6 control_24_7

[1,]     13.07588     12.64647     12.24365     12.45795     12.00612
[2,]     13.44988     14.09622     14.22836     13.85052     14.05931
[3,]     16.24018     16.82459     16.19959     16.47577     17.68974
[4,]     10.96996     10.35768     11.89984     10.56020     11.50499
[5,]     14.25343     14.69882     14.11004     14.68606     13.98820

When the normalisation is disabled the results are identical on both GNU/Linux and MacOS. That’s why I am suspecting a rounding issue since the abundances are integers when they go in limma. However, the t-tests, logFC and avgExp in limma are completely wrong as the data have a huge variance.

I asked this question to Wolfgang Huber who advice me to check the point FAQ 7.31 (http://cran.r-project.org/doc/FAQ/R-FAQ.html ) and pointe out that VSN is using an iterative algorithm for a nonlinear optimisation.

For info, under MacOS:

sessionInfo()
R version 3.2.1 (2015-06-18)
Platform: x86_64-apple-darwin14.3.0 (64-bit)
Running under: OS X 10.10.3 (Yosemite)
locale:
[1] en_US.UTF-8/en_US.UTF-8/en_US.UTF-8/C/en_US.UTF-8/en_US.UTF-8
attached base packages:
[1] parallel  stats     graphics  grDevices utils     datasets  methods   base     
other attached packages:
 [1] qvalue_2.0.0         limma_3.24.10        vsn_3.36.0           Biobase_2.28.0       BiocGenerics_0.14.0  argparse_1.0.1      
 [7] proto_0.3-10         magrittr_1.5         dplyr_0.4.2          tidyr_0.2.0          reshape2_1.4.1       BiocInstaller_1.18.3
loaded via a namespace (and not attached):
 [1] Rcpp_0.11.6           plyr_1.8.3            tools_3.2.1           zlibbioc_1.14.0       digest_0.6.8         
 [6] preprocessCore_1.30.0 gtable_0.1.2          lattice_0.20-31       DBI_0.3.1             findpython_1.0.1     
[11] stringr_1.0.0         grid_3.2.1            getopt_1.20.0         R6_2.0.1              ggplot2_1.0.1        
[16] scales_0.2.5          MASS_7.3-40           splines_3.2.1         assertthat_0.1        colorspace_1.2-6     
[21] stringi_0.5-2         affy_1.46.1           munsell_0.4.2         rjson_0.2.15          affyio_1.36.0

Debian, packages and R are of same versions

R version 3.2.1 (2015-06-18)
Platform: x86_64-pc-linux-gnu (64-bit)
Running under: Debian GNU/Linux 7 (wheezy)
other attached packages:
 [1] qvalue_2.0.0        limma_3.24.10       vsn_3.36.0         
 [4] Biobase_2.28.0      BiocGenerics_0.14.0 argparse_1.0.1     
 [7] proto_0.3-10        magrittr_1.5        dplyr_0.4.2        
[10] tidyr_0.2.0         reshape2_1.4.1     

loaded via a namespace (and not attached):
 [1] Rcpp_0.11.6           BiocInstaller_1.18.3  plyr_1.8.3           
 [4] tools_3.2.1           zlibbioc_1.14.0       digest_0.6.8         
 [7] preprocessCore_1.30.0 gtable_0.1.2          lattice_0.20-31      
[10] DBI_0.3.1             findpython_1.0.1      stringr_1.0.0        
[13] grid_3.2.1            getopt_1.20.0         R6_2.0.1             
[16] ggplot2_1.0.1         scales_0.2.5          MASS_7.3-41          
[19] splines_3.2.1         assertthat_0.1        colorspace_1.2-6     
[22] stringi_0.4-1         affy_1.46.1           munsell_0.4.2        
[25] rjson_0.2.15          affyio_1.36.0

If you have experienced this, and have some comments / suggestions, I will be happy to update, correct my script.

Many thanks in advance,

Aurelien

vsn2 limma • 2.3k views

ADD COMMENT • link updated 10.5 years ago by Wolfgang Huber ★ 13k • written 10.5 years ago by aurelien ▴ 10

score 2 · Accepted Answer · 2015-06-30

Aurelien

you said "However, the t-tests, logFC and avgExp in limma are completely wrong as the data have a huge variance." - but when I look at the output tables from limma that you provide above, they look very similar (with differences that should be of no practical consequence). Same for the normalized abundances. Can you clarify?

It would be good to post an example dataset and R script to reproduce your observations.

I agree that in principle, the output of the package on different platforms should be identical. Now the optimization of the likelihood depends on the Fortran/f2c function lbfgsb (part of base R, in src/appl/lbfgsb.c), whose code involves at least one platform dependent parameter, DBL_EPSILON from #include <float.h>). I haven't gone to the bottom of this, but this, or similar places, could be where system or compiler dependence could creep in. (The C code of vsn itself is pretty straightforward, there I haven't noted a place for platform-dependence, although with different C compilers I find it hard to be sure.)

Btw, the application of vsn to proteomics data is proposed here: http://www.ncbi.nlm.nih.gov/pubmed/20382981

Best wishes

Wolfgang

score 1 · Accepted Answer · 2015-07-04

1

Entering edit mode

Wolfgang Huber ★ 13k

@wolfgang-huber-3550

Last seen 3 months ago

EMBL European Molecular Biology Laborat…

Dear Aurelien

Thank you. There are two issues:

Calling limma directly on the unnormalized abundances returns nonsensical results. This is fine, since this is not what one should do.
You find (as far as I can tell from your posted examples here and at the above github link) very minor differences (of the order 10^-4 on a log₂ scale) between the outputs of vsn on different platforms. These are negligible compared to the inherent noise (measurement uncertainty) in your data. Please let me know if you indeed see differences that are of any consequence. Otherwise let's all spend our time on more productive tasks! See also: https://en.wikipedia.org/wiki/False_precision

Kind regards

Wolfgang

ADD COMMENT • link 10.5 years ago Wolfgang Huber ★ 13k

0

Entering edit mode

Dear Wolfgang,

thanks again for your kind help.

I was feeding limma with unnormalised abundances just to narrow down the origin of the discrepancy I was observing. Actually, I have a shiny server that run under debian to share the results internally, while I wrote scripts on mac. That is how I found out.

As the scale of the whole dataset, the kind of difference I found is 722 proteins retained for a given qvalue threshold while on the other platform I got 743. Nevertheless, I will stop wasting you time and stick to one platform to keep the results comparable.

Thanks again, and sorry for the time you spent on this matter.

Best wishes,

Aurélien

ADD REPLY • link 10.5 years ago aurelien ▴ 10

0

Entering edit mode

To try come to a productive end: can you save the (unadjusted) p-values & probe IDs from both platforms into a file, and plot them against each other in a scatterplot (preferably log-scale). Ideally add to your github site. Just to make sure this is a threshold instability and not something more sinister.

ADD REPLY • link 10.5 years ago Wolfgang Huber ★ 13k

1

Entering edit mode

Dear Wolfgang, this is a good idea, the resulting scatter plot looks like:

scatter