Question: rounding issues using the VSN package
3.1 years ago, aurelien10 (Luxembourg) wrote:

Dear all,

I am using the R package vsn for normalising proteomics data (abundances) that then go into limma for contrast analysis. I know that vsn was not designed to work with proteomics data, but vsn2 actually does a great job and clearly stabilises the variance; see below.
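For context, the pipeline looks roughly like this. A minimal sketch: the matrix `abund` (proteins × samples) and the factor `group` are illustrative names, not taken from my actual script.

```r
# Minimal sketch of the vsn -> limma pipeline. `abund` is a numeric matrix
# of raw abundances (proteins x samples), `group` a factor of conditions;
# both names are illustrative.
library(vsn)
library(limma)

fit_vsn    <- vsn2(abund)                        # fit the variance-stabilising transform
norm_abund <- predict(fit_vsn, newdata = abund)  # normalised, glog2-scale abundances

design <- model.matrix(~ group)
fit    <- eBayes(lmFit(norm_abund, design))
topTable(fit, coef = 2, number = 5)              # logFC, AveExpr, t, P.Value, adj.P.Val, B
```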

The issue I have is that the results are slightly different depending on which platform I run the R code on.

Example MacOS

        logFC  AveExpr         t      P.Value   adj.P.Val           B

1   1.5389102 12.82511  7.730600 2.590139e-06 0.002297454  4.74513335
2   1.7838522 17.67135  5.358662 1.149218e-04 0.047924259  1.44324118
3   1.0633257 13.62943  5.164849 1.620888e-04 0.047924259  1.13231059
4   1.5435723 12.16677  4.682596 3.896566e-04 0.086406353  0.33353763
5  -1.0275168 12.40099 -4.508445 5.386750e-04 0.095560937  0.03691621

Normalized abundances

     control_24_3 control_24_4 control_24_5 control_24_6 control_24_7

[1,]     13.07592     12.64651     12.24371     12.45800     12.00618
[2,]     13.44991     14.09624     14.22835     13.85055     14.05933
[3,]     16.24021     16.82461     16.19956     16.47579     17.68974
[4,]     10.97001     10.35783     11.89993     10.56033     11.50507
[5,]     14.25346     14.69884     14.11004     14.68608     13.98822

Debian 7

        logFC  AveExpr         t      P.Value   adj.P.Val           B

1   1.5389434 12.82507  7.730738 2.589577e-06 0.002296955  4.74535815
2   1.7838631 17.67135  5.358692 1.149147e-04 0.047919273  1.44330565
3   1.0633463 13.62941  5.164902 1.620719e-04 0.047919273  1.13241108
4   1.5435961 12.16673  4.682718 3.895658e-04 0.086386209  0.33375116
5  -1.0275537 12.40094 -4.508525 5.385908e-04 0.095546006  0.03705773

The normalised abundances also look slightly different:

     control_24_3 control_24_4 control_24_5 control_24_6 control_24_7

[1,]     13.07588     12.64647     12.24365     12.45795     12.00612
[2,]     13.44988     14.09622     14.22836     13.85052     14.05931
[3,]     16.24018     16.82459     16.19959     16.47577     17.68974
[4,]     10.96996     10.35768     11.89984     10.56020     11.50499
[5,]     14.25343     14.69882     14.11004     14.68606     13.98820

When the normalisation is disabled, the results are identical on both GNU/Linux and MacOS. That is why I suspect a rounding issue: the abundances are still integers when they go into limma in that case. However, without normalisation the t-statistics, logFC and AveExpr from limma are completely wrong, as the data have a huge variance.

I asked this question to Wolfgang Huber, who advised me to check FAQ 7.31 (http://cran.r-project.org/doc/FAQ/R-FAQ.html) and pointed out that vsn uses an iterative algorithm for a nonlinear optimisation.
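For anyone landing here, FAQ 7.31 boils down to this (plain R, nothing vsn-specific assumed):

```r
# FAQ 7.31 in a nutshell: most decimal fractions have no exact binary
# representation, so the last bits of a result can depend on the compiler,
# math library and CPU.
0.1 + 0.2 == 0.3            # FALSE: the two doubles differ in the last bits
all.equal(0.1 + 0.2, 0.3)   # TRUE: comparison up to a numeric tolerance
.Machine$double.eps         # 2.220446e-16, the relative machine precision
```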

For info, under MacOS:

sessionInfo()
R version 3.2.1 (2015-06-18)
Platform: x86_64-apple-darwin14.3.0 (64-bit)
Running under: OS X 10.10.3 (Yosemite)
locale:
[1] en_US.UTF-8/en_US.UTF-8/en_US.UTF-8/C/en_US.UTF-8/en_US.UTF-8
attached base packages:
[1] parallel  stats     graphics  grDevices utils     datasets  methods   base
other attached packages:
[1] qvalue_2.0.0         limma_3.24.10        vsn_3.36.0           Biobase_2.28.0       BiocGenerics_0.14.0  argparse_1.0.1
[7] proto_0.3-10         magrittr_1.5         dplyr_0.4.2          tidyr_0.2.0          reshape2_1.4.1       BiocInstaller_1.18.3
loaded via a namespace (and not attached):
[1] Rcpp_0.11.6           plyr_1.8.3            tools_3.2.1           zlibbioc_1.14.0       digest_0.6.8
[6] preprocessCore_1.30.0 gtable_0.1.2          lattice_0.20-31       DBI_0.3.1             findpython_1.0.1
[11] stringr_1.0.0         grid_3.2.1            getopt_1.20.0         R6_2.0.1              ggplot2_1.0.1
[16] scales_0.2.5          MASS_7.3-40           splines_3.2.1         assertthat_0.1        colorspace_1.2-6
[21] stringi_0.5-2         affy_1.46.1           munsell_0.4.2         rjson_0.2.15          affyio_1.36.0 

Under Debian, R and the packages are the same versions:

R version 3.2.1 (2015-06-18)
Platform: x86_64-pc-linux-gnu (64-bit)
Running under: Debian GNU/Linux 7 (wheezy)
other attached packages:
 [1] qvalue_2.0.0        limma_3.24.10       vsn_3.36.0         
 [4] Biobase_2.28.0      BiocGenerics_0.14.0 argparse_1.0.1     
 [7] proto_0.3-10        magrittr_1.5        dplyr_0.4.2        
[10] tidyr_0.2.0         reshape2_1.4.1     

loaded via a namespace (and not attached):
 [1] Rcpp_0.11.6           BiocInstaller_1.18.3  plyr_1.8.3           
 [4] tools_3.2.1           zlibbioc_1.14.0       digest_0.6.8         
 [7] preprocessCore_1.30.0 gtable_0.1.2          lattice_0.20-31      
[10] DBI_0.3.1             findpython_1.0.1      stringr_1.0.0        
[13] grid_3.2.1            getopt_1.20.0         R6_2.0.1             
[16] ggplot2_1.0.1         scales_0.2.5          MASS_7.3-41          
[19] splines_3.2.1         assertthat_0.1        colorspace_1.2-6     
[22] stringi_0.4-1         affy_1.46.1           munsell_0.4.2        
[25] rjson_0.2.15          affyio_1.36.0        

If you have experienced this and have some comments or suggestions, I will be happy to update and correct my script.

Aurelien

modified 3.0 years ago by Wolfgang Huber • written 3.1 years ago by aurelien10
3.1 years ago, Wolfgang Huber (EMBL European Molecular Biology Laboratory) wrote:

Aurelien

You said "However, the t-tests, logFC and avgExp in limma are completely wrong as the data have a huge variance", but when I look at the output tables from limma that you provide above, they look very similar (with differences that should be of no practical consequence). The same goes for the normalised abundances. Can you clarify?

It would be good to post an example dataset and R script to reproduce your observations.

I agree that, in principle, the output of the package on different platforms should be identical. Now, the optimisation of the likelihood depends on the Fortran/f2c function lbfgsb (part of base R, in src/appl/lbfgsb.c), whose code involves at least one platform-dependent parameter, DBL_EPSILON (from #include <float.h>). I haven't gone to the bottom of this, but this, or similar places, could be where system or compiler dependence creeps in. (The C code of vsn itself is pretty straightforward; there I haven't noted a place for platform dependence, although with different C compilers I find it hard to be sure.)
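To illustrate the mechanism (with the standard Rosenbrock test function, not vsn's actual likelihood): base R's optim() exposes the same lbfgsb code, and its stopping rule is expressed in multiples of .Machine$double.eps, so last-bit differences in the objective can shift where the iteration stops.

```r
# Illustration with optim()'s L-BFGS-B (the same lbfgsb code as in vsn);
# the Rosenbrock function stands in for vsn's likelihood.
f <- function(x) (1 - x[1])^2 + 100 * (x[2] - x[1]^2)^2

# `factr`: convergence when the reduction falls below factr * .Machine$double.eps.
loose <- optim(c(-1.2, 1), f, method = "L-BFGS-B", control = list(factr = 1e12))
tight <- optim(c(-1.2, 1), f, method = "L-BFGS-B", control = list(factr = 1e1))
loose$par - tight$par  # the reported optimum typically differs in the last decimals
```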

By the way, the application of vsn to proteomics data is proposed here: http://www.ncbi.nlm.nih.gov/pubmed/20382981

Best wishes

Wolfgang

Dear Wolfgang,

Thanks again for your help. In the tables I first provided, both outputs were processed with vsn2, and the numbers are indeed very close. But at the scale of the whole dataset, when we applied a cut-off on the q-values, we did not retain the same number of proteins on each platform.

When we don't normalise, the limma output just looks weird to me. For example:

        logFC   AveExpr         t     P.Value adj.P.Val         B    qvalue
1    3841.857  4175.768  3.862776 0.002614944 0.2276979 -4.595039 0.1868081
2   19976.029 15492.102  3.507738 0.004864415 0.2276979 -4.595047 0.1868081
3   51863.602 98586.161  3.180047 0.008705247 0.2276979 -4.595055 0.1868081
4    3045.267  5924.546  3.154732 0.009107918 0.2276979 -4.595056 0.1868081

A test case, together with a sample dataset, is available on GitHub at https://github.com/ginolhac/vsn_proteomics for anyone who wants to try it out.

Best wishes,

Aurelien

3.0 years ago, Wolfgang Huber (EMBL European Molecular Biology Laboratory) wrote:

Dear Aurelien

Thank you. There are two issues:

• Calling limma directly on the unnormalized abundances returns nonsensical results. This is fine, since this is not what one should do.
• You find (as far as I can tell from your posted examples here and at the above GitHub link) very minor differences (of the order of 10^-4 on a log2 scale) between the outputs of vsn on different platforms. These are negligible compared to the inherent noise (measurement uncertainty) in your data. Please let me know if you indeed see differences that are of any consequence; otherwise, let's all spend our time on more productive tasks! See also: https://en.wikipedia.org/wiki/False_precision

Kind regards

Wolfgang

Dear Wolfgang,

thanks again for your kind help.

I was feeding limma with unnormalised abundances just to narrow down the origin of the discrepancy I was observing. Actually, I have a Shiny server that runs under Debian to share the results internally, while I wrote the scripts on a Mac. That is how I found out.

At the scale of the whole dataset, the kind of difference I found is 722 proteins retained for a given q-value threshold on one platform versus 743 on the other. Nevertheless, I will stop wasting your time and stick to one platform to keep the results comparable.
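For the record, the instability only concerns values straddling the cut-off; a toy illustration with made-up q-values (the numbers are hypothetical):

```r
# Hypothetical q-values for the same protein on the two platforms,
# differing only by a rounding-level amount:
q_mac   <- 0.04999997
q_linux <- 0.05000002

q_mac   < 0.05  # TRUE:  retained on one platform
q_linux < 0.05  # FALSE: dropped on the other
```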

Thanks again, and sorry for the time you spent on this matter.

Best wishes,

Aurélien

To try to come to a productive end: can you save the (unadjusted) p-values and probe IDs from both platforms into a file, and plot them against each other in a scatter plot (preferably on a log scale)? Ideally, add it to your GitHub site. Just to make sure this is a threshold instability and not something more sinister.
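A sketch of this check, assuming the matched unadjusted p-values are in vectors `pv_mac` and `pv_linux` (illustrative names):

```r
# Scatter plot of unadjusted p-values from the two platforms, matched by
# probe ID; `pv_mac` and `pv_linux` are illustrative names.
plot(pv_mac, pv_linux, log = "xy",
     xlab = "p-value (macOS)", ylab = "p-value (Debian)",
     pch = 16, cex = 0.4)
abline(0, 1, col = "red")  # points should sit on the identity line
```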


Dear Wolfgang, this is a good idea; the resulting scatter plot looks like this: [scatter plot image not shown]

You might also be interested in the density of the difference between the two q-values: [density plot image not shown]

The code was added to GitHub. I am sorry, I cannot upload the full results files, as they are not mine and are part of a PhD study.

I don't think there is anything sinister, so it should be safe to leave it as it is and stick to one platform.
