Hello,
I'm using qpgraph to calculate Pearson correlation over a set of gene expression values. Since I don't have a lot of samples I'm only actually using the qpPCC function (which is faster than cor.test), so I'm not really calculating partial correlations and after having the coefficients I'm using other tools downstream.
However, I noticed some discrepancies in the Pearson's r coefficients calculated using qpPCC and using cor.test, which from the manual shouldn't exist. Particularly, this happens when NA values are present in the data.frame being passed to qpPCC. When NA values are present, I get the warning In .Primitive("sqrt")(x@x) : NaNs produced
and in my output data.frame (both in the R and P slots) I get slightly different values than what I get when using cor.test.
Example:
> data VAR1 VAR2 VAR3 SAMPLE1 7.126911 6.961971 8.571791 SAMPLE2 6.954778 6.697107 10.102002 SAMPLE3 6.474923 7.255029 9.294644 SAMPLE4 5.871351 7.952916 4.192983 SAMPLE5 7.215776 7.915700 5.145677 SAMPLE6 7.712733 8.277148 5.845239 SAMPLE7 5.911692 8.036558 3.928844 SAMPLE8 7.463851 7.749802 4.732812 SAMPLE9 6.647602 7.858230 4.060912
The Pearson's correlation between VAR1 and VAR2:
cor.test(data[, 1], data[, 2], method = "pearson") # this gives me r=-0.0534991 , p-value = 0.8913
Now using qpPCC -- results are identical to cor.test.
> qpPCC(data, long.dim.are.variables=FALSE) $R 3 x 3 Matrix of class "dspMatrix" VAR1 VAR2 VAR3 VAR1 1.0000000 -0.0534991 0.2250407 VAR2 -0.0534991 1.0000000 -0.8760690 VAR3 0.2250407 -0.8760690 1.0000000 $P 3 x 3 Matrix of class "dspMatrix" VAR1 VAR2 VAR3 VAR1 0.0000000 0.891272149 0.560460408 VAR2 0.8912721 0.000000000 0.001950495 VAR3 0.5604604 0.001950495 0.000000000
Now I add an NA value to my data, which now looks like this
:
> data VAR1 VAR2 VAR3 SAMPLE1 7.126911 6.961971 8.571791 SAMPLE2 6.954778 6.697107 NA SAMPLE3 6.474923 7.255029 9.294644 SAMPLE4 5.871351 7.952916 4.192983 SAMPLE5 7.215776 7.915700 5.145677 SAMPLE6 7.712733 8.277148 5.845239 SAMPLE7 5.911692 8.036558 3.928844 SAMPLE8 7.463851 7.749802 4.732812 SAMPLE9 6.647602 7.858230 4.060912
And calculate the same correlations between VAR1 x VAR2. First using cor.test:
> cor.test(data[, 1], data[, 2], method = "pearson")
Pearson's product-moment correlation data: data[, 1] and data[, 2] t = -0.1417, df = 7, p-value = 0.8913 alternative hypothesis: true correlation is not equal to 0 95 percent confidence interval: -0.6929986 0.6331172 sample estimates: cor -0.0534991
Now using qpPCC:
> qpPCC(data, long.dim.are.variables=FALSE) $R 3 x 3 Matrix of class "dspMatrix" VAR1 VAR2 VAR3 VAR1 1.000000000 -0.003023527 0.2232823 VAR2 -0.003023527 1.000000000 -0.7994699 VAR3 0.223282264 -0.799469888 1.0000000 $P 3 x 3 Matrix of class "dspMatrix" VAR1 VAR2 VAR3 VAR1 0.0000000 0.993840568 0.563609481 VAR2 0.9938406 0.000000000 0.009712148 VAR3 0.5636095 0.009712148 0.000000000
And I get a totally different value for the correlations between VAR1 and VAR2 (r=-0.003023527).
I think I'm failing to see the differences between the correlations calculated by qpPCC and using cor. I thought they should both produce identical results, mostly due to what's written in the help of qpPCC: ("The calculations made by this function are the same as the ones made for a single pair of variables by the function cor.test
but for all the pairs of variables in the data set.").
I don't see why the change of value in one variable would alter the correlations between other, 'independent', variables in the dataset (specially since qpPCC shouldn't calculate partial correlations). If someone could shed some light on this I would greatly appreciate specially because qpPCC is so much faster in calculating correlations for big datasets.
All best,
Pablo
> sessionInfo() R version 3.1.2 (2014-10-31) Platform: x86_64-pc-linux-gnu (64-bit) locale: [1] LC_CTYPE=en_US.UTF-8 LC_NUMERIC=C LC_TIME=pt_BR.UTF-8 [4] LC_COLLATE=en_US.UTF-8 LC_MONETARY=pt_BR.UTF-8 LC_MESSAGES=en_US.UTF-8 [7] LC_PAPER=pt_BR.UTF-8 LC_NAME=C LC_ADDRESS=C [10] LC_TELEPHONE=C LC_MEASUREMENT=pt_BR.UTF-8 LC_IDENTIFICATION=C attached base packages: [1] stats graphics grDevices utils datasets methods base other attached packages: [1] BiocInstaller_1.16.5 qpgraph_2.0.5 loaded via a namespace (and not attached): [1] annotate_1.44.0 AnnotationDbi_1.28.2 base64enc_0.1-2 [4] BatchJobs_1.6 BBmisc_1.9 Biobase_2.26.0 [7] BiocGenerics_0.12.1 BiocParallel_1.0.3 biomaRt_2.22.0 [10] Biostrings_2.34.1 bitops_1.0-6 brew_1.0-6 [13] checkmate_1.5.3 codetools_0.2-11 DBI_0.5-1 [16] digest_0.6.8 fail_1.2 foreach_1.4.2 [19] GenomeInfoDb_1.2.5 GenomicAlignments_1.2.2 GenomicFeatures_1.18.7 [22] GenomicRanges_1.18.4 graph_1.44.1 grid_3.1.2 [25] IRanges_2.0.1 iterators_1.0.7 lattice_0.20-31 [28] magrittr_1.5 Matrix_1.2-0 mvtnorm_1.0-2 [31] parallel_3.1.2 qtl_1.39-5 RCurl_1.95-4.6 [34] Rgraphviz_2.10.0 Rsamtools_1.18.3 RSQLite_1.0.0 [37] rtracklayer_1.26.3 S4Vectors_0.4.0 sendmailR_1.2-1 [40] stats4_3.1.2 stringi_0.4-1 stringr_1.0.0 [43] tools_3.1.2 XML_3.98-1.1 xtable_1.7-4 [46] XVector_0.6.0 zlibbioc_1.12.0
Hi Robert, thank you for looking up into this and for the fast fix. I'm acepting your answer so solution for anyone that encounters this issue will be to update qpgraph to package 2.8.3 or higher.
Again, many thanks.
Pablo