Search
Question: qpgraph's qpPCC function giving different results than cor.test()
1
21 months ago by
PabloR0
PabloR0 wrote:

Hello,

I'm using qpgraph to calculate Pearson correlation over a set of gene expression values.  Since I don't have a lot of samples I'm only actually using the qpPCC function (which is faster than cor.test), so I'm not really calculating partial correlations and after having the coefficients I'm using other tools downstream.

However, I noticed some discrepancies in the Pearson's r coefficients calculated using qpPCC and using cor.test, which from the manual shouldn't exist. Particularly, this happens when NA values are present in the data.frame being passed to qpPCC. When NA values are present, I get the warning In .Primitive("sqrt")(x@x) : NaNs produced and in my output data.frame (both in the R and P slots) I get slightly different values than what I get when using cor.test.

Example:

> data
VAR1     VAR2       VAR3
SAMPLE1 7.126911 6.961971  8.571791
SAMPLE2 6.954778 6.697107 10.102002
SAMPLE3 6.474923 7.255029  9.294644
SAMPLE4 5.871351 7.952916  4.192983
SAMPLE5 7.215776 7.915700  5.145677
SAMPLE6 7.712733 8.277148  5.845239
SAMPLE7 5.911692 8.036558  3.928844
SAMPLE8 7.463851 7.749802  4.732812
SAMPLE9 6.647602 7.858230  4.060912

The Pearson's correlation between VAR1 and VAR2:

cor.test(data[, 1], data[, 2], method = "pearson")
# this gives me r=-0.0534991 , p-value = 0.8913

Now using qpPCC -- results are identical to cor.test.

> qpPCC(data, long.dim.are.variables=FALSE)
$R 3 x 3 Matrix of class "dspMatrix" VAR1 VAR2 VAR3 VAR1 1.0000000 -0.0534991 0.2250407 VAR2 -0.0534991 1.0000000 -0.8760690 VAR3 0.2250407 -0.8760690 1.0000000$P
3 x 3 Matrix of class "dspMatrix"
VAR1        VAR2         VAR3
VAR1 0.0000000 0.891272149 0.560460408
VAR2  0.8912721 0.000000000 0.001950495
VAR3   0.5604604 0.001950495 0.000000000

Now I add an NA value to my data, which now looks like this:

> data
VAR1     VAR2      VAR3
SAMPLE1 7.126911 6.961971 8.571791
SAMPLE2 6.954778 6.697107       NA
SAMPLE3 6.474923 7.255029 9.294644
SAMPLE4 5.871351 7.952916 4.192983
SAMPLE5 7.215776 7.915700 5.145677
SAMPLE6 7.712733 8.277148 5.845239
SAMPLE7 5.911692 8.036558 3.928844
SAMPLE8 7.463851 7.749802 4.732812
SAMPLE9 6.647602 7.858230 4.060912

And calculate the same correlations between VAR1 x VAR2. First using cor.test:
> cor.test(data[, 1], data[, 2], method = "pearson")

	Pearson's product-moment correlation

data:  data[, 1] and data[, 2]
t = -0.1417, df = 7, p-value = 0.8913
alternative hypothesis: true correlation is not equal to 0
95 percent confidence interval:
-0.6929986  0.6331172
sample estimates:
cor
-0.0534991

Now using qpPCC:

> qpPCC(data, long.dim.are.variables=FALSE)
$R 3 x 3 Matrix of class "dspMatrix" VAR1 VAR2 VAR3 VAR1 1.000000000 -0.003023527 0.2232823 VAR2 -0.003023527 1.000000000 -0.7994699 VAR3 0.223282264 -0.799469888 1.0000000$P
3 x 3 Matrix of class "dspMatrix"
VAR1        VAR2         VAR3
VAR1 0.0000000 0.993840568 0.563609481
VAR2  0.9938406 0.000000000 0.009712148
VAR3   0.5636095 0.009712148 0.000000000

And I get a totally different value for the correlations between VAR1 and VAR2 (r=-0.003023527).

I think I'm failing to see the differences between the correlations calculated by qpPCC and using cor.  I thought they should both produce identical results, mostly due to what's written in the help of qpPCC: ("The calculations made by this function are the same as the ones made for a single pair of variables by the function cor.test but for all the pairs of variables in the data set.").

I don't see why the change of value in one variable would alter the correlations between other, 'independent', variables in the dataset (specially since qpPCC shouldn't calculate partial correlations). If someone could shed some light on this I would greatly appreciate specially because qpPCC is so much faster in calculating correlations for big datasets.

All best,

Pablo

> sessionInfo()
R version 3.1.2 (2014-10-31)
Platform: x86_64-pc-linux-gnu (64-bit)

locale:
[1] LC_CTYPE=en_US.UTF-8       LC_NUMERIC=C               LC_TIME=pt_BR.UTF-8
[4] LC_COLLATE=en_US.UTF-8     LC_MONETARY=pt_BR.UTF-8    LC_MESSAGES=en_US.UTF-8
[10] LC_TELEPHONE=C             LC_MEASUREMENT=pt_BR.UTF-8 LC_IDENTIFICATION=C

attached base packages:
[1] stats     graphics  grDevices utils     datasets  methods   base

other attached packages:
[1] BiocInstaller_1.16.5 qpgraph_2.0.5

loaded via a namespace (and not attached):
[1] annotate_1.44.0         AnnotationDbi_1.28.2    base64enc_0.1-2
[4] BatchJobs_1.6           BBmisc_1.9              Biobase_2.26.0
[7] BiocGenerics_0.12.1     BiocParallel_1.0.3      biomaRt_2.22.0
[10] Biostrings_2.34.1       bitops_1.0-6            brew_1.0-6
[13] checkmate_1.5.3         codetools_0.2-11        DBI_0.5-1
[16] digest_0.6.8            fail_1.2                foreach_1.4.2
[19] GenomeInfoDb_1.2.5      GenomicAlignments_1.2.2 GenomicFeatures_1.18.7
[22] GenomicRanges_1.18.4    graph_1.44.1            grid_3.1.2
[25] IRanges_2.0.1           iterators_1.0.7         lattice_0.20-31
[28] magrittr_1.5            Matrix_1.2-0            mvtnorm_1.0-2
[31] parallel_3.1.2          qtl_1.39-5              RCurl_1.95-4.6
[34] Rgraphviz_2.10.0        Rsamtools_1.18.3        RSQLite_1.0.0
[37] rtracklayer_1.26.3      S4Vectors_0.4.0         sendmailR_1.2-1
[40] stats4_3.1.2            stringi_0.4-1           stringr_1.0.0
[43] tools_3.1.2             XML_3.98-1.1            xtable_1.7-4
[46] XVector_0.6.0           zlibbioc_1.12.0       

modified 21 months ago by Robert Castelo2.2k • written 21 months ago by PabloR0
1
21 months ago by
Robert Castelo2.2k
Spain/Barcelona/Universitat Pompeu Fabra
Robert Castelo2.2k wrote:

hi,

thanks for bringing this issue up. i'm afraid the documentation of the function qpPCC() is not accurate because it does behave differently to using cor.test() in the presence of missing values. more concretely, qpPCC() uses only complete observations across all variables in the input data. if you try in your example cor.test(data[-2, 1], data[-2, 2]) you'll get the same correlation value and thanks to your report i noticed that there was a mistake in the calculation of the p-value (it was using the original sample size instead of the number of complete observations, so, right now you'll be getting a different p-value). i've submitted a new release version of qpgraph (2.8.3), which should be available in the next 24/48 hrs, which fixes this bug (so correlation and p-values will match for complete observations) and amends the documentation of the qpPCC() function.

thanks again for the bug report.

robert.

Hi Robert, thank you for looking up into this and for the fast fix.  I'm acepting your answer so solution for anyone that encounters this issue will be to update qpgraph to package 2.8.3 or higher.

Again, many thanks.

Pablo