Hello,
First let me thank you for this publicly available and great software that I previously successfully used for genotype calling.
I just started using crlmm for copy number estimation, and did some checks to ensure the stability of estimates , and found some weird results.
I used the publicly available HOMOS dataset (63 affy CEL files) from Hapmap PhaseIII (http://hapmap.ncbi.nlm.nih.gov/downloads/genotypes/latest_phaseIII_ncbi_b36/plink_format/)
The copy number estimates were extracted using the rawCopynumber() function.
1)
There are consistently 983952 estimates that are NA.
Among which all non SNPs (isSnp == FALSE), + 38337.
More precisely,
ca1=CA(cnset, j = 1)[, 1]
cb1=CB(cnset, j = 1)[, 1]
table(snp_data$isSnp[is.na(ca1)])
# FALSE TRUE
# 945615 38337
table(snp_data$isSnp[is.na(cb1)])
# TRUE
# 37159
From this I guess that all CAs are NA for NPs, and some CBs are NA, but only for SNPs.
This is absolutely not consistent with example datasets cnSetExample and cnSetExample2 for which there is no NA, but zeros. Thus the totalCopynumber/rawCopynumber are correct only for the examples, since they return NA for all NPs
Note that in the manual, in the example for the copynumberAccessors entry there is this comment:
## note, cb is NA at nonpolymorphic loci
So from what I understand, the probably correct way to compute the copy numbers would be to first set to zero out the CA for the NPs, then to sum CA+CB. They would still be some NAs, corresponding to actual unknown values.
Could you please confirm this ?
2) they are inconsistent results between the parallel and non-parallel computations.
More precisely, when not using parallel computing (doMC not loaded, and thus not registered),
the estimates are quite close (all.equal() , but not identical()), that is somewhat surprising since the computation used the exact same seed.
But when comparing the results compared in parallel with 4 cores, there are a few but high discrepancies: > d=cn1-cn2 > bads=which(abs(d) > 7) > bads [1] 24732232 69185392 76219567 > cn1[bads] [1] 2.802042 2.924562 1.645865 > cn2[bads] [1] 10.000000 10.000000 8.881849
Is this something expected, due to the different seeds between parallel processes ?
Best regards,
Karl Forner
P.S.1
script: library(ff) library(crlmm) cels_dir <- '/home/docker/qbr/.datapkgs/qb.affy.hapmap.phaseIII.dataset_proj/qb.affy.hapmap.phaseIII.dataset/inst/extdata/HOMOS' cel_files <- dir(cels_dir, full.names = TRUE) batch <- rep("HOMOS", length(cel_files)) ff_dir <- 'homos_crlmm_ff_dir' dir.create(ff_dir) ldPath(ff_dir) ldStatus(TRUE) cnset <- genotype(cel_files, batch = batch) cat('genotype DONE\n') status <- crlmmCopynumber(cnset, MIN.SAMPLES = length(cel_files)) stopifnot(status) cat('crlmmCopynumber DONE\n')
P.S 2
> sessionInfo()
R version 3.1.1 (2014-07-10)
Platform: x86_64-pc-linux-gnu (64-bit)
locale:
[1] LC_CTYPE=en_US.UTF-8 LC_NUMERIC=C
[3] LC_TIME=en_US.UTF-8 LC_COLLATE=en_US.UTF-8
[5] LC_MONETARY=en_US.UTF-8 LC_MESSAGES=en_US.UTF-8
[7] LC_PAPER=en_US.UTF-8 LC_NAME=C
[9] LC_ADDRESS=C LC_TELEPHONE=C
[11] LC_MEASUREMENT=en_US.UTF-8 LC_IDENTIFICATION=C
attached base packages:
[1] stats graphics grDevices utils datasets methods base
other attached packages:
[1] crlmm_1.24.0 preprocessCore_1.28.0 oligoClasses_1.28.0
[4] ff_2.2-13 bit_1.1-12
loaded via a namespace (and not attached):
[1] affyio_1.34.0 base64_1.1 Biobase_2.26.0
[4] BiocGenerics_0.12.1 BiocInstaller_1.16.1 Biostrings_2.34.1
[7] codetools_0.2-8 ellipse_0.3-8 foreach_1.4.2
[10] GenomeInfoDb_1.2.4 GenomicRanges_1.18.4 grid_3.1.1
[13] illuminaio_0.8.0 IRanges_2.0.1 iterators_1.0.7
[16] lattice_0.20-29 Matrix_1.1-4 matrixStats_0.10.0
[19] mvtnorm_1.0-0 parallel_3.1.1 Rcpp_0.11.3
[22] RcppEigen_0.3.2.2.0 R.methodsS3_1.6.1 S4Vectors_0.4.0
[25] stats4_3.1.1 VGAM_0.9-4 XVector_0.6.0
[28] zlibbioc_1.12.0