Question

NAs with rawCopynumber() and discrepancies with estimated computed in parallel

0

Entering edit mode

kforner • 0

@kforner-7188

Last seen 9.1 years ago

Switzerland

Hello,

First let me thank you for this publicly available and great software that I previously successfully used for genotype calling.
I just started using crlmm for copy number estimation, and did some checks to ensure the stability of estimates , and found some weird results.

I used the publicly available HOMOS dataset (63 affy CEL files) from Hapmap PhaseIII (http://hapmap.ncbi.nlm.nih.gov/downloads/genotypes/latest_phaseIII_ncbi_b36/plink_format/)

The copy number estimates were extracted using the rawCopynumber() function.

1)

There are consistently 983952 estimates that are NA.

Among which all non SNPs (isSnp == FALSE), + 38337.

More precisely,

ca1=CA(cnset, j = 1)[, 1]
cb1=CB(cnset, j = 1)[, 1]
table(snp_data$isSnp[is.na(ca1)])
# FALSE   TRUE
# 945615  38337

table(snp_data$isSnp[is.na(cb1)])
# TRUE
# 37159

From this I guess that all CAs are NA for NPs, and some CBs are NA, but only for SNPs.

This is absolutely not consistent with example datasets cnSetExample and cnSetExample2 for which there is no NA, but zeros. Thus the totalCopynumber/rawCopynumber are correct only for the examples, since they return NA for all NPs

Note that in the manual, in the example for the copynumberAccessors entry there is this comment:
## note, cb is NA at nonpolymorphic loci

So from what I understand, the probably correct way to compute the copy numbers would be to first set to zero out the CA for the NPs, then to sum CA+CB. They would still be some NAs, corresponding to actual unknown values.
Could you please confirm this ?

2) they are inconsistent results between the parallel and non-parallel computations.

More precisely, when not using parallel computing (doMC not loaded, and thus not registered),

the estimates are quite close (all.equal() , but not identical()), that is somewhat surprising since the computation used the exact same seed.

But when comparing the results compared in parallel with 4 cores, there are a few but high discrepancies:

> d=cn1-cn2

> bads=which(abs(d) > 7)
> bads
[1] 24732232 69185392 76219567
> cn1[bads]
[1] 2.802042 2.924562 1.645865
> cn2[bads]
[1] 10.000000 10.000000  8.881849

Is this something expected, due to the different seeds between parallel processes ?

Best regards,

Karl Forner

P.S.1

script:
library(ff)
library(crlmm)

cels_dir <- '/home/docker/qbr/.datapkgs/qb.affy.hapmap.phaseIII.dataset_proj/qb.affy.hapmap.phaseIII.dataset/inst/extdata/HOMOS'

cel_files <- dir(cels_dir, full.names = TRUE)
batch <- rep("HOMOS", length(cel_files))

ff_dir <- 'homos_crlmm_ff_dir'
dir.create(ff_dir)

ldPath(ff_dir)
ldStatus(TRUE)

cnset <- genotype(cel_files, batch = batch)
cat('genotype DONE\n')

status <- crlmmCopynumber(cnset, MIN.SAMPLES = length(cel_files))
stopifnot(status)
cat('crlmmCopynumber DONE\n')

P.S 2

> sessionInfo()
R version 3.1.1 (2014-07-10)
Platform: x86_64-pc-linux-gnu (64-bit)

locale:
[1] LC_CTYPE=en_US.UTF-8       LC_NUMERIC=C
[3] LC_TIME=en_US.UTF-8        LC_COLLATE=en_US.UTF-8
[5] LC_MONETARY=en_US.UTF-8    LC_MESSAGES=en_US.UTF-8
[7] LC_PAPER=en_US.UTF-8       LC_NAME=C
[9] LC_ADDRESS=C               LC_TELEPHONE=C
[11] LC_MEASUREMENT=en_US.UTF-8 LC_IDENTIFICATION=C

attached base packages:
[1] stats     graphics grDevices utils     datasets methods   base

other attached packages:
[1] crlmm_1.24.0          preprocessCore_1.28.0 oligoClasses_1.28.0
[4] ff_2.2-13             bit_1.1-12

loaded via a namespace (and not attached):
[1] affyio_1.34.0        base64_1.1           Biobase_2.26.0
[4] BiocGenerics_0.12.1 BiocInstaller_1.16.1 Biostrings_2.34.1
[7] codetools_0.2-8      ellipse_0.3-8        foreach_1.4.2
[10] GenomeInfoDb_1.2.4   GenomicRanges_1.18.4 grid_3.1.1
[13] illuminaio_0.8.0     IRanges_2.0.1        iterators_1.0.7
[16] lattice_0.20-29      Matrix_1.1-4         matrixStats_0.10.0
[19] mvtnorm_1.0-0        parallel_3.1.1       Rcpp_0.11.3
[22] RcppEigen_0.3.2.2.0 R.methodsS3_1.6.1    S4Vectors_0.4.0
[25] stats4_3.1.1         VGAM_0.9-4           XVector_0.6.0
[28] zlibbioc_1.12.0

crlmm copynumber • 1.4k views

ADD COMMENT • link 9.1 years ago kforner • 0

score 0 · Answer 1 · 2015-03-09

I will answer myself for the record: I got a mail from Rob Scharpf, confirming that it is a bug:

"I will push a fix to bioc. For now, pass the cdfName and perhaps also the genome build you want (hg19 or hg18) through the arguments cdfName and genome, respectively. "

So to fix the problem, it is enough to do:

cnset <- genotype(cel_files, batch = batch, cdfName = 'genomewidesnp6', genome = "hg19")