Search
Question: wrong output of DNAcopy::segment() for chr. with empty copy number data
0
11 months ago by
Alexandre Kuhn20 wrote:

Hello,

I encountered a rare problem with segment() in the DNAcopy package. Specifically I see artifactual segments with ouptut$loc.start greater than output$end.start and output$num.mark=0 and output$seg.mean=NA.

In my case, it seems to be caused by "chromosomes" (contigs) with empty copy number data point (NAs). Here is a minimal example to try and illustrate the problem:

> segment(CNA(genomdat=c(1,2,1,NA,3),chrom=c(rep("chr1",3),"chr2","chr3"),maploc=c(1:3,1,1)))

Analyzing: Sample.1

Call:

segment(x = CNA(genomdat = c(1, 2, 1, NA, 3), chrom = c(rep("chr1",

3), "chr2", "chr3"), maploc = c(1:3, 1, 1)))

ID chrom loc.start loc.end num.mark seg.mean

1 Sample.1  chr1         1       3        3   1.3333

2 Sample.1  chr1         1       3        0       NA

3 Sample.1  chr3         1       1        1   3.0000

So the second segment in the output is assigned to chr1 and has num.mark of 0. It should be omitted I believe. SegRows in segment() output would also show a problem as startRow is greater than endRow for that second segment (with real data I have also seen examples where, in addition, loc.start would be greater than loc.end, which is not the case here).

A simple workaround is to eliminate segments with num.mark=0 and seg.mean=NA. However, it would be great to fix the segment() function directly (if needed). Thanks!

> sessionInfo()

R version 3.4.1 (2017-06-30)

Platform: x86_64-pc-linux-gnu (64-bit)

Running under: CentOS release 6.8 (Final)

Matrix products: default

BLAS: /home/akuhn/R/R-3.4.1/lib/libRblas.so

LAPACK: /home/akuhn/R/R-3.4.1/lib/libRlapack.so

locale:

[1] LC_CTYPE=en_US.UTF-8       LC_NUMERIC=C

[3] LC_TIME=en_US.UTF-8        LC_COLLATE=en_US.UTF-8

[5] LC_MONETARY=en_US.UTF-8    LC_MESSAGES=en_US.UTF-8

[7] LC_PAPER=en_US.UTF-8       LC_NAME=C

[11] LC_MEASUREMENT=en_US.UTF-8 LC_IDENTIFICATION=C

attached base packages:

[1] stats     graphics  grDevices utils     datasets  methods   base

other attached packages:

[1] DNAcopy_1.50.1

loaded via a namespace (and not attached):

[1] compiler_3.4.1

modified 11 months ago • written 11 months ago by Alexandre Kuhn20
0
11 months ago by
markus.riester90 wrote:

0
11 months ago by
Alexandre Kuhn20 wrote:

It does not seem to be true. segment() handles NAs correctly if there are other, non-NA values on the same chromosome:

> segment(CNA(genomdat=c(1,2,1,NA,2,3),chrom=c(rep("chr1",3),rep("chr2",2),"chr3"),maploc=c(1:3,1:2,1)))

Analyzing: Sample.1

Call:

segment(x = CNA(genomdat = c(1, 2, 1, NA, 2, 3), chrom = c(rep("chr1",

3), rep("chr2", 2), "chr3"), maploc = c(1:3, 1:2, 1)))

ID chrom loc.start loc.end num.mark seg.mean

1 Sample.1  chr1         1       3        3   1.3333

2 Sample.1  chr2         2       2        1   2.0000

3 Sample.1  chr3         1       1        1   3.0000

However, the (problematic) case where read count data are NAs for the entire contig length does occur in real world analyses (and, in particular these days with draft genome assemblies that may have numerous, smaller contigs). It would thus be helpful to users if segment() would handle this case properly.

I stand corrected. I'm sure though there are other corner cases where it doesn't work with NAs, since I've stumbled across this "negative segment sizes with NAs" a few times before. But you are probably right that this is a bug worth fixing.