Question

wrong output of DNAcopy::segment() for chr. with empty copy number data

0

Entering edit mode

Alexandre Kuhn ▴ 20

@alexandre-kuhn-4386

Last seen 3.8 years ago

Hello,

I encountered a rare problem with segment() in the DNAcopy package. Specifically I see artifactual segments with ouptut$loc.start greater than output$end.start and output$num.mark=0 and output$seg.mean=NA.

In my case, it seems to be caused by "chromosomes" (contigs) with empty copy number data point (NAs). Here is a minimal example to try and illustrate the problem:

> segment(CNA(genomdat=c(1,2,1,NA,3),chrom=c(rep("chr1",3),"chr2","chr3"),maploc=c(1:3,1,1)))

Analyzing: Sample.1 

Call:

segment(x = CNA(genomdat = c(1, 2, 1, NA, 3), chrom = c(rep("chr1", 

    3), "chr2", "chr3"), maploc = c(1:3, 1, 1)))


        ID chrom loc.start loc.end num.mark seg.mean

1 Sample.1  chr1         1       3        3   1.3333

2 Sample.1  chr1         1       3        0       NA

3 Sample.1  chr3         1       1        1   3.0000

So the second segment in the output is assigned to chr1 and has num.mark of 0. It should be omitted I believe. SegRows in segment() output would also show a problem as startRow is greater than endRow for that second segment (with real data I have also seen examples where, in addition, loc.start would be greater than loc.end, which is not the case here).

A simple workaround is to eliminate segments with num.mark=0 and seg.mean=NA. However, it would be great to fix the segment() function directly (if needed). Thanks!

> sessionInfo()

R version 3.4.1 (2017-06-30)

Platform: x86_64-pc-linux-gnu (64-bit)

Running under: CentOS release 6.8 (Final)


Matrix products: default

BLAS: /home/akuhn/R/R-3.4.1/lib/libRblas.so

LAPACK: /home/akuhn/R/R-3.4.1/lib/libRlapack.so


locale:

 [1] LC_CTYPE=en_US.UTF-8       LC_NUMERIC=C              

 [3] LC_TIME=en_US.UTF-8        LC_COLLATE=en_US.UTF-8    

 [5] LC_MONETARY=en_US.UTF-8    LC_MESSAGES=en_US.UTF-8   

 [7] LC_PAPER=en_US.UTF-8       LC_NAME=C                 

 [9] LC_ADDRESS=C               LC_TELEPHONE=C            

[11] LC_MEASUREMENT=en_US.UTF-8 LC_IDENTIFICATION=C       


attached base packages:

[1] stats     graphics  grDevices utils     datasets  methods   base     


other attached packages:

[1] DNAcopy_1.50.1


loaded via a namespace (and not attached):

[1] compiler_3.4.1

DNAcopy dnacopy • 1.4k views

ADD COMMENT • link 6.7 years ago Alexandre Kuhn ▴ 20

score 0 · Answer 1 · 2017-08-18

0

Entering edit mode

markus.riester ▴ 130

@markusriester-9875

Last seen 22 months ago

United States

I think you answered your question, segment() doesn't like NAs anywhere.

ADD COMMENT • link 6.7 years ago markus.riester ▴ 130

score 0 · Answer 2 · 2017-08-21

It does not seem to be true. segment() handles NAs correctly if there are other, non-NA values on the same chromosome:

> segment(CNA(genomdat=c(1,2,1,NA,2,3),chrom=c(rep("chr1",3),rep("chr2",2),"chr3"),maploc=c(1:3,1:2,1)))

Analyzing: Sample.1 

Call:

segment(x = CNA(genomdat = c(1, 2, 1, NA, 2, 3), chrom = c(rep("chr1", 

    3), rep("chr2", 2), "chr3"), maploc = c(1:3, 1:2, 1)))


        ID chrom loc.start loc.end num.mark seg.mean

1 Sample.1  chr1         1       3        3   1.3333

2 Sample.1  chr2         2       2        1   2.0000

3 Sample.1  chr3         1       1        1   3.0000

However, the (problematic) case where read count data are NAs for the entire contig length does occur in real world analyses (and, in particular these days with draft genome assemblies that may have numerous, smaller contigs). It would thus be helpful to users if segment() would handle this case properly.