Reproducibility of DNAcopy segmentation
@ross-patterson-3886
While performing some copy number analysis on data segmented with the DNAcopy package, I have noticed some variations in the output data, and was hoping someone here could help shed some light on that.

Specifically, while running the DNAcopy segmentation on the exact same input data multiple times, I have noticed that the resultant segment data output sometimes contains "extra" segments, caused by the discovery of "extra" breakpoints. In fact, the resultant output data is always different.

Digging into the source code a little bit, I saw what appeared to be calls to some random number generating functions, although not being very familiar with Fortran code I could not tell how or why these numbers were being used, or even if that is the source of segmentation discrepancies.

I know that in the last few years there have been some changes to the segmentation algorithm to allow it to run in near linear time. Did that require introducing non-deterministic behavior? Is there a way to force the segmentation algorithm to run deterministically, such that the output data can be identically reproduced every time the segmentation is run?

Thank you in advance for your help,
Ross Patterson
@sean-davis-490
On Wed, Jan 13, 2010 at 1:42 PM, Ross Patterson <rossjp@gmail.com> wrote:
> Is there a way to force the segmentation algorithm to run
> deterministically, such that the output data can be identically
> reproduced every time the segmentation is run?

Hi, Ross.

DNAcopy uses an empirical distribution for determining significance. The help for segment() gives some details. The authors can perhaps comment on whether or not there is a way to make things run deterministically.

Sean
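To illustrate why an empirical (permutation-based) reference distribution can make results vary between runs, here is a minimal base R sketch. It is not DNAcopy's actual test statistic or algorithm; `perm_pvalue` is a hypothetical helper and the data are simulated. The point is only that a p-value estimated from random permutations changes slightly each time, so a candidate breakpoint whose p-value sits close to the significance cutoff may be accepted in one run and rejected in the next.

```r
# Hypothetical helper (not part of DNAcopy): permutation p-value for a
# difference in means between the first k observations and the rest.
perm_pvalue <- function(x, k, nperm = 1000) {
  stat <- function(v) abs(mean(v[seq_len(k)]) - mean(v[-seq_len(k)]))
  observed <- stat(x)
  # Reference distribution built from random shuffles of the data
  permuted <- replicate(nperm, stat(sample(x)))
  mean(permuted >= observed)
}

# Simulated data with a modest shift in mean after position 30
x <- c(rnorm(30, mean = 0), rnorm(30, mean = 0.5))

# Same data, five runs: each run draws different permutations,
# so the estimated p-values generally differ a little.
pvals <- replicate(5, perm_pvalue(x, k = 30))
print(pvals)
```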
Thanks Sean. Somehow I had not seen that documentation before. I will attempt setting the seed values and report back.

Ross

On Wed, Jan 13, 2010 at 2:00 PM, Sean Davis <seandavi@gmail.com> wrote:
> DNAcopy uses an empirical distribution for determining significance.
> The help for segment() gives some details. The authors can perhaps
> comment on whether or not there is a way to make things run
> deterministically.
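For anyone wanting to try the seed-setting approach mentioned in this reply, here is a minimal sketch. It assumes that the random numbers used by segment() come from R's own generator, so that set.seed() controls them; the thread itself does not confirm this. The input is simulated, and the sample name "sim1" and the seed value are arbitrary placeholders.

```r
library(DNAcopy)

# Simulated log-ratio data standing in for real input (hypothetical)
genomdat <- c(rnorm(100, 0, 0.2), rnorm(50, 0.6, 0.2), rnorm(100, 0, 0.2))
chrom    <- rep(1, 250)
maploc   <- seq_len(250)
cna <- CNA(cbind(genomdat), chrom, maploc,
           data.type = "logratio", sampleid = "sim1")

# Reset the RNG to the same state immediately before each call
set.seed(20100113)
seg1 <- segment(cna, verbose = 0)
set.seed(20100113)
seg2 <- segment(cna, verbose = 0)

# Should be TRUE if the seed fully controls the permutations
identical(seg1$output, seg2$output)
```

If the comparison comes back FALSE, the variation is coming from somewhere other than R's random number stream, and the package authors would be the ones to ask.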