I’m very interested in using CopywriteR to detect copy number variation in target capture sequencing data. I’m looking to select a tool to implement in a high-throughput bioinformatic workflow. Therefore, I won’t be able to manually assess each output plot or IGV file. I was hoping to get some advice as to how I could access the results in a more streamlined way. I assume the per-chromosome plots are generated from data in the segment.Rdata file? Would I need to set a cutoff and determine amplified/depleted loci from that file?
The typical output from a CoywriteR run looks as follows:
-rw-r--r-- 1 t.kuilman domain users 18K Jan 22 14:19 CopywriteR.log -rw-r--r-- 1 t.kuilman domain users 492 Jan 22 14:10 input.Rdata -rw-r--r-- 1 t.kuilman domain users 23M Jan 22 14:10 log2_read_counts.igv drwxr-xr-x 16 t.kuilman domain users 2.0K Jan 22 14:19 plots/ drwxr-xr-x 2 t.kuilman domain users 1.6K Jan 22 14:10 qc/ -rw-r--r-- 1 t.kuilman domain users 22M Jan 22 14:08 read_counts.txt -rw-r--r-- 1 t.kuilman domain users 11M Jan 22 14:18 segment.Rdata
You can read in the vignette of the tool (which you can get by typing browseVignette(“CopywriteR”)) what they are in detail.
With regards to copy number profiling, one can in general divide analysis into several stages, each requiring their own tools. These are the actual acquisition of the raw data (this is what CopywriteR does), segmentation of the data (to see whether there are consecutive copy number data points that have a similar value), and finally the ‘calling' of the data, where one determines which regions are gained or lost (and sometimes even an estimation of the actual copy number). The raw data processed by CopywriteR are in the log2_read_counts.igv file, while the segmentation values that are used for plotting can be found in the segment.Rdata file. The per-chromosome plots are based upon the raw (black dots) and segmented (red lines) data.
I presume what you want is to call amplifications and deletions, i.e. perform the third ’shell’ of analysis. You could simply use fixed cutoffs (for instance log2(3 / 2) = 0.5850 for gain and log2(1 / 2) = -1 for loss. Alternatively, you could use a dedicated tool for the actual calling. We have mainly experience with the Bioconductor package CGHcall. This package performs segmentation and calling (the last two steps in the analysis) and therefore you could use the data in the log2_read_counts.igv file as input. Alternatively, if you are not interested in single samples but repetitively gained / lost regions, you could try the ADMIRE algorithm (http://www.ncbi.nlm.nih.gov/pubmed/23476020).
I am sure there are more packages that you could try, also in Bioconductor (see BiocViews term “CopyNumberVariation”), but my experience with any of these is limited.
You said that you have experience with CGHcall. Would you mind providing some example code as to how you would use the log2_read_counts.igv file as input with CGHcall? I'm reading the vignette now, but I'm a bit confused as to where the CopywriteR output intersects the CGHcall workflow. I'm sorry to ask such a rudimentary question, but I'm quite new to this kind of analysis.
Is there a way to extract calls (-2,-1,0,1,2) and segmented values directly from the segment.Rdata file, or do we have to go from the log2_read_counts.igv file? I would like to create a heatmap of gains and losses across all the chromosomes of a bunch of samples combined. Could you provide example code on how to extract the necessary info from the segment.Rdata file if it is possible?