Question: Call CNV in population with large depth variance
16 months ago by
United States
Xiao Zhang wrote:

Hi all,

I am currently using cn.mops to call CNVs in a plant population of 271 samples. The problem is that the sequencing depth varies from 0.00496X to 40.11X; the average depth is about 7X, and most samples are between 3X and 10X. Should I cluster the samples by depth into several groups and run the analysis per group, or put all the files together for a single calling run? Thank you.

modified 16 months ago by Günter Klambauer • written 16 months ago by Xiao Zhang
16 months ago by
Austria
Günter Klambauer wrote:

Hello Xiao Zhang,

Yes, clustering the samples with respect to sequencing depth is certainly advisable. You can include the higher coverage samples when you analyze the low coverage ones. Let me explain further: say the low coverage samples form group A, the medium coverage samples group B, and the high coverage samples group C. Then you should perform three cn.MOPS runs:

1.) cn.mops on A,B,C with large window length (low resolution) --> CNV calls for low coverage group A.

2.) cn.mops on B,C with a medium window length --> CNV calls for medium coverage group B.

3.) cn.mops on C with a small window length (high resolution) --> CNV calls for high coverage group C.

The reason is that adding more samples with higher coverage improves the read-count estimates for each DNA region. However, in such a joint run the CNV calls for the higher coverage samples are only available at the coarse resolution of the large window length, which is why the higher coverage groups get their own additional runs.
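The tiered scheme above can be sketched as follows. This is a hypothetical illustration in Python (cn.mops itself is an R/Bioconductor package, so this only shows the grouping logic, not the actual calls); the sample names and the 1X/10X cutoffs are made up for the example and should be chosen to fit your own coverage distribution:

```python
# Illustrative per-sample coverages (hypothetical values).
coverages = {"s1": 0.005, "s2": 3.2, "s3": 7.1, "s4": 40.1}

def tier(depth, low=1.0, high=10.0):
    """Assign a sample to a coverage tier (cutoffs are assumptions)."""
    if depth < low:
        return "A"   # low coverage
    elif depth < high:
        return "B"   # medium coverage
    return "C"       # high coverage

groups = {"A": [], "B": [], "C": []}
for sample, depth in coverages.items():
    groups[tier(depth)].append(sample)

# Run 1 (large windows) uses A+B+C and yields calls for group A;
# run 2 (medium windows) uses B+C for group B;
# run 3 (small windows) uses C alone for group C.
run1 = groups["A"] + groups["B"] + groups["C"]
run2 = groups["B"] + groups["C"]
run3 = groups["C"]
```

Each run includes all samples at or above its coverage tier, but only the lowest tier in that run receives its final CNV calls.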

I hope this helps you with the analysis.

Regards,

Günter

Thank you, Günter. Should I set the window length myself, or let the software set it automatically? Which is better? Do you have any references for choosing the window length?

Regards,

Xiao

Hello Xiao,

The program determines the window length automatically based on the sample with the lowest number of reads (lowest coverage). However, I advise doing some calculations and setting this parameter by hand such that, on average, about 50-100 reads map to each window (segment).

The average number of reads per window/segment is: averageReadCount = coverage * windowLength / readLength. Assuming you want an average of 50 reads per segment/window, this gives windowLength = readLength * 50 / coverage. For your low-coverage samples with a coverage of 0.005X, you should use a window length of 50 * 100 / 0.005 = 1e6 bp (assuming a read length of 100). The smallest CNV you will be able to detect is three times this length (determined by cn.mops's parameter "minWidth=3"), i.e. 3e6 bp, so you will only be able to detect very large CNVs.

For a medium coverage of 5X, this formula suggests a window length of 1000 bp, and the smallest detectable CNV will be 3000 bp (with "minWidth=3").
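The arithmetic above can be checked in a few lines. This is plain Python illustrating the rule of thumb, not part of cn.mops (which is an R package); the defaults of a 100 bp read length and 50 reads per window are taken from the examples in this thread:

```python
def window_length(coverage, read_length=100, reads_per_window=50):
    """Window length such that ~reads_per_window reads map to each window,
    derived from averageReadCount = coverage * windowLength / readLength."""
    return read_length * reads_per_window / coverage

min_width = 3  # cn.mops parameter minWidth

# Low coverage example: 0.005X -> 1e6 bp windows, smallest CNV 3e6 bp.
wl_low = window_length(0.005)
smallest_low = min_width * wl_low

# Medium coverage example: 5X -> 1000 bp windows, smallest CNV 3000 bp.
wl_med = window_length(5)
smallest_med = min_width * wl_med
```

Plugging your own per-group mean coverage into `window_length` gives a sensible starting value for each of the three runs.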

Regards,

Günter

Thank you Günter, this helps me a lot!

My other question concerns the data frame returned by the "segmentation" function. It contains columns named "seqname", "start", "end", "width", "strand", "sample", "median", "mean" and "CN". Do the "median" and "mean" here both refer to the I/NI calls? How should I filter this data frame to get more confident CNVs? In another Q&A you said "The farther the value is away from 0, the more likely there is a CNV"; do you have a standard threshold for this?

Regards,

Xiao