Hi Lubov,
thank you for using qsea, for your interest, and for your excellent questions.
1) The idea is to compute CNVs from (MeDIP or MBD) enriched reads for large windows, typically 1-2 MB. For this analysis only read fragments without any CpGs are considered. If there are not sufficient reads for a region of that size, no CNVs are calculated for this window. These regions appear gray in the CNV plot.
2) In order to estimate the average number of CpGs per fragment within a genomic window of length l_wd, I assume that a fragment may be centered at each genomic position with equal probability. In the (simpler) case of a fixed fragment length l_f (e.g. sd of fragment length is set to 0), each CpG is contained in exactly l_f potential fragments (Figure A). As fragments are assigned uniquely to the window containing its center position, each window is covered by l_wd potential fragments. The CpG density is the average number of CpGs in each of these potential fragments. By this definition CpGs in the l_f / 2 neighborhood of window also have an impact on the CpG density of that window.
If fragment length sd is set >0, the fragment length is modeled with a gaussian distribution (with tails cut at mean +/-3 sd). This way CpGs further away may also influence the local CpG density (Figure B and C).
Figure: CpG density estimation.
(A) Fraction of fragments assigned to genomic window containing CpG, assuming fixed fragment size of 200 bases. Horizontal bars represent potential sequencing fragments, orange bars are assigned to the window of interest; every 10 th fragment is depicted. A CpG at the center position of the window is contained in 200 of the 250 potential fragments of the window (80%). (B) Probability that a CpG is contained in a fragment with expected length 200, dependent on the distance to the fragment center, for different standard derivations. (C) Fraction of fragments assigned to genomic window containing CpG, assuming normally distributed fragment length with mean of 200 bases and different standard derivations.
3) getOffset returns the currently specified "background reads" of the samples. The efficiency of the MeDIP enrichment step is highly variable and can range from 25 to >100 fold. As a result, the fraction of "original" input fragments within the MeDIP enriched sample typically varies between 2% and 40%. This fraction of input fragments is the cause of the observation of methylation independent "background reads". In order to correctly compensate for the background reads, the sample specific offset parameter must be estimated. For this purpose, I make use of regions that lack CpG dinucleotides, and which should therefore show no MeDIP enrichment. The average number of fragments covering these CpG absent windows relative to the overall average number of fragments provides an estimate for the fraction of background reads. This method is implemented in the 'addOffset' function. If the result varies more than expected between samples, maybe the method is not appropriate in your specific case. You can set your own offset estimate by specifying the offset parameter in 'addOffset', see ?addOffset. However, maybe experimental conditions and/or sample quality does vary between samples. You may also want to consider discarding the samples with high offset fraction, as it indicates poor enrichment.
4) Fold changes are maximum likelihood estimates, computed by the GLM algorithm. Further, 'psydocounts' are added to avoid division by zero. This explains the difference in the rpm normalization. For the beta normalization, the GLM does actually not apply the transformation, but uses the scaling factors only, so it the result cannot be interpreted as a logFC of beta values. In my experience, the difference of beta values (e.g. % methylation) is more meaningful.
5) These tasks are performed with the makeTable function, see ?makeTable. If you have any specific questions, let me know.
Best, Matthias
Matthias, thank you very much for your answers! I will add more questions later.