Question

MBDseq analysis using qsea package

0

Entering edit mode

LNathanson • 0

@lnathanson-21694

Last seen 6.3 years ago

I am trying to use R/Bioconductor qsea package for analysis of my MBDseq data. qsea is intuitive in many ways, but the main advantage is that it does not give unsolvable errors. However, there are some parts that I would like to understand better. Honestly, I am a beginner in bioinformatics, so my questions may seem dumb for you, and I apologize in advance. 1) CNV are estimated based on the reads without CpG. How does it apply to the windows with CpG? 2) What exactly is CpG density? For example, I have a window of 250bp, and there are 3 CpG sites. What is CpG density of this window? Is it 6/250=0.024? Or 6/250*100=2.4%? 3) I have 19 samples. getOffset function results in one sample with offset of 0.91, five samples have offset of 0.6-0.7, two samples have offset of 0.4-0.5, and eleven samples have offset of 0.02-0.25. I used the same enrichment model for all samples, they have been processed together. I used the blind enrichment model. How can I reduce the offset? 4) How do you calculate fold change? I tried beta normalization and rpm normalization. When I look at the average beta or rpm values and try to calculate fold change (and then log2 of fold change), I get a value closer to one when I use rpm, but still not the same as QSEA gives me. 5) How can I extract number of raw reads and number of normalized reads per window for each sample? I mean how can I get matrix, where rows will be windows, and columns will be samples (or vice versa), and values will be raw reads or normalized reads?

Thank you very much for your help!

qsea MBDseq • 2.2k views

ADD COMMENT • link updated 11 months ago by Lukas ▴ 20 • written 6.3 years ago by LNathanson • 0

score 1 · Answer 1 · 2019-08-26

Hi Lubov, thank you for using qsea, for your interest, and for your excellent questions.

1) The idea is to compute CNVs from (MeDIP or MBD) enriched reads for large windows, typically 1-2 MB. For this analysis only read fragments without any CpGs are considered. If there are not sufficient reads for a region of that size, no CNVs are calculated for this window. These regions appear gray in the CNV plot.

2) In order to estimate the average number of CpGs per fragment within a genomic window of length l_wd, I assume that a fragment may be centered at each genomic position with equal probability. In the (simpler) case of a fixed fragment length l_f (e.g. sd of fragment length is set to 0), each CpG is contained in exactly l_f potential fragments (Figure A). As fragments are assigned uniquely to the window containing its center position, each window is covered by l_wd potential fragments. The CpG density is the average number of CpGs in each of these potential fragments. By this definition CpGs in the l_f / 2 neighborhood of window also have an impact on the CpG density of that window.

If fragment length sd is set >0, the fragment length is modeled with a gaussian distribution (with tails cut at mean +/-3 sd). This way CpGs further away may also influence the local CpG density (Figure B and C).

Figure: CpG density estimation.

(A) Fraction of fragments assigned to genomic window containing CpG, assuming fixed fragment size of 200 bases. Horizontal bars represent potential sequencing fragments, orange bars are assigned to the window of interest; every 10 th fragment is depicted. A CpG at the center position of the window is contained in 200 of the 250 potential fragments of the window (80%). (B) Probability that a CpG is contained in a fragment with expected length 200, dependent on the distance to the fragment center, for different standard derivations. (C) Fraction of fragments assigned to genomic window containing CpG, assuming normally distributed fragment length with mean of 200 bases and different standard derivations.

3) getOffset returns the currently specified "background reads" of the samples. The efficiency of the MeDIP enrichment step is highly variable and can range from 25 to >100 fold. As a result, the fraction of "original" input fragments within the MeDIP enriched sample typically varies between 2% and 40%. This fraction of input fragments is the cause of the observation of methylation independent "background reads". In order to correctly compensate for the background reads, the sample specific offset parameter must be estimated. For this purpose, I make use of regions that lack CpG dinucleotides, and which should therefore show no MeDIP enrichment. The average number of fragments covering these CpG absent windows relative to the overall average number of fragments provides an estimate for the fraction of background reads. This method is implemented in the 'addOffset' function. If the result varies more than expected between samples, maybe the method is not appropriate in your specific case. You can set your own offset estimate by specifying the offset parameter in 'addOffset', see ?addOffset. However, maybe experimental conditions and/or sample quality does vary between samples. You may also want to consider discarding the samples with high offset fraction, as it indicates poor enrichment.

4) Fold changes are maximum likelihood estimates, computed by the GLM algorithm. Further, 'psydocounts' are added to avoid division by zero. This explains the difference in the rpm normalization. For the beta normalization, the GLM does actually not apply the transformation, but uses the scaling factors only, so it the result cannot be interpreted as a logFC of beta values. In my experience, the difference of beta values (e.g. % methylation) is more meaningful.

5) These tasks are performed with the makeTable function, see ?makeTable. If you have any specific questions, let me know.

Best, Matthias