Normalizing depth coverage between samples (MEDIPS)
pertille

I have MeDIP-seq results, and I need to define a state for each position in the genome (corresponding to a CpG site) that varies between samples. The reason for this question is that the available tools, such as MeDIPS, assign a status to a CpG window rather than to individual CpGs. When I tried to set a window size of 2 (ws=2), the script was killed with an error in bash. This led me to build my own list of CpGs.

I have a table like this. The columns are: chromosome (chr), the start and end position of the base of interest, the expected coverage or input (depth), and the observed coverage for six different animals (depth1-depth6).

Example:

data <- "chr start end depth depth1 depth2 depth3 depth4 depth5 depth6
chr1 3273 3273 7 200 35 1 200 850 0
chr1 3274 3274 3 50 25 5 300 1500 2
chr1 3275 3275 8 600 15 8 100 300 5
chr1 3276 3276 4 30 2 10 59 20 0
chr1 3277 3277 25 20 7 4 600 45 0"
data <- read.table(text=data, header=T)

I need to define a column with the state of each line; the states are: often methylated, alternately methylated, and rarely methylated.

To do that, first I need to normalize the depth between samples to obtain values that can be compared between individuals; second, I have to define the ranges between the states (for now, any range is acceptable).
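For the second part, a minimal sketch of the state assignment could use cut() on the normalized depths. The cutoffs low and high below are arbitrary placeholders, not biologically derived thresholds:

```r
# Hypothetical state classifier: 'low' and 'high' are placeholder
# cutoffs on normalized depth, not derived values.
classify_state <- function(x, low = 10, high = 100) {
  cut(x,
      breaks = c(-Inf, low, high, Inf),
      labels = c("rarely", "alternately", "often"))
}

# e.g. applied to three normalized depths
as.character(classify_state(c(5, 50, 500)))
```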

I have also looked at this function from the edgeR package for R, which is used to normalize RNA-seq data and, it seems, is used in MeDIPS too:

calcNormFactors(object, method=c("TMM","RLE","upperquartile","none"), refColumn = NULL,
logratioTrim = .3, sumTrim = 0.05, doWeighting=TRUE, Acutoff=-1e10, p=0.75)

but I could not apply it to my data yet.

What I hope for as a final result is something like this:

chr start State
chr1 3273 often
chr1 3274 alternately
chr1 3275 rarely
chr1 3276 often
chr1 3277 rarely

but... I would actually be satisfied with just the normalized depth for each sample's coverage.

normalization medips depth edgeR R
Aaron Lun
Not having worked with MeDIP-seq data, I can only make general comments about the suitability of TMM normalization in this context, mostly by considering analogies with ChIP-seq data. So, just keep that in mind.

For starters, it seems like you have counts for each site. If you plan on feeding those counts into calcNormFactors, the resulting normalization factors will be computed under the assumption that there are no differences in methylation across most sites in each sample. This assumption may or may not be appropriate, depending on the biology of your system. Also note that the function does not distinguish between genuine IP samples and inputs, so if you include the input counts in the matrix supplied to the function, this assumption will also apply to the input (where it will obviously be false, as you do expect a systematic difference in methylation between IP and input samples).

In such cases, it would seem that the normalizeChIPtoInput function in edgeR might be a better choice for normalizing, well, your IP to input. You could also count reads across genomic bins (assuming that most of the genome is not methylated), compute normalization factors from those bins, and then apply those factors to the analysis with your per-site counts. This assumes that background enrichment should be consistent between samples; any differences in the background coverage correspond to bias and are removed. Check out the csaw user's guide for a demonstration.
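As an illustration of the first suggestion, here is a sketch of an IP-versus-input call on the question's table. This assumes edgeR's normalizeChIPtoInput with its default arguments, and, if I read its help page correctly, a result carrying a per-site p.value and a scaling.factor:

```r
library(edgeR)

# Table from the question: 'depth' is the input, 'depth1' one IP sample.
data <- read.table(text = "chr start end depth depth1
chr1 3273 3273 7 200
chr1 3274 3274 3 50
chr1 3275 3275 8 600
chr1 3276 3276 4 30
chr1 3277 3277 25 20", header = TRUE)

# Scale the input to the background of the IP sample and test each
# site for enrichment over that scaled background.
res <- normalizeChIPtoInput(input = data$depth, response = data$depth1)
res$scaling.factor  # how much the input was scaled
res$p.value         # per-site evidence of enrichment over input
```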

If all you want to do is to compute depth-adjusted coverage values, a straight-up application of the cpm function to the count matrix might be sufficient. This will convert your counts into count-per-million values that you can then compare informally between samples. Of course, you can also do this after computing normalization factors, provided you pick an appropriate normalization method. Note that if you want to do a more rigorous differential analysis, you're better off working off the raw counts with a package like edgeR.


Thank you very much. I normalized with cpm from the edgeR package.
Example:


data <- "chr start end depth depth1 depth2 depth3 depth4 depth5 depth6
chr1 3273 3273 7 200 35 1 200 850 0
chr1 3274 3274 3 50 25 5 300 1500 2
chr1 3275 3275 8 600 15 8 100 300 5
chr1 3276 3276 4 30 2 10 59 20 0
chr1 3277 3277 25 20 7 4 600 45 0"
data <- read.table(text=data, header=T)
datamatrix <- data[, 4:10]
library(edgeR)

# grouping factor
group <- c(1, 2, 2, 2, 2, 2, 2)  # group of each sample

# create a DGEList
y <- DGEList(counts=datamatrix, group=group)

#########
$counts
  depth depth1 depth2 depth3 depth4 depth5 depth6
1     7    200     35      1    200    850      0
2     3     50     25      5    300   1500      2
3     8    600     15      8    100    300      5
4     4     30      2     10     59     20      0
5    25     20      7      4    600     45      0

$samples
       group lib.size norm.factors
depth      1       47            1
depth1     2      900            1
depth2     2       84            1
depth3     2       28            1
depth4     2     1259            1
depth5     2     2715            1
depth6     2        7            1
##################
#normalize
y <- calcNormFactors(y)

########
> y
An object of class "DGEList"
$counts
  depth depth1 depth2 depth3 depth4 depth5 depth6
1     7    200     35      1    200    850      0
2     3     50     25      5    300   1500      2
3     8    600     15      8    100    300      5
4     4     30      2     10     59     20      0
5    25     20      7      4    600     45      0

$samples
       group lib.size norm.factors
depth      1       47    0.7423735
depth1     2      900    0.5526927
depth2     2       84    0.9534847
depth3     2       28    0.8652676
depth4     2     1259    0.6588115
depth5     2     2715    1.0358307
depth6     2        7    4.3289213
##########################################

> cpm(y)
depth     depth1    depth2    depth3    depth4     depth5    depth6
1 200621.61  402071.90 436993.56  41275.42 241125.49 302245.841      0.00
2  85980.69  100517.97 312138.26 206377.10 361688.24 533375.014  66001.27
3 229281.84 1206215.69 187282.96 330203.36 120562.75 106675.003 165003.16
4 114640.92   60310.78  24971.06 412754.21  71132.02   7111.667      0.00
5 716505.76   40207.19  87398.71 165101.68 723376.48  16001.250      0.00


Normalized! Even after normalization, three samples have many values equal to zero because of their low coverage; I think I'll have to remove them from the analysis. I thought about running a PCA to see how these samples group. I would appreciate feedback on the normalization method used, and on the second part of my problem.
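As a quick, dependency-free check of the PCA idea, here is a sketch on log-CPM values. The per-million scaling is computed by hand instead of with edgeR's cpm(), and the pseudo-count of 1 is an arbitrary choice to tame the zeros:

```r
# counts: the depth1-depth6 columns from the example table
counts <- matrix(c(200,  35,  1, 200,  850, 0,
                    50,  25,  5, 300, 1500, 2,
                   600,  15,  8, 100,  300, 5,
                    30,   2, 10,  59,   20, 0,
                    20,   7,  4, 600,   45, 0),
                 nrow = 5, byrow = TRUE,
                 dimnames = list(NULL, paste0("depth", 1:6)))

# log2 counts-per-million with a pseudo-count of 1
logcpm <- log2(t(t(counts) / colSums(counts)) * 1e6 + 1)

# PCA with samples as observations; low-coverage samples should
# stand apart on the first components
pca <- prcomp(t(logcpm))
round(pca$x[, 1:2], 2)
```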