Question

Detection of methylated regions from MeDIP data using csaw

0

Entering edit mode

nobishvarghese • 0

@nobishvarghese-10577

Last seen 7.5 years ago

I tried to use csaw to compare samples from MeDIP data. My dataset has 4 conditions namely group1, group2, group3 and group4 with one ChIP and Input sample for each condition (unfortunately, we have only a single replicate per condition for now; because of this, I used glmFit function with a dispersion value of 0.05 as suggested in the user manual). But then, after analyzing the data as outlined below, I have a few questions. It would be really helpful if someone could share their insight on this.

I‘ve seen that the authors suggested a window width of 10 for transcription factor experiments and 150 for histone modification data. Is there any ‘specified’ width threshold for MeDIP data?

Are there any specific guidelines for comparing ChIP samples to Input samples (detection of methylated regions) in csaw? As I would be looking for only regions enriched in ChIP sample compared to Input sample, I guess, the regions with more number of windows satisfying LogFC < -0.5 would represent noisy regions. I filtered out all windows that have log fold change less than zero (showing enrichment in input samples) before merging windows to regions using ‘mergeWindows’ function. Is this a valid approach?

max                   <- 650
width                 <- 150
filter                <- 20
pet.param             <- readParam(max.frag=max, pe="both",dedup=TRUE)
petBAMFile            <- c("group1_sorted.bam","group1I_sorted.bam","group2_sorted.bam","group2I_sorted.bam","group3_sorted.bam","group3I_sorted.bam","group4_sorted.bam","group4I_sorted.bam")
data                  <- windowCounts(petBAMFile, param=pet.param,width=width,filter=filter,ext=250)
binned                <- windowCounts(petBAMFile, bin=TRUE, width=10000, param=pet.param,ext=250)
normfacs              <- normalize(binned)
sampleinfo            <- data.frame(c("group1","group1I","group2","group2I","group3","group3I","group4","group4I"),c("group1","group1I","group2","group2I","group3","group3I","group4","group4I"))
names(sampleinfo)     <- c("sampleID","group")
myfactor              <- factor(sampleinfo$group)
y                     <- asDGEList(data,norm.factors=normfacs,group=myfactor)
design                <- model.matrix(~0+group, data=y$samples)
fit.norep             <- glmFit(y, design, dispersion=0.05)

For comparing ChIP to input sample, I used only windows with a positive fold change (ChIP versus Input) for merging into regions as below.

results_group1 <- glmLRT(fit.norep, contrast=c(1,-1,0,0,0,0,0,0)) up_group1 <- results_group1$table$logFC > 0 results_up_group1 <- results_group1$table[up_group1,] data_group1 <- data[up_group1] merged_group1 <- mergeWindows(rowRanges(data_group1), tol=1000L) tabcom_group1 <- combineTests(merged_group1$id, results_up_group1) anno_group1 <- detailRanges(merged_group1$region, txdb=TxDb.Hsapiens.UCSC.hg19.knownGene,orgdb=org.Hs.eg.db, promoter=c(3000, 1000), dist=10000) write.table(data.frame(as.data.frame(merged_group1$region)[,1:3], tabcom_group1, anno_group1),file="group1.regions", row.names=FALSE, quote=FALSE, sep="\t")

For comparison across conditions, windows were not filtered based on fold change.

results_group1_group3 <- glmLRT(fit.norep, contrast=c(1,0,0,0,-1,0,0,0)) merged <- mergeWindows(rowRanges(data), tol=1000L) tabcom_group1_group3 <- combineTests(merged$id, results_group1_group3$table) anno <- detailRanges(merged$region, txdb=TxDb.Hsapiens.UCSC.hg19.knownGene,orgdb=org.Hs.eg.db, promoter=c(3000, 1000), dist=10000) write.table(data.frame(as.data.frame(merged$region)[,1:3], tabcom_group1_group3, anno),file="group1_group3.regions", row.names=FALSE, quote=FALSE, sep="\t")

> sessionInfo() R version 3.2.4 (2016-03-10) Platform: x86_64-apple-darwin13.4.0 (64-bit) Running under: OS X 10.9.5 (Mavericks)

locale: [1] en_US.UTF-8/en_US.UTF-8/en_US.UTF-8/C/en_US.UTF-8/en_US.UTF-8

attached base packages: [1] stats4 parallel stats graphics grDevices utils datasets [8] methods base

other attached packages: [1] TxDb.Hsapiens.UCSC.hg19.knownGene_3.2.2 [2] GenomicFeatures_1.22.13 [3] org.Hs.eg.db_3.2.3 [4] RSQLite_1.0.0 [5] DBI_0.3.1 [6] AnnotationDbi_1.32.3 [7] edgeR_3.12.0 [8] limma_3.26.9 [9] csaw_1.4.1 [10] SummarizedExperiment_1.0.2 [11] Biobase_2.30.0 [12] GenomicRanges_1.22.4 [13] GenomeInfoDb_1.6.3 [14] IRanges_2.4.8 [15] S4Vectors_0.8.11 [16] BiocGenerics_0.16.1

loaded via a namespace (and not attached): [1] XVector_0.10.0 zlibbioc_1.16.0 GenomicAlignments_1.6.3 [4] BiocParallel_1.4.3 tools_3.2.4 lambda.r_1.1.7 [7] futile.logger_1.4.1 rtracklayer_1.30.4 futile.options_1.0.0 [10] bitops_1.0-6 biomaRt_2.26.1 RCurl_1.95-4.8 [13] Rsamtools_1.22.0 Biostrings_2.38.4 XML_3.98-1.4

csaw medip-chip • 886 views

ADD COMMENT • link updated 8.0 years ago by Aaron Lun ★ 28k • written 8.0 years ago by nobishvarghese • 0

score 3 · Accepted Answer · 2016-05-02

I haven't worked much with MeDIP data, but here's my thoughts:

The window size can be interpreted as the width of contact between the protein and DNA (for ChIP-seq). In your case, the width of contact between a single methylated cytosine and the DNA would be just 1 bp (plus a bit from the antibody during IP). So, it would seem reasonable to use a 10 bp window to pick up methylation at a single cytosine. That said, if you want more power to detect consistent changes in methylation across a larger region (e.g., at CpG islands/shores), then you should use larger windows. This will increase counts, at the cost of reduced resolution to detect single-base changes. For example, the default window size in MeDIPS is 100 bp, which seems sensible enough. I'd be inclined to try out a range of sizes from 10 to 1000 bp, and see what kind of results you get. You can consolidate results from all window sizes with consolidateSizes at the end of all your analyses.
Don't filter on the magnitude of the DB log-fold change. This is not an independent filter and will compromise the validity of the multiple testing correction. The same applies, oddly enough, to the sign of the log-fold change. This is because of correlations between windows, such that adjacent windows with the same sign are more likely to have low p-values and strong apparent DB (otherwise the direction of change wouldn't be likely to persist across windows). Filtering to select windows with the same sign will cause the clustering to be dependent on the DB p-value, which is not desirable. For example, if you were to select windows with only positive ChIP/input log-fold changes, you'd end up with some conservativeness; windows with consistent signs and low p-values would generally form fewer, larger clusters, while windows with large p-values would form many clusters.

I would be inclined to cluster all windows first without filtering on the log-fold change, and then worry about the sign of the windows in each cluster afterwards. If your experiment worked, then the number of windows with the "wrong" sign of the log-fold change should be minimal (< 5%, depending on the FDR). If your experiment didn't work... well, filtering out the offending windows is like sweeping the problem under the carpet. If you have many negative log-fold changes, how reliable would your positive log-fold changes be?

Finally, beware of read stacks. These are comprised of PCR duplicates and can contribute to spurious signal in both ChIP and input tracks. Normally, replicates protect against this because it's rather unlikely to get PCR duplication at the same location in two replicates; however, without replicates, you don't get this protection so you can get a lot of stacks when your data isn't of high quality.

P.S. Check out this related post at A: csaw - workflow to incorporate input/control samples?