Question

Extracting origin info from DB sites

0

Entering edit mode

michal.inbar • 0

@michalinbar-15043

Last seen 6.2 years ago

Hi DiffBind gurus,

I have 4 samples of 4 different tissues.
2 of them treatment, 2 control, here are the main cols of the csv:

SampleID     Tissue   Condition   Replicate

B1_S_ME      MEF_S    control     1

B4_5F_ME     GETMS    treatment   1          

B7_4F_ME     GTS      treatment   2           

MEF_ME_SRR   MEF      control     2

I contrasted control vs. treatment, and have the list of DB sites (I got all the way to DB report).
I'm puzzled with the last steps required for my analysis, and would appreciate your help.

For each site in DB sites, I want to identify the TISSUE it originated from.
I'm specifically interested to get the list of sites originating from TISSUE=GETMS.

below are the main steps and results:

peaks_min0 <- dba(sampleSheet="yossi_metadata_ME.csv", minOverlap=0)

peakset_cons_reps <- dba.peakset(peaks_min0, consensus = DBA_CONDITION, minOverlap=2)

peakset_cons_reps_restricted <- dba(peakset_cons_reps, mask=peakset_cons_reps$masks$Consensus, minOverlap=0)

#Now retrieve the consensus of these: cons_reps_data <- dba.peakset(peakset_cons_reps_restricted, bRetrieve=TRUE)

dba_cons_reps_only <- dba.count(peaks_min0, peaks=cons_reps_data, score=DBA_SCORE_TMM_READS_FULL)

cont_ME <- dba.contrast(dba_cons_reps_only, group1 = dba_cons_reps_only$masks$control, group2 = dba_cons_reps_only$masks$treatment, name1="ME overlapped control", name2="ME overlapped treatment", minMembers=2)

cont_ME <- dba.analyze(cont_ME, bSubControl=FALSE)

> cont_ME 4 Samples, 83071 sites in matrix: ID Tissue Condition Replicate Caller Intervals FRiP 1 B1_S_ME MEF_S control 1 counts 83071 0.47 2 B4_5F_ME GETMS treatment 1 counts 83071 0.56 3 B7_4F_ME GTSM treatment 2 counts 83071 0.58 4 MEF_ME_SRR438553 MEF control 2 counts 83071 0.52

1 Contrast: Group1 Members1 Group2 Members2 DB.DESeq2 1 ME overlapped control 2 ME overlapped treatment 2 14347

cont_ME_report <- dba.report(cont_ME)

> cont_ME_report_DB GRanges object with 14347 ranges and 6 metadata columns: seqnames ranges strand | Conc Conc_ME overlapped control <Rle> <IRanges> <Rle> | <numeric> <numeric> 43757 chr2 [ 48790404, 48791870] * | 7.42 3.49 68298 chr7 [ 13598533, 13600990] * | 7.32 3.81 55509 chr4 [ 70932502, 70934825] * | 7.22 3.18 9948 chr10 [114659494, 114662104] * | 7.28 4.25 80420 chr9 [108173249, 108175844] * | 7.37 4.31 ... ... ... ... . ... ... 35342 chr17 [ 56030657, 56033095] * | 8.24 7.83 45447 chr2 [ 93998537, 93999161] * | 7.78 7.31 78184 chr9 [ 54797722, 54800539] * | 8.9 8.59 55082 chr4 [ 58371584, 58372346] * | 4.3 5.02 10471 chr10 [126936260, 126938513] * | 5.54 6.14 Conc_ME overlapped treatment Fold p-value FDR <numeric> <numeric> <numeric> <numeric> 43757 8.37 -4.88 1.83e-23 1.52e-18 68298 8.25 -4.44 9.29e-22 3.86e-17 55509 8.18 -5 2.6e-21 7.19e-17 9948 8.18 -3.93 5.46e-21 1.13e-16 80420 8.28 -3.97 1.78e-20 2.75e-16 ... ... ... ... ... 35342 8.57 -0.74 0.00862 0.0499 45447 8.14 -0.83 0.00862 0.0499 78184 9.16 -0.57 0.00862 0.0499 55082 2.79 2.24 0.00863 0.05 10471 4.49 1.65 0.00863 0.05

Many thanks!

Michal

DiffBind • 826 views

ADD COMMENT • link updated 6.2 years ago by Rory Stark ★ 5.1k • written 6.2 years ago by michal.inbar • 0

score 1 · Answer 1 · 2018-02-18

The usual way to figure out which samples "originated" peaks in the consensus set (more specifically, samples with peaks called that overlapped with consensus peaks) is to use the bCalled and bCalledDetail parameters in the final call to dba.report(). However when you pass in a consensus peakset explicitly to dba.count(), the information of how the consensus was formed is lost, so that option is unavailable.

You can use GRanges functions to compare the peaks returned by dba.report() (cont_ME_report) with the original set of peaks. You can get the original peakset of interest as follows:

getms_peaks <- dba.peakset(peaks_min0,
                           peaks=2,
                           bRetrieve=TRUE)

Then you can intersect/overlap cont_ME_report and getms_peaks; see eg. findOverlaps.

score 1 · Answer 2 · 2018-02-20

As you see, the two methods give different results as shown by the different numbers of consensus and differentially-bound sites. In the first condition, you include sites that are identified as peaks in at least two samples of the four total (2/4), so you can have sites that identified as peaks in one replicate of each group (but not necessarily in both replicates of a group). In the second method, you include peaks identified in both samples of each group (2/2). The former is a more lenient definition and results in more consensus peaks, but after analysis results in fewer differentially bound peaks than the more strict second method.

It may seem odd that more consensus peaks results in fewer DB sites, especially since all the sites in the second set should be included in the first set, and hence could be identified as being DB. However the modelling is changes. The inclusion of more lenient peaks will increase the variance present in the dataset, and hence lessen the power to confidently identify lower effect sizes (ie the fold-change threshold increases). The multiple testing correction calculations are altered as well.

score 0 · Answer 3 · 2018-02-20

Hi again, and thanks for your answer.

I'd like to understand the difference between the two following ways for obtaining DB sites. they give me different results.

what I'm trying to do is:

1. within each group, use only sites common to both samples (hence minOverlap=2)

2. contrast groups without additional constraint

3. if possible, keep bCalledDetail info in results

In each way I get a different number of sites and DESeq results.

I'd like to use the first way, which retains bCalledDetail info. but I wander if it achieves points 1 and 2 above.

Thank you!

1)

# 94908 sites peaksMin0_counts <- dba.count(peaks_min0, minOverlap=2, score=DBA_SCORE_TMM_READS_FULL)

#94908 sites cont_ME_2 <- dba.contrast(peaksMin0_counts, categories=DBA_CONDITION, minMembers = 2)

cont_ME_2

4 Samples, 94908 sites in matrix: ID Tissue Condition Replicate Caller Intervals FRiP 1 B1_S_ME MEF_S control 1 counts 94908 0.50 2 B4_5F_ME GETMS treatment 1 counts 94908 0.58 3 B7_4F_ME GTSM treatment 2 counts 94908 0.60 4 MEF_ME_SRR438553 MEF control 2 counts 94908 0.54

1 Contrast: Group1 Members1 Group2 Members2 1 control 2 treatment 2

# DESeq 11078 cont_ME_2 <- dba.analyze(cont_ME_2)

4 Samples, 94908 sites in matrix: ID Tissue Condition Replicate Caller Intervals FRiP 1 B1_S_ME MEF_S control 1 counts 94908 0.50 2 B4_5F_ME GETMS treatment 1 counts 94908 0.58 3 B7_4F_ME GTSM treatment 2 counts 94908 0.60 4 MEF_ME_SRR438553 MEF control 2 counts 94908 0.54

1 Contrast: Group1 Members1 Group2 Members2 DB.DESeq2 1 control 2 treatment 2 11063

second way in next message:

score 0 · Answer 4 · 2018-02-20

2)

# 6 Samples, 94908 sites

peakset_cons_reps <- dba.peakset(peaks_min0, consensus = DBA_CONDITION, minOverlap=2)

# 2 Samples, 83071 sites peakset_cons_reps_restricted <- dba(peakset_cons_reps, mask=peakset_cons_reps$masks$Consensus, minOverlap=1)

#Now retrieve the consensus of these: cons_reps_data <- dba.peakset(peakset_cons_reps_restricted, bRetrieve=TRUE)

#4 Samples, 83071 sites dba_cons_reps_only <- dba.count(peaks_min0, peaks=cons_reps_data, score=DBA_SCORE_TMM_READS_FULL)

# 4 Samples, 83071 sites cont_ME <- dba.contrast(dba_cons_reps_only, group1 = dba_cons_reps_only$masks$control, group2 = dba_cons_reps_only$masks$treatment, name1="ME overlapped control", name2="ME overlapped treatment", minMembers=2) cont_ME # DESeq 14347 cont_ME <- dba.analyze(cont_ME, bSubControl=FALSE)