Extracting origin info from DB sites
4
0
Entering edit mode
@michalinbar-15043
Last seen 6.2 years ago

Hi DiffBind gurus,

I have 4 samples of 4 different tissues.
2 of them treatment, 2 control, here are the main cols of the csv: 

SampleID     Tissue   Condition   Replicate

B1_S_ME      MEF_S    control     1

B4_5F_ME     GETMS    treatment   1          

B7_4F_ME     GTS      treatment   2           

MEF_ME_SRR   MEF      control     2           

 

I contrasted control vs. treatment, and have the list of DB sites (I got all the way to DB report).
I'm puzzled with the last steps required for my analysis, and would appreciate your help.

For each site in DB sites, I want to identify the TISSUE it originated from.
I'm specifically interested to get the list of sites originating from TISSUE=GETMS.

below are the main steps and results:

 

peaks_min0 <- dba(sampleSheet="yossi_metadata_ME.csv", minOverlap=0)

peakset_cons_reps <- dba.peakset(peaks_min0, consensus = DBA_CONDITION, minOverlap=2)

peakset_cons_reps_restricted <-  dba(peakset_cons_reps, mask=peakset_cons_reps$masks$Consensus,
                                     minOverlap=0)

#Now retrieve the consensus of these:
cons_reps_data <- dba.peakset(peakset_cons_reps_restricted, bRetrieve=TRUE)

dba_cons_reps_only <- dba.count(peaks_min0, peaks=cons_reps_data, score=DBA_SCORE_TMM_READS_FULL)

cont_ME <- dba.contrast(dba_cons_reps_only, 
                        group1 = dba_cons_reps_only$masks$control, 
                        group2 = dba_cons_reps_only$masks$treatment,
                        name1="ME overlapped control", name2="ME overlapped treatment", 
                        minMembers=2)

 

cont_ME <- dba.analyze(cont_ME, bSubControl=FALSE)


> cont_ME
4 Samples, 83071 sites in matrix:
                ID Tissue Condition Replicate Caller Intervals FRiP
1          B1_S_ME  MEF_S   control         1 counts     83071 0.47
2         B4_5F_ME  GETMS treatment         1 counts     83071 0.56
3         B7_4F_ME   GTSM treatment         2 counts     83071 0.58
4 MEF_ME_SRR438553    MEF   control         2 counts     83071 0.52

1 Contrast:
                 Group1 Members1                  Group2 Members2 DB.DESeq2
1 ME overlapped control        2 ME overlapped treatment        2     14347


cont_ME_report <- dba.report(cont_ME)

> cont_ME_report_DB
GRanges object with 14347 ranges and 6 metadata columns:
        seqnames                 ranges strand |      Conc Conc_ME overlapped control
           <Rle>              <IRanges>  <Rle> | <numeric>                  <numeric>
  43757     chr2 [ 48790404,  48791870]      * |      7.42                       3.49
  68298     chr7 [ 13598533,  13600990]      * |      7.32                       3.81
  55509     chr4 [ 70932502,  70934825]      * |      7.22                       3.18
   9948    chr10 [114659494, 114662104]      * |      7.28                       4.25
  80420     chr9 [108173249, 108175844]      * |      7.37                       4.31
    ...      ...                    ...    ... .       ...                        ...
  35342    chr17 [ 56030657,  56033095]      * |      8.24                       7.83
  45447     chr2 [ 93998537,  93999161]      * |      7.78                       7.31
  78184     chr9 [ 54797722,  54800539]      * |       8.9                       8.59
  55082     chr4 [ 58371584,  58372346]      * |       4.3                       5.02
  10471    chr10 [126936260, 126938513]      * |      5.54                       6.14
        Conc_ME overlapped treatment      Fold   p-value       FDR
                           <numeric> <numeric> <numeric> <numeric>
  43757                         8.37     -4.88  1.83e-23  1.52e-18
  68298                         8.25     -4.44  9.29e-22  3.86e-17
  55509                         8.18        -5   2.6e-21  7.19e-17
   9948                         8.18     -3.93  5.46e-21  1.13e-16
  80420                         8.28     -3.97  1.78e-20  2.75e-16
    ...                          ...       ...       ...       ...
  35342                         8.57     -0.74   0.00862    0.0499
  45447                         8.14     -0.83   0.00862    0.0499
  78184                         9.16     -0.57   0.00862    0.0499
  55082                         2.79      2.24   0.00863      0.05
  10471                         4.49      1.65   0.00863      0.05

 

Many thanks!

Michal

 

 

 

 

DiffBind • 826 views
ADD COMMENT
1
Entering edit mode
Rory Stark ★ 5.1k
@rory-stark-5741
Last seen 7 days ago
Cambridge, UK

The usual way to figure out which samples "originated" peaks in the consensus set (more specifically, samples with peaks called that overlapped with consensus peaks) is to use the bCalled and bCalledDetail parameters in the final call to dba.report(). However when you pass in a consensus peakset explicitly to dba.count(), the information of how the consensus was formed is lost, so that option is unavailable.

You can use GRanges functions to compare the peaks returned by dba.report() (cont_ME_report) with the original set of peaks. You can get the original peakset of interest as follows:

getms_peaks <- dba.peakset(peaks_min0,
                           peaks=2,
                           bRetrieve=TRUE)

Then you can intersect/overlap cont_ME_report and getms_peaks; see eg. findOverlaps.

ADD COMMENT
1
Entering edit mode
Rory Stark ★ 5.1k
@rory-stark-5741
Last seen 7 days ago
Cambridge, UK

As you see, the two methods give different results as shown by the different numbers of consensus and differentially-bound sites. In the first condition, you include sites that are identified as peaks in at least two samples of the four total (2/4), so you can have sites that identified as peaks in one replicate of each group (but not necessarily in both replicates of a group). In the second method, you include peaks identified in both samples of each group (2/2). The former is a more lenient definition and results in more consensus peaks, but after analysis results in fewer differentially bound peaks than the more strict second method. 

It may seem odd that more consensus peaks results in fewer DB sites, especially since all the sites in the second set should be included in the first set, and hence could be identified as being DB. However the modelling is changes. The inclusion of more lenient peaks will increase the variance present in the dataset, and hence lessen the power to confidently identify lower effect sizes (ie the fold-change threshold increases). The multiple testing correction calculations are altered as well.

ADD COMMENT
0
Entering edit mode
@michalinbar-15043
Last seen 6.2 years ago

Hi again, and thanks for your answer.

I'd like to understand the difference between the two following ways for obtaining DB sites. they give me different results.

what I'm trying to do is:

1. within each group, use only sites common to both samples  (hence minOverlap=2)

2. contrast groups without additional constraint

3. if possible, keep bCalledDetail info in results

 

In each way I get a different number of sites and DESeq results.

I'd like to use the first way, which retains bCalledDetail info. but I wander if it achieves points 1 and 2 above.

Thank you!

1)

# 94908 sites
peaksMin0_counts <- dba.count(peaks_min0, minOverlap=2, score=DBA_SCORE_TMM_READS_FULL)


#94908 sites
cont_ME_2 <- dba.contrast(peaksMin0_counts, 
                                   categories=DBA_CONDITION, 
                                   minMembers = 2)

cont_ME_2 

4 Samples, 94908 sites in matrix:
                ID Tissue Condition Replicate Caller Intervals FRiP
1          B1_S_ME  MEF_S   control         1 counts     94908 0.50
2         B4_5F_ME  GETMS treatment         1 counts     94908 0.58
3         B7_4F_ME   GTSM treatment         2 counts     94908 0.60
4 MEF_ME_SRR438553    MEF   control         2 counts     94908 0.54

1 Contrast:
   Group1 Members1    Group2 Members2
1 control        2 treatment        2

 

# DESeq 11078
cont_ME_2 <- dba.analyze(cont_ME_2)

4 Samples, 94908 sites in matrix:
                ID Tissue Condition Replicate Caller Intervals FRiP
1          B1_S_ME  MEF_S   control         1 counts     94908 0.50
2         B4_5F_ME  GETMS treatment         1 counts     94908 0.58
3         B7_4F_ME   GTSM treatment         2 counts     94908 0.60
4 MEF_ME_SRR438553    MEF   control         2 counts     94908 0.54

1 Contrast:
   Group1 Members1    Group2 Members2 DB.DESeq2
1 control        2 treatment        2     11063

 

second way in next message:

ADD COMMENT
0
Entering edit mode
@michalinbar-15043
Last seen 6.2 years ago

2)

# 6 Samples, 94908 sites

peakset_cons_reps <- dba.peakset(peaks_min0, 
                                 consensus = DBA_CONDITION, 
                                 minOverlap=2)

# 2 Samples, 83071 sites
peakset_cons_reps_restricted <- dba(peakset_cons_reps, 
                                     mask=peakset_cons_reps$masks$Consensus,
                                     minOverlap=1)

#Now retrieve the consensus of these:
cons_reps_data <- dba.peakset(peakset_cons_reps_restricted, bRetrieve=TRUE)


#4 Samples, 83071 sites
dba_cons_reps_only <- dba.count(peaks_min0, peaks=cons_reps_data, score=DBA_SCORE_TMM_READS_FULL)

# 4 Samples, 83071 sites 
cont_ME <- dba.contrast(dba_cons_reps_only, 
                        group1 = dba_cons_reps_only$masks$control, 
                        group2 = dba_cons_reps_only$masks$treatment,
                        name1="ME overlapped control", name2="ME overlapped treatment", 
                        minMembers=2)
cont_ME
# DESeq 14347
cont_ME <- dba.analyze(cont_ME, bSubControl=FALSE)

ADD COMMENT

Login before adding your answer.

Traffic: 845 users visited in the last hour
Help About
FAQ
Access RSS
API
Stats

Use of this site constitutes acceptance of our User Agreement and Privacy Policy.

Powered by the version 2.3.6