Question

dba.count with pre-defined regions

0

Entering edit mode

sylvia ▴ 10

@sylvia-5630

Last seen 8.0 years ago

Hi,
I'm trying to use dba.contrast and dba.analyze to find differential binding sites within specific regions (ie promoter region ranging from 1000bp upstream of TSS and 500 bp downstream of TSS for a set of genes). I realized that dba.count can use pre-defined regions with the peaks argument so I did something like the following:

# create dba object

TF.peaks <- dba(sampleSheet = "experiments.csv",peakCaller = "macs",peakFormat = "bed",scoreCol = 5 );

# create a GRanges object with specific region

genes <- read.delim('gene_list.txt', as.is = TRUE);
grange <- makeGRangesFromDataFrame(genes);

promoter <- promoters(grange, upstream = 1000, downstream = 500);

# count the reads in the pre-defined region before running dba.contrast and dba.analyze

TF.counts <- dba.count(DBA = TF.peaks, peaks = promoter, score = DBA_SCORE_RPKM);

I got no error message, however, my TF.counts chromosome regions are beyond the range of some chromosome sizes. For example:
range(TF.counts$peaks[[1]]$Start[TF.counts$peaks[[1]]$Chr == 'chr22'])
[1] 22082728 102264308
I double checked my input TF.peaks and promoter which all have the correct chr22 size
range(TF.peaks$peaks[[1]]$V2[TF.peaks$peaks[[1]]$V1 == 'chr22'])
[1] 17639813 51222046
promoter.sub <- promoter[promoter@seqnames == 'chr22',]
range(promoter.sub@ranges@start)
[1] 17081777 51221591
Any idea what's causing the issue? Or is there a better way to look at a differential binding site for specific regions? (I'm using R v3.3.1 and DiffBind v2.2.12)
Many thanks,

Sylvia

dba.count dba chipseq diffbind • 2.8k views

ADD COMMENT • link updated 8.2 years ago by Rory Stark ★ 5.3k • written 8.2 years ago by sylvia ▴ 10

score 1 · Answer 1 · 2017-12-12

1

Entering edit mode

Rory Stark ★ 5.3k

@rory-stark-5741

Last seen 13 months ago

Cambridge, UK

If you can send me the intervals you are passing in as promoter, I can take a look at what is happening. If possible, it would be good to have a copy of the main DBA object TF.peaks -- if this is too large, you can send me a download link (Dropbox or similar).

Cheers-

Rory

rory.stark @ cruk.cam.ac.uk

P.S. It is best to tag DiffBind questions with the DiffBind tag as well...

ADD COMMENT • link 8.2 years ago Rory Stark ★ 5.3k

0

Entering edit mode

Thanks Rory! Just sent.

Best,

Sylvia

ADD REPLY • link 8.2 years ago sylvia ▴ 10

score 1 · Answer 2 · 2017-12-12

1

Entering edit mode

Rory Stark ★ 5.3k

@rory-stark-5741

Last seen 13 months ago

Cambridge, UK

I've had a look and identified a bug, which should be fixed by the end of the week in DiffBind 2.6.2.

I see you are using an older version, I can provide a workaround. You just need to pass in the intervals as a data.frame, with the chromosome names (first column) as character strings:

> promoter <- promoters(grange, upstream = 1000, downstream = 500)
> promoterDF <- data.frame(promoter)
> promoterDF[,1] <- as.character(promoterDF[,1])
> TF.counts <- dba.count(DBA = TF.peaks, peaks = promoterDF, score = DBA_SCORE_RPKM)

Cheers-

Rory

ADD COMMENT • link 8.2 years ago Rory Stark ★ 5.3k

0

Entering edit mode

Hi Rory,

Thanks for the quick reply, It works now! One more naive question, if I used a pre-defined region is there still a way to re-center each peak around the point of greatest enrichment, like the summit function? so I can use a shorter sequence for motif finding?

Best,

Sylvia

ADD REPLY • link 8.2 years ago sylvia ▴ 10

1

Entering edit mode

Yes, the summits parameter should work to re-center intervals around the point of highest pileup.

However I just tested this, and found another bug. I'll check in a fix for DiffBind 2.6.3.

I do have a workaround for you:

> TF.counts <- dba.count(TF.peaks, peaks = promoter, score = DBA_SCORE_RPKM, summits=TRUE)
> TF.counts$called <- matrix(1,nrow(TF.counts$merged), length(TF.counts$peaks))
> TF.counts <- dba.count(TF.peaks, peaks = NULL, summits=100) # re-centered peaks 201bp wide

Also, a suggestion when working with these intervals representing annotated genomic features (eg. promoter regions), you should probably use the filter parameter to remove promoters that don't get mapped reads, otherwise there is the potential for a lot of intervals with zero or very low counts. I'd suggest setting filter=5 in the first call to dba.count().

Cheers-

Rory

ADD REPLY • link 8.2 years ago Rory Stark ★ 5.3k

0

Entering edit mode

Hi Rory,

For the second TF.counts do you mean

TF.counts <- dba.count("TF.counts", peaks = NULL, summits=100)? or I should still use TF.peaks?

This is really helpful! Thanks a lot.

Best,

Sylvia

ADD REPLY • link 8.2 years ago sylvia ▴ 10

0

Entering edit mode

Hi Rory,

Sorry, and one more thing, is it normal that not all the peaks having 201bp w after using summits function?

Best,

Sylvia

ADD REPLY • link 8.2 years ago sylvia ▴ 10