Using DiffBind blacklists and greylists with results mapped to BDGP6
2
0
Entering edit mode
Sam ▴ 10
@sam-21502
Last seen 4 days ago
Jerusalem

I have mapped my data to drosophila genome BDGP6.28. Hence, the bam files and peaks' chromosome names are 2L, 2R, 211000022278421, 211000022278482, et cetera. When I try to filter blacklists with dba.blacklist function

dba.blacklist(dbObj_initial, blacklist = DBA_BLACKLIST_DM6, greylist=FALSE)


I am informed that no blacklists are found, even though I have looked at the peaks, and saw that some do fall at blacklists' area from here.

More importantly, when I try to use greylists

dba.blacklist(dbObj_initial, blacklist = FALSE, greylist=DBA_BLACKLIST_DM6)

Error in value[[3L]](cond) : GreyListChIP error: Error in tileGenome(obj@karyotype, tilewidth = tileSize/2): 'seqlengths' contains NAs or negative values


I think both issues stem from the difference in chromosome names - in my peaks (and in my bam files) the chromosome names do not have "chr", and I guess that in the chromosome names DiffBind uses they do.

Is it possible to circumvent it and what would it take?

I can clean the blacklisted peaks by myself, using bedtools; but when it comes to greylists, I would be glad to enjoy the automatic use of greylists in DiffBind.

Is there any solution for it (perhaps using the diffloop package)?

DiffBind • 1.8k views
2
Entering edit mode
Rory Stark ★ 4.4k
@rory-stark-5741
Last seen 21 hours ago
CRUK, Cambridge, UK

OK, I see how to get the Seqinfo karyotype object for your genome:

BDGP6.28.ktype <- Seqinfo(genome="GCF_000001215.4")
dba.blacklist(dbObj_initial, blacklist = FALSE, greylist=BDGP6.28.ktype)


Let me know if this works!

0
Entering edit mode

It worked!

For some reason, getting the karyotype sizes via Seqinfo did not work for me with this specific genome, so I have downloaded a file which lists the sizes for each chromosome, and created a seqinfo object (using just the chromosome and chromosome length columns). I have passed the seqinfo object as the greylist parameter, and set the pvalue of the greylist to 0.99 (dbObj_greylist$config$greylist.pval), as in the GreyListChIP package (instead of the default 0.999 in DiffBind).

Then I have checked the greylist regions via igv, and indeed, they show a very large number of reads in the control samples. Since the number of reads is large in the greylisted regions (hundreds of reads), I think that a pvalue of 0.99 is not very permissive.

1
Entering edit mode

Glad to hear this worked. I'm going to document the ability to pass a Seqinfo object, and how to obtain one for a reference genome that are not explicitly supported in a BSgenome object.

In general I try to be conservative with the defaults, in the sense of altering the raw experimental data as little as possible, which is why the default pvalue is set at 0.999 instead of 0.99. In your case I think you are doing the right thing in restoring it to 0.99.

0
Entering edit mode

Is there a way to assess whether the GreyListChIP package did a good enough job? In my case, with the drosophila genome, and p-value of 0.99, the master greylist (of 3 control samples) had only 11 areas declared as grey . The coverage of the greylists was 0.03% for each of the samples only. Does that sound as "too little" to you?

0
Entering edit mode

Good question. I haven't run GreyListChIP on Drosophila, personally. However, it's a pretty compact genome, so it's plausible enough that there are relatively few "grey" regions. I'll give it a try on Monday and see what I get.

0
Entering edit mode
• Can the greylist and blacklist be applied after dba.count ?

• After using the same seqinfo object on another computer and after some time, now I get the following output :

Counting control reads for greylist...

Building greylist: 456.bam

coverage: 35483 bp (0.03%)

Master greylist: 11 ranges, 35483 bases

The 2 combined objects have no sequence levels in common. (Use
suppressWarnings() to suppress this warning.)

Removed: 25 of 26478 intervals.

Removed: 3 merged (of 9914) and 2 (of 5106) consensus.

11 Samples, 5104 sites in matrix (9911 total)


I don't understand what the "2 combined objects have no sequence levels in common" message means. (It seems that it did manage to remove some peaks base on the greylist.) How do I double-check that all is fine?

• When I look at the correlation values between samples

r_vals <-dba.plotHeatmap(dbObj)

the values are exactly the same whether I use the values before normalization (dba.normalize) or after it. I think that normalization should change the values. Does the heatmap show the values without normalization being applied? If so, how can normalized values be used to plot a correlation heatmap?

0
Entering edit mode

Yes, blacklists and greylists can be applied after dba.count().

Looking at previously discussed issue #101694, it would appear that there is some disagreement between the reference genome used for the bam file and the supplied Seqinfo, is that possible?

By default, the scores used in dba.plotHeatmap() are normalized. If you haven't called dba.normalize(), then the default normalization is applied (as if dba.normalize() has been called with all default parameters). You can see what the heatmap looks like using non-normalized values by changing the score. You can change the score without re-counting by calling dba.count() with peaks=NULL and setting the score parameter, eg. to score=DBA_SCORE_READS (raw read counts) or score=DBA_SCORE_READS_MINUS (raw reads minus control reads, minimum value minCount). You can change them back to normalized values using score=DBA_SCORE_NORMALIZED which will respect whatever you set up using dba.normalize().

0
Entering edit mode
• First, greylist appears to give different greylist lists depending on the order (whether before/after dba.count ). I would like to ask - can I assume that removing greylists before dba.count is the preferred (usual) order ?

Regarding the "2 combined objects have no sequence levels in common" message : I have changed the Seqinfo object to be read via seq_ktype <- Seqinfo(genome="GCF_000001215.4"), but the error message persists

Counting control reads for greylist...
Building greylist: ../../aligned_pe/sorted/noDup/456.sorted.noDup.bam
coverage: 26147 bp (0.02%)
Master greylist: 9 ranges, 26147 bases
The 2 combined objects have no sequence levels in common. (Use
suppressWarnings() to suppress this warning.)Removed: 2 of 5107 intervals.
Removed 2 (of 5107) consensus peaks.
11 Samples, 5105 sites in matrix (5218 total):

[1] "X"  "2L" "2R" "3L" "3R" "4"
unique(dbObj_b_greylist$peaks[[1]]$seqnames)
[1] "211000022278100"            "211000022278421"            "211000022278482"            "211000022278626"            "211000022278632"            "211000022278808"            "211000022278878"
[8] "211000022279056"            "211000022279078"            "211000022279215"            "211000022279810"            "211000022279945"            "211000022280498"            "211000022280519"
[15] "211000022280599"            "2L"                         "2R"                         "3L"                         "3R"                         "4"                          "rDNA"
[22] "Unmapped_Scaffold_37_D1608" "X"                          "Y_mapped_Scaffold_18_D1698"

sum(!  dbObj_b_greylist$peaks[[1]]$seqnames %in% seqlevels(seq_ktype) )
[1] 0

1
Entering edit mode

It is possible for the Greylist to be calculated differently depending on whether you do it before or after counting, if there are chromosomes that have no peaks remaining after a consensus is taken. So, for example, if none of the peaks initially identified on chromosome 211000022278451 were included in the consensus, then the Greylist calculation would include reads on chromosome 211000022278451 in its model in the "before" case but not in the "after" case.

I've had a look and now I don't think that the warning is actually a problem, but I'm not 100% sure. To track it down further I'd need access to copies of your object before and after the greylist has been applied.

0
Entering edit mode

It is possible for the Greylist to be calculated differently depending on whether you do it before or after counting, if there are chromosomes that have no peaks remaining after a consensus is taken. So, for example, if none of the peaks initially identified on chromosome 211000022278451 were included in the consensus, then the Greylist calculation would include reads on chromosome 211000022278451 in its model in the "before" case but not in the "after" case.

But the bottom line should be the same, shouldn't it? E.g. chromosome 211000022278451 would not be included after counting after both greylists and counting (no matter the order). In my case, the correlation heatmap gives different results depending on the order.

1
Entering edit mode

In the pre-counting case, chromosome 211000022278451 is incorporated into the greylist model of what constitutes anomalous enrichment, while in the post-consensus case, it will not be included in the model. So the cut-offs for anomalies will be different.

0
Entering edit mode

The warning appears only in case that I apply a blacklist first, and a greylist after it. So it's the combination of grey+blacklist which generates this warning. If you would like to look into it, can I send the objects to you by email ?

1
Entering edit mode

Yes if you send me links to the objects (with all peaks, blacklisted/greylisted before counting, blacklisted/greylisted after counting), I can get to the bottom of this.

0
Entering edit mode

Hi, I have sent the links to rory DOT stark AT cruk.cam.ac.uk

1
Entering edit mode

I've checked in a fix that should handle this warning condition, it is active in the currently downloadable version if you do a BiocManager::install("DiffBind"). The fix should be there from version DiffBind_3.4.4 onwards.

0
Entering edit mode

So I understand that the warning was actually of no importance and could be ignored ?

0
Entering edit mode

That is correct.

1
Entering edit mode
Rory Stark ★ 4.4k
@rory-stark-5741
Last seen 21 hours ago
CRUK, Cambridge, UK

There is only direct support in DiffBind for reference genomes that have a BSgenome entry; for drosophila, these are the UCSC versions with "chr" at the beginning of the chromosome names. I see you used the Ensembl version BDGP6.28.

Just to make sure this is the problem, you can try setting blacklist=TRUE and greylist=TRUE (instead of setting them to DBA_BLACKLIST_DM6), as this will result in DiffBind searching all BSgenome objects for a match.

An alternative is to use the GreyListChiP package directly; you can build the greylists using only two or three commands (as DiffBind does internally). GreyListChIP can accect a karyotype, wither as an Seqinfo object or in a text file, with the names and lengths of all the chromosomes in your reference genome. You can use Rsamtools::scanBamHeader() to retrieve this information directly from your aligned bam files (in the \$targets).

Finally, there is an undocumented feature: if you do have the karyotype information in a Seqinfo object, you can pass that object as the value for greylist= when calling dba.blacklist() and it should work!

0
Entering edit mode

Hi Roy,

I followed your suggestion and set blacklist=TRUE and greylist=TRUE. My code is: tamoxifen <- dba.blacklist(tamoxifen, blacklist=TRUE, greylist=TRUE) But I got the error: Genome detected: Hsapiens.NCBI.GRCh38 Applying blacklist... Removed: 475 of 979966 intervals. Counting control reads for greylist... Error in value[3L] : GreyListChIP error: Error in tileGenome(obj@karyotype, tilewidth = tileSize/2): 'seqlengths' contains NAs or negative values In addition: Warning message: In BiocParallel::MulticoreParam(workers <- usecores) : MulticoreParam() not supported on Windows, use SnowParam()

Can you help me with this?

Thanks, Wei

0
Entering edit mode

I checked in a fix for this yesterday (3.0.10). The workaround is to set cores=1 when calling dba.blacklist()`.