Question

DiffBind: reading in peaksets generated by Homer2

1

Entering edit mode

iliya.lefterov ▴ 10

@iliyalefterov-7211

Last seen 9.9 years ago

United States

Hi Rory,

Thanks for advising me to post the question to the main Bioconductor forum.

When trying to read in the peaksets with a command like this:

factorname =dba(sampleSheet="filename.csv")

I get this error message:

Error in Summary.factor(c(59L, 15L, 56L, 12L, 21L, 23L, 43L, 52L, 28L, :
‘max’ not meaningful for factors

I don’t have a PeakFormat column in the filename .csv file, and I am using BED files generated by Homer2. The BED files are plain tab delimited txt files with 4 columns; the 4th column has the peak scores in a format chr#-value.

I found that the deletion of the 4th column solves the problem and the peaksets are read in.

The question is if the peak scores are considered by DiffBind at all, and if the problem that arises later on (no way to get beyond factorname = dba.analyze(factorname) is a result of the reading in the peaksets?

Thanks in advance,

Iliya Lefterov

diffbind • 3.8k views

ADD COMMENT • link updated 9.6 years ago by sergio.espeso-gil • 0 • written 9.9 years ago by iliya.lefterov ▴ 10

score 1 · Answer 1 · 2015-01-05

1

Entering edit mode

Gord Brown ▴ 670

@gord-brown-5664

Last seen 3.9 years ago

United Kingdom

Hi,

Assuming you're using "pos2bed.pl" to generate the bed file, you should be aware that the 4th column "chrN-NN" is not the score. It's just the name of the peak (the NN'th peak on chromosome chrN). The 5th column is the score, but for some reason they're all set to 1. We can add support for HOMER format, but in the meantime you could use something like:

awk '{print $2,$3,$4,$1,$8,$5}' peaks.txt > peaks.bed

to generate a bed file with scores. (You'll need to remove the comment lines at the start of the file.)

Cheers,

- Gord

ADD COMMENT • link 9.9 years ago Gord Brown ▴ 670

0

Entering edit mode

The reason for the fifth column are all set to 1 is that I think there is a bug in the Perl script of pos2bed.pl? Not matter you use "-5" option or not (-5 : Set 5th column to the value 1 instead of value in 6th column of pos file), the fifth column will be 1 always.

ADD REPLY • link 6.8 years ago xiaofeiwang198266 • 0

0

Entering edit mode

Hi Gord, I think it should be $3-1 for the start position.

My understanding is that BED format is 0-based for the second column (chromStart), while it is 1-based for the peak txt file from HOMER. I double checked the script of "pos2bed.pl", which is "my $start = $line[2]-1;" to get start position. If I am not right, please correct me because I used $3-1 currently. Thanks a lot!

Also, could I ask a question about HOMER? Do you know how how HOMER calculates the score for a given peak? To be honest, I don't know the meaning of the "peak score". Here what I got from the HOMER manual as below. Actually, I used 6th column added to transfered BED file. What the difference between them to Diffbind? Thank you so... much!

Column 6: Normalized Tag Counts - number of tags found at the peak, normalized to 10 million total mapped tags (or defined by the user)

Column 8: Peak score (position adjusted reads from initial peak region - reads per position may be limited)

ADD REPLY • link 6.8 years ago xiaofeiwang198266 • 0

score 0 · Answer 2 · 2015-01-03

Hello Ilya-

As you discovered, the score column, if present, needs to be interpretable as a numeric value. In the default case, DiffBind looks in the fourth columns for a score.

If you remove the score column, then all the scores are set to 1. This results in binary score vectors for each sample, with a score of 1 for every peak that was called for that sample, and a score of -1 for peaks that were called in other samples but not that one. The scores are only used for clustering (correlation heatmaps and PCA plots) prior to the call to dba.count(). Scores based on actual read counts in every interval for every sample are used after that, particularly for dba.analyze(). So you lose very little by not having the more granular scores from the peak caller.

I'll add "support Homer2 format natively" to the list of possible features to add -- I can just strip off the "chr#-"part.

Cheers-

Rory

score 0 · Answer 3 · 2015-04-22

Hi all,

Another possibility if you don't want to play much with the "peaks.txt" file is to go to "pos2bed.pl" and change $line[0] by $line[7] in the command line (almost at the end of the script):

print $filePtr "$chr\t$start\t$end\t$line[0]\t$v\t$dir\n";

by

print $filePtr "$chr\t$start\t$end\t$line[7]\t$v\t$dir\n";

That will put the findPeaks score in the forth column. Hope this can help!