Question

how to construct a DBA object from count matrix

0

Entering edit mode

pegah.taklifi ▴ 30

@pegahtaklifi-24293

Last seen 2.6 years ago

Iran

Hello I have a question regarding DiffBind analyzing of ATACseq data . I have ATACseq counts of control and case samples in a reference peak set as a count matrix where row are regions and columns are sample IDs,I do not have access to bam files nor peak calls of these samples . Is it possible to use DiffBind to identify regions that are differentially accessible between case and control ? if so how can I construct a DBA object from these count matrices ?

ChIPSeq ATACSeq differentially_occupied DiffBind • 2.6k views

ADD COMMENT • link updated 2.5 years ago by Johanna Hardin ▴ 20 • written 3.4 years ago by pegah.taklifi ▴ 30

0

Entering edit mode

Cross-posted: https://www.biostars.org/p/9480926/

ADD REPLY • link 3.4 years ago Kevin Blighe ★ 4.0k

0

Entering edit mode

ok,I deleted the post on biostars.

ADD REPLY • link 3.4 years ago pegah.taklifi ▴ 30

0

Entering edit mode

It was no problem - just making sure that users do not 'double up' on efforts. As far as I know, the DiffBind developer logs in occasionally here; so, we should expect an answer eventually.

ADD REPLY • link 3.4 years ago Kevin Blighe ★ 4.0k

1

Entering edit mode

Rory Stark ★ 5.2k

@rory-stark-5741

Last seen 5 weeks ago

Cambridge, UK

Regarding the performance, make sure you are on the very latest version (DiffBind_3.2.4) -- there is a performance improvement that hugely speeds up adding a lot of samples. However a 100,000 interval x 400 sample matrix will take a lot of memory, so if you have limited memory it will still take a long time no matter how you do it.

Regarding your code, I see two issues:

The column names need to be exact (including case): SampleID, Condition
In the count files you are writing out, you should not include the column headers, so set col.names=FALSE.

So building on the code snippet from my previous response:

peaks <- dba.peakset(tamoxifen, bRetrieve = TRUE, DataType=DBA_DATA_FRAME)
counts1 <- cbind(peaks[,1:3],countMatrix[,1])
counts2 <- cbind(peaks[,1:3],countMatrix[,2])

write.table(counts1, "samp1", row.names=FALSE, col.names=FALSE,sep="\t", 
            quote=FALSE)
write.table(counts2, "samp2", row.names=FALSE, col.names=FALSE,sep="\t", 
            quote=FALSE)

newDBA <- dba.peakset(NULL, peaks=peaks, sampID="samp1", 
                      condition="cond1", counts="samp1")
newDBA <- dba.peakset(newDBA, peaks=peaks, sampID="samp2", 
                      condition="cond2", counts="samp2")

newDBA

ADD COMMENT • link 3.4 years ago Rory Stark ★ 5.2k

0

Entering edit mode

Thank you very much

ADD REPLY • link 3.4 years ago pegah.taklifi ▴ 30

score 4 · Accepted Answer · 2021-07-19

4

Entering edit mode

Rory Stark ★ 5.2k

@rory-stark-5741

Last seen 5 weeks ago

Cambridge, UK

Yes you can do this (assuming you have the region coordinates, corresponding to the rows, in chromosome-start-end form.

There are two main ways to create the DBA object. You can do it with successive calls to dba.peakset(), adding the counts for each sample. For example:

data(tamoxifen_counts)
tamoxifen <- dba.count(tamoxifen, peaks=NULL, score=DBA_SCORE_READS)
counts <- dba.peakset(tamoxifen, bRetrieve = TRUE)

newDBA <- NULL
countMatrix <- mcols(counts)
for(sample in 1:ncol(countMatrix)) {
  newDBA <- dba.peakset(newDBA, peaks=counts, 
                        sampID=colnames(countMatrix)[sample],
                        counts=countMatrix[,sample])
}  

newDBA <- dba.contrast(newDBA,group1=c(1:2,8:9), group2=c(3:7,10:11))
dba.analyze(newDBA, bBlacklist = FALSE)

You can also add metadata by setting the tissue, factor, condition, treatment, and/or replicate parameters.

If you want to use a samplesheet to load the project using dba(), you can write each column of the count matrix to a separate file with four tab- or comma-delimited columns: chromosome, start, end, counts. The first three columns must be identical for all samples. The include these files in your samplesheet in a column headed Counts.

ADD COMMENT • link 3.4 years ago Rory Stark ★ 5.2k

0

Entering edit mode

Thank you for your response @roy-stark . The first approach you mentioned takes a very long time to run ( Im guessing it's because of the size of my data; 100,000 regions and 400 samples) then I tried using sample sheet to load dba object , I wrote each column of the count matrix in a separate file

for(  k in 1:ncol(countMatrix))
{m<- cbind(Pan_peaks[,1:3] , countMatrix[ ,k ])
  colnames(m)<- c("chromosome" ,"start" , "end" , "counts")
  write.table(m , paste("DiffBind_counts/" ,colnames(countMatrix)[k] ) 
              , row.names = FALSE , sep="\t" , quote=FALSE)
}

and here is a few lines of my sampleSheet :

 sampleID,Conditions,Counts
BRCA_1,Cancer,DiffBind_counts/BRCA_000CFD9F_ADDF_4304_9E60_6041549E189C_X017_S06_L011_B1_T1_P040
BRCA_2,Cancer,DiffBind_counts/BRCA_000CFD9F_ADDF_4304_9E60_6041549E189C_X017_S06_L012_B1_T2_P046
BRCA_3,Cancer,DiffBind_counts/BRCA_01112370_4F6F_4A20_9BE0_7975C3465268_X017_S04_L007_B1_T1_P042
BRCA_4,Cancer,DiffBind_counts/BRCA_01112370_4F6F_4A20_9BE0_7975C3465268_X017_S04_L008_B1_T2_P044
BRCA_5,Cancer,DiffBind_counts/BRCA_0142AAAC_FFE8_43B7_AB99_02F7A1740567_X022_S06_L057_B1_T1_P050

but when I try

newDBA<- dba(sampleSheet = "sampleSheet.csv")

I get the following message :

1   Cancer  NA counts
Error in sum(counts) : invalid 'type' (character) of argument
In addition: Warning messages:
1: In peakOrder(peaks[, 1], as.integer(peaks[, 2]), as.integer(peaks[,  :
  NAs introduced by coercion
2: In peakOrder(peaks[, 1], as.integer(peaks[, 2]), as.integer(peaks[,  :
  NAs introduced by coercion

I would appreciate it if you can point which part of my code is wrong . Thank you

ADD REPLY • link 3.4 years ago pegah.taklifi ▴ 30

0

Entering edit mode

@rory-stark-5741 thanks for this post to get us started on constructing a DBA object from counts. however, in your code, you use tamoxifen which is already a DBA object. does that mean that you start with a DBA object and replace the counts? we tried that, but it didn't work because we have a different number of regions than the tamoxifen object.

can you specify on which line of the code you have provided we should input the columns of our new file which is made up of chromosome / start / stop / a dozen columns of counts? or are we missing something altogether? do we need a separate peaks file? if so, where is it input to connect to the DBA object (in your code, it seems like the peaks file is generated elsewhere and contains a lot of metadata)? the big step we need is how to start with only the counts and not a DBA object.

thank you!!!

ADD REPLY • link 2.5 years ago Johanna Hardin ▴ 20