how to construct a DBA object from count matrix
2
0
Entering edit mode
@pegahtaklifi-24293
Last seen 10 days ago
Iran

Hello I have a question regarding DiffBind analyzing of ATACseq data . I have ATACseq counts of control and case samples in a reference peak set as a count matrix where row are regions and columns are sample IDs,I do not have access to bam files nor peak calls of these samples . Is it possible to use DiffBind to identify regions that are differentially accessible between case and control ? if so how can I construct a DBA object from these count matrices ?

ChIPSeq ATACSeq differentially_occupied DiffBind • 247 views
ADD COMMENT
0
Entering edit mode
ADD REPLY
0
Entering edit mode

ok,I deleted the post on biostars.

ADD REPLY
0
Entering edit mode

It was no problem - just making sure that users do not 'double up' on efforts. As far as I know, the DiffBind developer logs in occasionally here; so, we should expect an answer eventually.

ADD REPLY
4
Entering edit mode
Rory Stark ★ 4.0k
@rory-stark-5741
Last seen 1 day ago
CRUK, Cambridge, UK

Yes you can do this (assuming you have the region coordinates, corresponding to the rows, in chromosome-start-end form.

There are two main ways to create the DBA object. You can do it with successive calls to dba.peakset(), adding the counts for each sample. For example:

data(tamoxifen_counts)
tamoxifen <- dba.count(tamoxifen, peaks=NULL, score=DBA_SCORE_READS)
counts <- dba.peakset(tamoxifen, bRetrieve = TRUE)

newDBA <- NULL
countMatrix <- mcols(counts)
for(sample in 1:ncol(countMatrix)) {
  newDBA <- dba.peakset(newDBA, peaks=counts, 
                        sampID=colnames(countMatrix)[sample],
                        counts=countMatrix[,sample])
}  

newDBA <- dba.contrast(newDBA,group1=c(1:2,8:9), group2=c(3:7,10:11))
dba.analyze(newDBA, bBlacklist = FALSE)

You can also add metadata by setting the tissue, factor, condition, treatment, and/or replicate parameters.

If you want to use a samplesheet to load the project using dba(), you can write each column of the count matrix to a separate file with four tab- or comma-delimited columns: chromosome, start, end, counts. The first three columns must be identical for all samples. The include these files in your samplesheet in a column headed Counts.

ADD COMMENT
0
Entering edit mode

Thank you for your response @roy-stark . The first approach you mentioned takes a very long time to run ( Im guessing it's because of the size of my data; 100,000 regions and 400 samples) then I tried using sample sheet to load dba object , I wrote each column of the count matrix in a separate file

for(  k in 1:ncol(countMatrix))
{m<- cbind(Pan_peaks[,1:3] , countMatrix[ ,k ])
  colnames(m)<- c("chromosome" ,"start" , "end" , "counts")
  write.table(m , paste("DiffBind_counts/" ,colnames(countMatrix)[k] ) 
              , row.names = FALSE , sep="\t" , quote=FALSE)
}

and here is a few lines of my sampleSheet :

 sampleID,Conditions,Counts
BRCA_1,Cancer,DiffBind_counts/BRCA_000CFD9F_ADDF_4304_9E60_6041549E189C_X017_S06_L011_B1_T1_P040
BRCA_2,Cancer,DiffBind_counts/BRCA_000CFD9F_ADDF_4304_9E60_6041549E189C_X017_S06_L012_B1_T2_P046
BRCA_3,Cancer,DiffBind_counts/BRCA_01112370_4F6F_4A20_9BE0_7975C3465268_X017_S04_L007_B1_T1_P042
BRCA_4,Cancer,DiffBind_counts/BRCA_01112370_4F6F_4A20_9BE0_7975C3465268_X017_S04_L008_B1_T2_P044
BRCA_5,Cancer,DiffBind_counts/BRCA_0142AAAC_FFE8_43B7_AB99_02F7A1740567_X022_S06_L057_B1_T1_P050

but when I try

newDBA<- dba(sampleSheet = "sampleSheet.csv")

I get the following message :

1   Cancer  NA counts
Error in sum(counts) : invalid 'type' (character) of argument
In addition: Warning messages:
1: In peakOrder(peaks[, 1], as.integer(peaks[, 2]), as.integer(peaks[,  :
  NAs introduced by coercion
2: In peakOrder(peaks[, 1], as.integer(peaks[, 2]), as.integer(peaks[,  :
  NAs introduced by coercion

I would appreciate it if you can point which part of my code is wrong . Thank you

ADD REPLY
1
Entering edit mode
Rory Stark ★ 4.0k
@rory-stark-5741
Last seen 1 day ago
CRUK, Cambridge, UK

Regarding the performance, make sure you are on the very latest version (DiffBind_3.2.4) -- there is a performance improvement that hugely speeds up adding a lot of samples. However a 100,000 interval x 400 sample matrix will take a lot of memory, so if you have limited memory it will still take a long time no matter how you do it.

Regarding your code, I see two issues:

  1. The column names need to be exact (including case): SampleID, Condition
  2. In the count files you are writing out, you should not include the column headers, so set col.names=FALSE.

So building on the code snippet from my previous response:

peaks <- dba.peakset(tamoxifen, bRetrieve = TRUE, DataType=DBA_DATA_FRAME)
counts1 <- cbind(peaks[,1:3],countMatrix[,1])
counts2 <- cbind(peaks[,1:3],countMatrix[,2])

write.table(counts1, "samp1", row.names=FALSE, col.names=FALSE,sep="\t", 
            quote=FALSE)
write.table(counts2, "samp2", row.names=FALSE, col.names=FALSE,sep="\t", 
            quote=FALSE)

newDBA <- dba.peakset(NULL, peaks=peaks, sampID="samp1", 
                      condition="cond1", counts="samp1")
newDBA <- dba.peakset(newDBA, peaks=peaks, sampID="samp2", 
                      condition="cond2", counts="samp2")

newDBA
ADD COMMENT
0
Entering edit mode

Thank you very much

ADD REPLY

Login before adding your answer.

Traffic: 385 users visited in the last hour
Help About
FAQ
Access RSS
API
Stats

Use of this site constitutes acceptance of our User Agreement and Privacy Policy.

Powered by the version 2.3.6