Question: Exporting dba.count result as a data frame
0
gravatar for mm2489
4.6 years ago by
mm24890
United States
mm24890 wrote:

Hi 

 

I am a user of DiffBind and I'm having a hard time trying to export the result of dba.count as a data frame. Basically, I need a table of consensus peaks in rows and the counts for each sample in columns. How can I go about getting it?

 

Any help would be greatly appreciated.

diffbind • 2.0k views
ADD COMMENTlink modified 4.6 years ago • written 4.6 years ago by mm24890
Answer: Exporting dba.count result as a data frame
1
gravatar for Rory Stark
4.6 years ago by
Rory Stark3.0k
CRUK, Cambridge, UK
Rory Stark3.0k wrote:

Hi-

The easiest way to get the full matrix of counts is using the dba.peakset() function. This returns a GRanges object by default, but you can tell it to return a dataframe:

> counts <- dba.peakset(DBA, bRetrieve=T, DataType=DBA_DATA_FRAME)

If you actually want raw read counts, you have to change the default score (which is normalized counts after subtracting the control reads if present). you can change between scoring methods using dba.count() with peaks=NULL. For example:

> DBA <- dba.count(DBA, peaks=NULL, score=DBA_SCORE_READS)
> counts <- dba.peakset(DBA, bRetrieve=T, DataType=DBA_DATA_FRAME)

Also, if you just want the counts and not the actual site co-ordinates, you can strip them off:

> counts <- counts[,-(1:3)]

Hope this answers your question.

Cheers-

Rory

ADD COMMENTlink written 4.6 years ago by Rory Stark3.0k

Hi Rory. I am using DiffBind for counting reads. 

But I get the same count table even I specify score differently:

library(DiffBind)

samples<- read.csv("H3K27ac.csv")

H3K27ac<-  dba(sampleSheet="H3K27ac.csv")

## on HPC, I want to make it Parallel 
H3K27ac_InputSubtract<- dba.count(H3K27ac, minOverlap=3, 
                      fragmentSize = 200, bParallel = T,
                      score = "DBA_SCORE_READS_MINUS")

H3K27ac_noInputSubtract<- dba.count(H3K27ac, minOverlap=3, 
                    fragmentSize = 200, bParallel = T,
                    score = "DBA_SCORE_READS")

dba.peakset (H3K27ac_InputSubtract, bRetrieve=TRUE, DataType=DBA_DATA_FRAME, writeFile = "H3K27ac_InputSubtracted_counts.txt")
dba.peakset (H3K27ac_noInputSubtract, bRetrieve=TRUE, DataType=DBA_DATA_FRAME, writeFile = "H3K27ac_noInputSubtracted_counts.txt")

R version 3.1.0 (2014-04-10)

Platform: x86_64-unknown-linux-gnu (64-bit)


locale:

 [1] LC_CTYPE=en_US.UTF-8       LC_NUMERIC=C              

 [3] LC_TIME=en_US.UTF-8        LC_COLLATE=en_US.UTF-8    

 [5] LC_MONETARY=en_US.UTF-8    LC_MESSAGES=en_US.UTF-8   

 [7] LC_PAPER=en_US.UTF-8       LC_NAME=C                 

 [9] LC_ADDRESS=C               LC_TELEPHONE=C            

[11] LC_MEASUREMENT=en_US.UTF-8 LC_IDENTIFICATION=C       


attached base packages:

[1] stats4    parallel  stats     graphics  grDevices utils     datasets 

[8] methods   base     


other attached packages:

 [1] DiffBind_1.12.3         GenomicAlignments_1.2.2 Rsamtools_1.18.3       

 [4] Biostrings_2.34.1       XVector_0.6.0           limma_3.22.7           

 [7] GenomicRanges_1.18.4    GenomeInfoDb_1.2.5      IRanges_2.0.1          

[10] S4Vectors_0.4.0         BiocGenerics_0.12.1    


loaded via a namespace (and not attached):

 [1] amap_0.8-14        base64enc_0.1-3    BatchJobs_1.6      BBmisc_1.9        

 [5] BiocParallel_1.0.3 bitops_1.0-6       brew_1.0-6         caTools_1.17.1    

 [9] checkmate_1.6.3    codetools_0.2-14   DBI_0.3.1          digest_0.6.8      

[13] edgeR_3.8.6        fail_1.3           foreach_1.4.3      gdata_2.17.0      

[17] gplots_2.17.0      grid_3.1.0         gtools_3.5.0       iterators_1.0.8   

[21] KernSmooth_2.23-15 lattice_0.20-33    magrittr_1.5       RColorBrewer_1.1-2

[25] RSQLite_1.0.0      sendmailR_1.2-1    stringi_1.0-1      stringr_1.0.0     

[29] tools_3.1.0        zlibbioc_1.12.0   
ADD REPLYlink written 4.0 years ago by tangming2005150

In this case, the only thing you are changing is the inclusion of Control reads. The most likely explanation is that the Control reads aren't being properly specified.

If you could send me a copy of your object (H3K27ac_InputSubtract), I can have a look. You could also post the results of:

> H3K27ac_InputSubtract$peaks[[1]][1:20,]

One thing worth knowing is that you don't need to re-count all the data to change the score. You could perform the second operation very quickly by setting peaks=NULL as follows:

> H3K27ac_noInputSubtract<- dba.count(H3K27ac_InputSubtract, peaks=NULL, 
                                      score = "DBA_SCORE_READS")

Cheers-

Rory

ADD REPLYlink modified 4.0 years ago • written 4.0 years ago by Rory Stark3.0k
H3K27ac_InputSubtract$peaks[[1]][1:20,]

    Chr   Start     End Score      RPKM Reads     cRPKM cReads

1  chr1  752472  759042    86 1.0705038   220 0.7188937    134

2  chr1  845566  847753   -10 0.6285655    43 0.8541857     53

3  chr1  850918  861812    24 1.0124244   345 1.0385870    321

4  chr1  946879  951106   144 1.5731205   208 0.5336702     64

5  chr1  953846  963471    80 1.0097265   304 0.8202996    224

6  chr1  966517  978024    27 0.9418212   339 0.9556915    312

7  chr1  993603 1000013   -34 0.8179311   164 1.0887606    198

8  chr1 1002011 1005956   -24 0.7698524    95 1.0632250    119

9  chr1 1051185 1054117    44 1.1230631   103 0.7092728     59

10 chr1 1056673 1059141    53 1.4248804   110 0.8140572     57

11 chr1 1062761 1074546   136 0.9711456   358 0.6639703    222

12 chr1 1091609 1097213   157 1.7912756   314 0.9874765    157

13 chr1 1107283 1109601    -2 0.6895845    50 0.7907062     52

14 chr1 1114381 1122110    41 1.0257919   248 0.9440006    207

15 chr1 1133564 1149704    26 0.7823921   395 0.8058386    369

16 chr1 1162858 1177059    84 1.0377982   461 0.9357238    377

17 chr1 1216491 1224276    21 0.9855610   240 0.9915411    219

18 chr1 1225426 1249015   406 1.4352162  1059 0.9757283    653

19 chr1 1254386 1261492   175 1.6196016   360 0.9176388    185

20 chr1 1278700 1285390   143 1.4861586   311 0.8851327    168

I do see some negative numbers after exporting the count table, so the control reads 

are subtracted. 

 

==> H3K27ac_InputSubtracted_counts.txt <==

chr1    752472    759042    86    129    16    77    3    166    55    123    -27

chr1    845566    847753    -10    37    16    13    38    16    15    47    1

chr1    850918    861812    24    159    74    128    64    106    -17    181    -67

chr1    946879    951106    144    209    46    136    47    362    49    161    11

chr1    953846    963471    80    305    18    126    148    230    -27    190    -48

chr1    966517    978024    27    236    23    113    76    136    -28    88    -87

chr1    993603    1000013    -34    82    22    66    1    38    -12    77    -95

chr1    1002011    1005956    -24    61    12    46    -3    33    16    21    -47

chr1    1051185    1054117    44    41    16    41    15    61    9    63    -10

chr1    1056673    1059141    53    105    15    74    26    40    11    67    -39


==> H3K27ac_noInputSubtracted_counts.txt <==

chr1    752472    759042    86    129    16    77    3    166    55    123    -27

chr1    845566    847753    -10    37    16    13    38    16    15    47    1

chr1    850918    861812    24    159    74    128    64    106    -17    181    -67

chr1    946879    951106    144    209    46    136    47    362    49    161    11

chr1    953846    963471    80    305    18    126    148    230    -27    190    -48

chr1    966517    978024    27    236    23    113    76    136    -28    88    -87

chr1    993603    1000013    -34    82    22    66    1    38    -12    77    -95

chr1    1002011    1005956    -24    61    12    46    -3    33    16    21    -47

chr1    1051185    1054117    44    41    16    41    15    61    9    63    -10

chr1    1056673    1059141    53    105    15    74    26    40    11    67    -39
ADD REPLYlink modified 4.0 years ago • written 4.0 years ago by tangming2005150

You can try emailing me the object and I can try to debug it. I may not be able to debug it exactly as you are using an older version of DiffBind that is no longer supported, but I'll see what I can do.

-R

ADD REPLYlink written 4.0 years ago by Rory Stark3.0k

Thanks, Let me upgrade DiffbBind and come back to you later.

ADD REPLYlink written 4.0 years ago by tangming2005150

Hi Rory,

I upgraded DiffBind to the newest version, and I still get the same count table (input subtracted):

H3K27ac_InputSubtract<- dba.count(H3K27ac, minOverlap=3, 

                      fragmentSize = 200, bParallel = T,
                      score = "DBA_SCORE_READS_MINUS")

## faster

H3K27ac_noInputSubtract<- dba.count(H3K27ac_InputSubtract, peaks=NULL, 
                                    score = "DBA_SCORE_READS")

dba.peakset (H3K27ac_InputSubtract, bRetrieve=TRUE, DataType=DBA_DATA_FRAME, writeFile = "H3K27ac_InputSubtracted_counts.txt")
dba.peakset (H3K27ac_noInputSubtract, bRetrieve=TRUE, DataType=DBA_DATA_FRAME, writeFile = "H3K27ac_noInputSubtracted_counts.txt")


> sessionInfo()

R version 3.2.2 (2015-08-14)

Platform: x86_64-pc-linux-gnu (64-bit)

Running under: Red Hat Enterprise Linux Server release 6.5 (Santiago)


locale:

 [1] LC_CTYPE=en_US.UTF-8       LC_NUMERIC=C              

 [3] LC_TIME=en_US.UTF-8        LC_COLLATE=en_US.UTF-8    

 [5] LC_MONETARY=en_US.UTF-8    LC_MESSAGES=en_US.UTF-8   

 [7] LC_PAPER=en_US.UTF-8       LC_NAME=C                 

 [9] LC_ADDRESS=C               LC_TELEPHONE=C            

[11] LC_MEASUREMENT=en_US.UTF-8 LC_IDENTIFICATION=C       


attached base packages:

[1] stats4    parallel  stats     graphics  grDevices utils     datasets 

[8] methods   base     


other attached packages:

 [1] DiffBind_1.16.0            RSQLite_1.0.0             

 [3] DBI_0.3.1                  locfit_1.5-9.1            

 [5] GenomicAlignments_1.6.1    Rsamtools_1.22.0          

 [7] Biostrings_2.38.2          XVector_0.10.0            

 [9] limma_3.26.3               SummarizedExperiment_1.0.1

[11] Biobase_2.30.0             GenomicRanges_1.22.1      

[13] GenomeInfoDb_1.6.1         IRanges_2.4.5             

[15] S4Vectors_0.8.4            BiocGenerics_0.16.1       


loaded via a namespace (and not attached):

 [1] Rcpp_0.12.2            lattice_0.20-33        GO.db_3.2.2           

 [4] gtools_3.5.0           digest_0.6.8           plyr_1.8.3            

 [7] BatchJobs_1.6          futile.options_1.0.0   ShortRead_1.28.0      

[10] ggplot2_1.0.1          gplots_2.17.0          zlibbioc_1.16.0       

[13] GenomicFeatures_1.22.6 annotate_1.48.0        gdata_2.17.0          

[16] Matrix_1.2-3           checkmate_1.6.3        systemPipeR_1.4.7     

[19] proto_0.3-10           GOstats_2.36.0         splines_3.2.2         

[22] BiocParallel_1.4.0     stringr_1.0.0          pheatmap_1.0.7        

[25] RCurl_1.95-4.7         biomaRt_2.26.1         munsell_0.4.2         

[28] sendmailR_1.2-1        rtracklayer_1.30.1     base64enc_0.1-3       

[31] BBmisc_1.9             fail_1.3               edgeR_3.12.0          

[34] XML_3.98-1.3           AnnotationForge_1.12.1 MASS_7.3-45           

[37] bitops_1.0-6           grid_3.2.2             RBGL_1.46.0           

[40] xtable_1.8-0           GSEABase_1.32.0        gtable_0.1.2          

[43] magrittr_1.5           scales_0.3.0           graph_1.48.0          

[46] KernSmooth_2.23-15     amap_0.8-14            stringi_1.0-1         

[49] hwriter_1.3.2          reshape2_1.4.1         genefilter_1.52.0     

[52] latticeExtra_0.6-26    futile.logger_1.4.1    brew_1.0-6            

[55] rjson_0.2.15           lambda.r_1.1.7         RColorBrewer_1.1-2    

[58] tools_3.2.2            Category_2.36.0        survival_2.38-3       

[61] AnnotationDbi_1.32.2   colorspace_1.2-6       caTools_1.17.1   

Could you please just check your own data set to see if you can reproduce it? I am not sure how I can send my object to you.  Thanks!

 

Ming

 

ADD REPLYlink written 4.0 years ago by tangming2005150

Hi Rory,

I upgraded DiffBind to the newest version, and I still get the same count table (input subtracted):

H3K27ac_InputSubtract<- dba.count(H3K27ac, minOverlap=3, 

                      fragmentSize = 200, bParallel = T,
                      score = "DBA_SCORE_READS_MINUS")

## faster

H3K27ac_noInputSubtract<- dba.count(H3K27ac_InputSubtract, peaks=NULL, 
                                    score = "DBA_SCORE_READS")

dba.peakset (H3K27ac_InputSubtract, bRetrieve=TRUE, DataType=DBA_DATA_FRAME, writeFile = "H3K27ac_InputSubtracted_counts.txt")
dba.peakset (H3K27ac_noInputSubtract, bRetrieve=TRUE, DataType=DBA_DATA_FRAME, writeFile = "H3K27ac_noInputSubtracted_counts.txt")


> sessionInfo()

R version 3.2.2 (2015-08-14)

Platform: x86_64-pc-linux-gnu (64-bit)

Running under: Red Hat Enterprise Linux Server release 6.5 (Santiago)


locale:

 [1] LC_CTYPE=en_US.UTF-8       LC_NUMERIC=C              

 [3] LC_TIME=en_US.UTF-8        LC_COLLATE=en_US.UTF-8    

 [5] LC_MONETARY=en_US.UTF-8    LC_MESSAGES=en_US.UTF-8   

 [7] LC_PAPER=en_US.UTF-8       LC_NAME=C                 

 [9] LC_ADDRESS=C               LC_TELEPHONE=C            

[11] LC_MEASUREMENT=en_US.UTF-8 LC_IDENTIFICATION=C       


attached base packages:

[1] stats4    parallel  stats     graphics  grDevices utils     datasets 

[8] methods   base     


other attached packages:

 [1] DiffBind_1.16.0            RSQLite_1.0.0             

 [3] DBI_0.3.1                  locfit_1.5-9.1            

 [5] GenomicAlignments_1.6.1    Rsamtools_1.22.0          

 [7] Biostrings_2.38.2          XVector_0.10.0            

 [9] limma_3.26.3               SummarizedExperiment_1.0.1

[11] Biobase_2.30.0             GenomicRanges_1.22.1      

[13] GenomeInfoDb_1.6.1         IRanges_2.4.5             

[15] S4Vectors_0.8.4            BiocGenerics_0.16.1       


loaded via a namespace (and not attached):

 [1] Rcpp_0.12.2            lattice_0.20-33        GO.db_3.2.2           

 [4] gtools_3.5.0           digest_0.6.8           plyr_1.8.3            

 [7] BatchJobs_1.6          futile.options_1.0.0   ShortRead_1.28.0      

[10] ggplot2_1.0.1          gplots_2.17.0          zlibbioc_1.16.0       

[13] GenomicFeatures_1.22.6 annotate_1.48.0        gdata_2.17.0          

[16] Matrix_1.2-3           checkmate_1.6.3        systemPipeR_1.4.7     

[19] proto_0.3-10           GOstats_2.36.0         splines_3.2.2         

[22] BiocParallel_1.4.0     stringr_1.0.0          pheatmap_1.0.7        

[25] RCurl_1.95-4.7         biomaRt_2.26.1         munsell_0.4.2         

[28] sendmailR_1.2-1        rtracklayer_1.30.1     base64enc_0.1-3       

[31] BBmisc_1.9             fail_1.3               edgeR_3.12.0          

[34] XML_3.98-1.3           AnnotationForge_1.12.1 MASS_7.3-45           

[37] bitops_1.0-6           grid_3.2.2             RBGL_1.46.0           

[40] xtable_1.8-0           GSEABase_1.32.0        gtable_0.1.2          

[43] magrittr_1.5           scales_0.3.0           graph_1.48.0          

[46] KernSmooth_2.23-15     amap_0.8-14            stringi_1.0-1         

[49] hwriter_1.3.2          reshape2_1.4.1         genefilter_1.52.0     

[52] latticeExtra_0.6-26    futile.logger_1.4.1    brew_1.0-6            

[55] rjson_0.2.15           lambda.r_1.1.7         RColorBrewer_1.1-2    

[58] tools_3.2.2            Category_2.36.0        survival_2.38-3       

[61] AnnotationDbi_1.32.2   colorspace_1.2-6       caTools_1.17.1   

Could you please just check your own data set to see if you can reproduce it? I am not sure how I can send my object to you.  Thanks!

 

Ming

 

ADD REPLYlink written 4.0 years ago by tangming2005150

You didn't show that the counts are still different above, I assume you have checked?

I can not reproduce this with any of my own datasets  -- I have tried on a variety of previous and current versions.

Could you send me a link to your DBA object H3K27ac_InputSubtract? Easiest way is usually Dropbox or some other link. Send the link to my email (rory.stark@cruk.cam.ac.uk).

Cheers-

Rory

 

ADD REPLYlink written 4.0 years ago by Rory Stark3.0k

Yes, I checked, the tables are the same (counts after subtracting)

save(H3K27ac_InputSubtract, file = "H3K27ac_InputSubtract.rda")

I just sent you the object. Thanks very much!

ADD REPLYlink written 4.0 years ago by tangming2005150

I can't reproduce this with the object you sent (I tried on three different version of DiffBind). 

Here is the script I ran:

> library(DiffBind)
> load('H3K27ac_InputSubtract.rda')
> sessionInfo()$otherPkgs$DiffBind$Version
[1] "1.16.0"
> reads_minus <- dba.peakset(H3K27ac_InputSubtract,bRetrieve=T,
                             writeFile="H3K27ac_InputSubtract.txt",
                             DataType=DBA_DATA_FRAME)
> reads_minus[1:10,1:6]
    CHR   START     END TCGA-08-0352 TCGA-08-0359 TCGA-27-1833
1  chr1  752472  759042           86          129           16
2  chr1  845566  847753          -10           37           16
3  chr1  850918  861812           24          159           74
4  chr1  946879  951106          144          209           46
5  chr1  953846  963471           80          305           18
6  chr1  966517  978024           27          236           23
7  chr1  993603 1000013          -34           82           22
8  chr1 1002011 1005956          -24           61           12
9  chr1 1051185 1054117           44           41           16
10 chr1 1056673 1059141           53          105           15
> H3K27ac_noInputSubtract <- dba.count(H3K27ac_InputSubtract, peaks=NULL,
                                       score=DBA_SCORE_READS)
> reads_only <- dba.peakset(H3K27ac_noInputSubtract,bRetrieve=T,
                            writeFile="H3K27ac_noInputSubtract.txt",
                            DataType=DBA_DATA_FRAME)      
> reads_only[1:10,1:6]
    CHR   START     END TCGA-08-0352 TCGA-08-0359 TCGA-27-1833
1  chr1  752472  759042          220          224           41
2  chr1  845566  847753           43           90           29
3  chr1  850918  861812          345          491          145
4  chr1  946879  951106          208          302           58
5  chr1  953846  963471          304          512           75
6  chr1  966517  978024          339          555          113
7  chr1  993603 1000013          164          223           63
8  chr1 1002011 1005956           95          171           49
9  chr1 1051185 1054117          103          114           31
10 chr1 1056673 1059141          110          176           29
> reads.minus <- read.table("H3K27ac_InputSubtract.txt")
> reads_minus[1:10,1:6]
    CHR   START     END TCGA-08-0352 TCGA-08-0359 TCGA-27-1833
1  chr1  752472  759042           86          129           16
2  chr1  845566  847753          -10           37           16
3  chr1  850918  861812           24          159           74
4  chr1  946879  951106          144          209           46
5  chr1  953846  963471           80          305           18
6  chr1  966517  978024           27          236           23
7  chr1  993603 1000013          -34           82           22
8  chr1 1002011 1005956          -24           61           12
9  chr1 1051185 1054117           44           41           16
10 chr1 1056673 1059141           53          105           15
> reads.only <- read.table("H3K27ac_noInputSubtract.txt")
> reads_only[1:10,1:6]
    CHR   START     END TCGA-08-0352 TCGA-08-0359 TCGA-27-1833
1  chr1  752472  759042          220          224           41
2  chr1  845566  847753           43           90           29
3  chr1  850918  861812          345          491          145
4  chr1  946879  951106          208          302           58
5  chr1  953846  963471          304          512           75
6  chr1  966517  978024          339          555          113
7  chr1  993603 1000013          164          223           63
8  chr1 1002011 1005956           95          171           49
9  chr1 1051185 1054117          103          114           31
10 chr1 1056673 1059141          110          176           29

Can you run the same commands and see if you get the same results?

-R

ADD REPLYlink modified 4.0 years ago • written 4.0 years ago by Rory Stark3.0k

Thanks Rory!  I found out the problem... I specified 

score=DBA_SCORE_READS

to score="DBA_SCORE_READS"

I guess the quote is the evil...

ADD REPLYlink written 4.0 years ago by tangming2005150
Answer: Exporting dba.count result as a data frame
0
gravatar for mm2489
4.6 years ago by
mm24890
United States
mm24890 wrote:

Thank you so much Rory, that worked. Now I have my counts table as a data frame and I was able to perform the adjustment that needed to be done to the count values for a specific sample.

Now is it possible to convert this new counts table back to a DBA object to perform the downstream analyses on it (dba.contrast and  dba.analyze) ?

 

Thanks again for your help

ADD COMMENTlink written 4.6 years ago by mm24890

This is a bit less straightforward, as DiffBind does not provide a way to directly change the counts.

There is a feature that does allow you to read the counts into a new DBA object, but this is a bit of a backdoor. It can be done using dba() via the Counts field in the sample sheet, or programatically using dba.peakset() via the counts parameter. It  is probably the easiest in your case to use dba.peakset() and counts, see the man page. Basically you add peakset, one at a time, with the counts.

If you do it using a samplesheet, here are the steps.

  1. Save each sample's counts out to a separate text file. The text file should have two columns: a peak identifier (ie a unique number) and the counts you want. Every count file should have the same number of lines, with one count for each consensus peak.
  2. Create a new sample sheet, with the Peaks and PeakCaller columns stripped, replaced by a Counts column containing the filenames of the counts files.
  3. Load in the sample sheet using dba(). When it finishes, you don't have to run dba.count(), as the counts are already there.

The main downside of doing this as it it currently implemented is that the peak intervals are lost, all you have is the peak number, so when you identify peaks of interest, you have to go back and retrieve the peak interval from the number.

I do use this feature myself -- mostly for reading in RNA-seq gene counts so I can do differential analysis with both edgeR and DESeq2 and have all the plots I like.

Cheers-

Rory

ADD REPLYlink written 4.6 years ago by Rory Stark3.0k

Thanks

I made separate dba objects for my samples using the counts option in dba.peakset. Now how should I combine them all into one DBA?

ADD REPLYlink written 4.6 years ago by mm24890

The first time you call dba.peakset(), the first argument should be NULL, and ti will return a new DBA object. From then on, pass in that DBA object and it will add each peakset to the object.

This is actually what dba() does when processing a sample sheet, by the way. 

ADD REPLYlink modified 4.6 years ago • written 4.6 years ago by Rory Stark3.0k
Please log in to add an answer.

Help
Access

Use of this site constitutes acceptance of our User Agreement and Privacy Policy.
Powered by Biostar version 16.09
Traffic: 264 users visited in the last hour