Entering edit mode
Riba Michela ▴ 90
Last seen 17 months ago


I would like to perform Differential Binding Analysis in the following situation:

  • I have a dataset derived from experimental data I have processed on my own and for which I have already done the differential analysis using DiffBind
  • I know would like to integrate a second dataset derived from literature, for that I would like to use pre-calculated peak counts: I have a peak per sample dataframe woth additional colums with definition of the peaks (CHR, Start, End)

I used the function

A=dba(...) # contructed with metadata, bam files, peak files, ...
B=dba.peakset(dba=A,counts=...) #I put in the counts a dataframe of 4 coulms of which: (CHR, Start, End, Score)

However I noticed that in the B$binding I do not have the correct chromosomal position, which are NAs and in. addition to that the score regarding the added sample is taken as character

the same happens if I try to simply construct a dba.peakset without adding to a pre-existing dba object:

I hypothesize that there might be some issues in my column names and classes in the dataframe, affecting specifically the dba object$binding matrix

I found a similar issue for me when I prepared a Granges from the datafraem of dba$binding: in particular I found that in the calculated RPKM score matrix I had chr23 instead of chrX, may be I do not know how to fix

DC.RPKM <- dba.count(DBA = DC,score=DBA_SCORE_RPKM)

rpkm <- as.data.frame(DC.RPKM$binding)

rpkm$CHR <- gsub(pattern = "23",replacement = "X",rpkm.df$CHR)

rpkm.df <- data.frame(rpkm,                   row.names = paste0(rpkm$CHR,"_",rpkm$START,"_",rpkm$END))

rpkm.df.GR <- makeGRangesFromDataFrame(df = rpkm.df,seqnames.field = "CHR", start.field = "START",end.field = "END",keep.extra.columns = TRUE)

I would appreciate very much if you could help to fix because I found DiffBind really useful for our analyses!!

I would like to know

  • how to add a pre calculated count matrix for various samples to an existind dba object and add also a metadata related to the newly added samples to go on for a comprehensive differential analysis on two (or more) datasets
  • how to fix an issue on the resulting binding score matrix probably relatede to names of the chromosomes or names of the colums specifying the peak positions

I thank you very much,


sessionInfo( )

``` R version 4.0.3 (2020-10-10) Platform: x86_64-apple-darwin17.0 (64-bit) Running under: macOS Mojave 10.14.6

Matrix products: default BLAS: /System/Library/Frameworks/Accelerate.framework/Versions/A/Frameworks/vecLib.framework/Versions/A/libBLAS.dylib LAPACK: /Library/Frameworks/R.framework/Versions/4.0/Resources/lib/libRlapack.dylib

locale: [1] en_US.UTF-8/en_US.UTF-8/en_US.UTF-8/C/en_US.UTF-8/en_US.UTF-8

attached base packages: [1] parallel stats4 stats graphics grDevices utils datasets methods base

other attached packages: [1] DiffBind_3.0.15 SummarizedExperiment_1.20.0 Biobase_2.50.0 MatrixGenerics_1.2.1
[5] matrixStats_0.58.0 GenomicRanges_1.42.0 GenomeInfoDb_1.26.7 IRanges_2.24.1
[9] S4Vectors_0.28.1 BiocGenerics_0.36.0

loaded via a namespace (and not attached): [1] backports_1.2.1 GOstats_2.56.0 BiocFileCache_1.14.0 plyr_1.8.6
[5] GSEABase_1.52.1 splines_4.0.3 BiocParallel_1.24.1 ggplot2_3.3.3
[9] amap_0.8-18 digest_0.6.27 invgamma_1.1 GO.db_3.12.1
[13] SQUAREM_2021.1 fansi_0.4.2 magrittr_2.0.1 checkmate_2.0.0
[17] memoise_2.0.0 BSgenome_1.58.0 base64url_1.4 limma_3.46.0
[21] Biostrings_2.58.0 annotate_1.68.0 systemPipeR_1.24.3 askpass_1.1
[25] bdsmatrix_1.3-4 prettyunits_1.1.1 jpeg_0.1-8.1 colorspace_2.0-0
[29] blob_1.2.1 rappdirs_0.3.3 apeglm_1.12.0 ggrepel_0.9.1
[33] dplyr_1.0.5 crayon_1.4.1 RCurl_1.98-1.3 jsonlite_1.7.2
[37] graph_1.68.0 genefilter_1.72.1 brew_1.0-6 survival_3.2-10
[41] VariantAnnotation_1.36.0 glue_1.4.2 gtable_0.3.0 zlibbioc_1.36.0
[45] XVector_0.30.0 DelayedArray_0.16.3 V8_3.4.0 Rgraphviz_2.34.0
[49] scales_1.1.1 pheatmap_1.0.12 mvtnorm_1.1-1 DBI_1.1.1
[53] edgeR_3.32.1 Rcpp_1.0.6 xtable_1.8-4 progress_1.2.2
[57] emdbook_1.3.12 bit_4.0.4 rsvg_2.1 AnnotationForge_1.32.0
[61] truncnorm_1.0-8 httr_1.4.2 gplots_3.1.1 RColorBrewer_1.1-2
[65] ellipsis_0.3.1 pkgconfig_2.0.3 XML_3.99-0.6 dbplyr_2.1.1
[69] locfit_1.5-9.4 utf8_1.2.1 tidyselect_1.1.0 rlang_0.4.10
[73] AnnotationDbi_1.52.0 munsell_0.5.0 tools_4.0.3 cachem_1.0.4
[77] generics_0.1.0 RSQLite_2.2.6 stringr_1.4.0 fastmap_1.1.0
[81] yaml_2.2.1 bit64_4.0.5 caTools_1.18.2 purrr_0.3.4
[85] RBGL_1.66.0 xml2_1.3.2 biomaRt_2.46.3 compiler_4.0.3
[89] rstudioapi_0.13 curl_4.3 png_0.1-8 tibble_3.1.0
[93] stringi_1.5.3 GenomicFeatures_1.42.3 lattice_0.20-41 Matrix_1.3-2
[97] vctrs_0.3.7 pillar_1.6.0 lifecycle_1.0.0 irlba_2.3.3
[101] data.table_1.14.0 bitops_1.0-6 rtracklayer_1.50.0 R6_2.5.0
[105] latticeExtra_0.6-30 hwriter_1.3.2 ShortRead_1.48.0 KernSmooth_2.23-18
[109] MASS_7.3-53.1 gtools_3.8.2 assertthat_0.2.1 openssl_1.4.3
[113] Category_2.56.0 rjson_0.2.20 withr_2.4.1 GenomicAlignments_1.26.0 [117] batchtools_0.9.15 Rsamtools_2.6.0 GenomeInfoDbData_1.2.4 hms_1.0.0
[121] grid_4.0.3 DOT_0.1 coda_0.19-4 GreyListChIP_1.22.0
[125] ashr_2.2-47 mixsqp_0.3-43 bbmle_1.0.23.1 numDeriv_2020.2-1

ATACSeq DiFFBind • 849 views
Entering edit mode
Rory Stark ★ 4.9k
Last seen 8 hours ago
CRUK, Cambridge, UK

I'll answer the second question first. You should not accessing the $binding matrix directly -- the chromosome names are factor numbers, not chromosome numbers. The "correct" way to retrieve the binding matrix is using dba.peakset():

rpkm.df.GR <- dba.peakset(DC.RPKM, bRetrieve=TRUE)

which will return a GRanges object with all the consensus peaks, including the correct chromosome names, as well as the associated RPKM values.

There are a couple of issues regarding adding count data directly to DiffBind. The counts must exactly correspond to the intervals in the binding matrix. When you add them by calling dba.peakset() using the counts= parameter, you have to add them one sample at a time. So instead of a matrix of counts, you make a call for each column in the count matrix, calling dba.peakset() separately for each sample. (This way you can also supply district metadata information appropriate for each sample).

Entering edit mode


thank you so much for your kind reply.

  • I would first go and check my calculations about the extraction of the "binding" matrix, using the correct procedure you have highlighted, using your function dba.peakset()
  • regarding this second point I have exacly followed that suggestion, i.e. enter one "sample" from a matrix at time,

`metadata <- data.frame(SampleID=colnames(Vi.part)[c(4:12)],Tissue=as.factor(c(rep("a",3),rep("b",3), rep("c",3))),Replicate=as.factor(rep(c("1","2","3"),3)),row.names = colnames(Vi.part)[c(4:12)]) # here I create the metadata starting from the columns of the data count matrix

dba.1 <- dba.peakset(peaks = Vi.part.df.list[[1]],sampID=names(Vi.part.df.list)[[1]],tissue = metadata$Tissue[1],replicate = metadata$Replicate[1]) # I create the first peakset using counts related to the first sample and providing metadata

dba.1 <- dba(dba.1) # I create the dba object

dba.2 <- dba.peakset(DBA = dba.1,peaks = Vi.part.df.list[[2]],sampID=names(Vi.part.df.list)[[2]],tissue = metadata$Tissue[2],replicate = metadata$Replicate[2]) # I add the second peakset to the first dba object`

My final goal would be merging two datasets (one processed using DiffBind from raw data , and the second derived from a matrix of public data, usually counts in defined inytervals) into one Diffbind object and process them together

Thank you so much! And sorry if I have added another question!


Entering edit mode

I'm having a little bit of trouble following your code, as I'm not sure exactly what, e.g. Vi.part.df.list looks like, it I can make some comments:

  • In order to add counts directly, every sample must have exactly the same peak intervals. I notice that in one case you are passing peaks as peaks=Vi.part.df.list[[1]], while in the second case you are passing peaks as peaks=Vi.part.df.list[[2]], which suggests there may be different peaksets for different samples. (The value for peaks= should be a GRanges object or a dataframe with the Chr, Start, and End, specifying only the intervals, not the count values.

  • I don't see where you are passing in the actual counts. You should include a counts= parameter set to either a vector of count values (same length as number of intervals in the peaks parameter) or a filename where the counts are stored.

  • Call dba.peakset() once for every sample in the second dataset. Alternatively, you can use dba.count() to count the first dataset then successively call dba.peakset() for each sample in the second dataset, so long as the intervals match exactly.

  • You don't need to call dba() after each call; dba.peakset() returns a DBA object with the new sample incorporated into it.

  • When I do an analysis using pre-computed read counts, I store each of the count vectors in separate files and point to them in the sample sheet (using the Counts column) and just load the whole thing using dba() with the samplesheet= parameter set to the samplesheet.

Entering edit mode


I thank you very much, I am going to prepare the samplesheet, because I understand from your explanation that is the most efficient way to read in a pre-computed dataset. I am just adding some information about what my previous objects look like, sorry for I did not in advance and some required explanations: I am starting from a matrix of pre-computed counts on the same intervals

   Chr  Start    End 4983-HSC 4983-MPP 4983-LMPP 4983-CMP
1 chr1  10025  10525        1        0         4        2
2 chr1  13252  13752        0        5         6       11
3 chr1  16019  16519        4        2         6        8
4 chr1  96376  96876        0        0         0        0
5 chr1 115440 115940        0        3         0        0
6 chr1 235393 235893        0        2         2        3

List of 9
 $ 4983-HSC  :'data.frame': 100 obs. of  4 variables:
  ..$ Chr     : chr [1:100] "chr1" "chr1" "chr1" "chr1" ...
  ..$ Start   : int [1:100] 10025 13252 16019 96376 115440 235393 236835 237535 240811 243871 ...
  ..$ End     : int [1:100] 10525 13752 16519 96876 115940 235893 237335 238035 241311 244371 ...
  ..$ 4983-HSC: int [1:100] 1 0 4 0 0 0 0 6 0 0 

in which each component of the list contains the Chr, Start, End columns and a "sample" count column, names as the name of the sample. Afterwards I prepare a metadata, which looks like that

             SampleID Tissue Replicate
4983-HSC     4983-HSC      a         1
4983-MPP     4983-MPP      a         2
4983-LMPP   4983-LMPP      a         3
4983-CMP     4983-CMP      b         1
4983-CMP.1 4983-CMP.1      b         2
4983-GMP     4983-GMP      b         3

Thta following part is the problematic as you are pointing:

dba.1 <- dba.peakset(peaks = Vi.part.df.list[[1]],sampID=names(Vi.part.df.list)[[1]],tissue = metadata$Tissue[1],replicate = metadata$Replicate[1]) # creo il primo peakset

Actually here I did not explicitly enter the count parameter but the peaks instead, maybe I did not understand correctly from here: https://rdrr.io/bioc/DiffBind/man/dba.peakset.html where I found the following:

peaks When adding a specified peakset: set of peaks, either a GRanges object, or a peak dataframe or matrix (chr,start,end,score),

and for this reason since I exacly constructed a list of dataframes with Chromosome, Start, End an something as "score" I thoght that could be sufficient

Than I prepared the first dba from the first peakset

dba.1 <- dba(dba.1)# 

on which I wanted to do what you were suggesting, adding the additional samples on that first, by adding one by one a peakset

dba.2 <- dba.peakset(DBA = dba.1,peaks = Vi.part.df.list[[2]],sampID=names(Vi.part.df.list)[[2]],tissue = metadata$Tissue[2],replicate = metadata$Replicate[2])

dba.3 <- dba.peakset(DBA = dba.2,peaks = Vi.part.df.list[[3]],sampID=names(Vi.part.df.list)[[3]],tissue = metadata$Tissue[3],replicate = metadata$Replicate[3])

and so on; however I did not find this an efficient way of upload, I am sure I'd better go and divide the dataset in file and upload the way you suggested

Afterward however I would like to merge this dataset with an experimental one which in the end could be let say summarized into a matrix but with different intervals, beacuse I have processed the bam file and Macs using your DiffBind. Considering I would arrive to have 2 dba objects (the pre-computed matrix) and the "analysed" experimental samples which is the safer way to combine those datasets into a kind of integrative analysis?

Thank you very much!!



Login before adding your answer.

Traffic: 216 users visited in the last hour
Help About
Access RSS

Use of this site constitutes acceptance of our User Agreement and Privacy Policy.

Powered by the version 2.3.6