Goodmorning,
I would like to perform Differential Binding Analysis in the following situation:
- I have a dataset derived from experimental data I have processed on my own and for which I have already done the differential analysis using DiffBind
- I know would like to integrate a second dataset derived from literature, for that I would like to use pre-calculated peak counts: I have a peak per sample dataframe woth additional colums with definition of the peaks (CHR, Start, End)
I used the function
library(DiffBind)
A=dba(...) # contructed with metadata, bam files, peak files, ...
B=dba.peakset(dba=A,counts=...) #I put in the counts a dataframe of 4 coulms of which: (CHR, Start, End, Score)
However I noticed that in the B$binding I do not have the correct chromosomal position, which are NAs and in. addition to that the score regarding the added sample is taken as character
the same happens if I try to simply construct a dba.peakset without adding to a pre-existing dba object:
I hypothesize that there might be some issues in my column names and classes in the dataframe, affecting specifically the dba object$binding matrix
I found a similar issue for me when I prepared a Granges from the datafraem of dba$binding: in particular I found that in the calculated RPKM score matrix I had chr23 instead of chrX, may be I do not know how to fix
library(DiffBind)
DC.RPKM <- dba.count(DBA = DC,score=DBA_SCORE_RPKM)
rpkm <- as.data.frame(DC.RPKM$binding)
rpkm$CHR <- gsub(pattern = "23",replacement = "X",rpkm.df$CHR)
rpkm.df <- data.frame(rpkm, row.names = paste0(rpkm$CHR,"_",rpkm$START,"_",rpkm$END))
library(GenomicRanges)
rpkm.df.GR <- makeGRangesFromDataFrame(df = rpkm.df,seqnames.field = "CHR", start.field = "START",end.field = "END",keep.extra.columns = TRUE)
I would appreciate very much if you could help to fix because I found DiffBind really useful for our analyses!!
I would like to know
- how to add a pre calculated count matrix for various samples to an existind dba object and add also a metadata related to the newly added samples to go on for a comprehensive differential analysis on two (or more) datasets
- how to fix an issue on the resulting binding score matrix probably relatede to names of the chromosomes or names of the colums specifying the peak positions
I thank you very much,
Michela
sessionInfo( )
``` R version 4.0.3 (2020-10-10) Platform: x86_64-apple-darwin17.0 (64-bit) Running under: macOS Mojave 10.14.6
Matrix products: default BLAS: /System/Library/Frameworks/Accelerate.framework/Versions/A/Frameworks/vecLib.framework/Versions/A/libBLAS.dylib LAPACK: /Library/Frameworks/R.framework/Versions/4.0/Resources/lib/libRlapack.dylib
locale: [1] en_US.UTF-8/en_US.UTF-8/en_US.UTF-8/C/en_US.UTF-8/en_US.UTF-8
attached base packages: [1] parallel stats4 stats graphics grDevices utils datasets methods base
other attached packages:
[1] DiffBind_3.0.15 SummarizedExperiment_1.20.0 Biobase_2.50.0 MatrixGenerics_1.2.1
[5] matrixStats_0.58.0 GenomicRanges_1.42.0 GenomeInfoDb_1.26.7 IRanges_2.24.1
[9] S4Vectors_0.28.1 BiocGenerics_0.36.0
loaded via a namespace (and not attached):
[1] backports_1.2.1 GOstats_2.56.0 BiocFileCache_1.14.0 plyr_1.8.6
[5] GSEABase_1.52.1 splines_4.0.3 BiocParallel_1.24.1 ggplot2_3.3.3
[9] amap_0.8-18 digest_0.6.27 invgamma_1.1 GO.db_3.12.1
[13] SQUAREM_2021.1 fansi_0.4.2 magrittr_2.0.1 checkmate_2.0.0
[17] memoise_2.0.0 BSgenome_1.58.0 base64url_1.4 limma_3.46.0
[21] Biostrings_2.58.0 annotate_1.68.0 systemPipeR_1.24.3 askpass_1.1
[25] bdsmatrix_1.3-4 prettyunits_1.1.1 jpeg_0.1-8.1 colorspace_2.0-0
[29] blob_1.2.1 rappdirs_0.3.3 apeglm_1.12.0 ggrepel_0.9.1
[33] dplyr_1.0.5 crayon_1.4.1 RCurl_1.98-1.3 jsonlite_1.7.2
[37] graph_1.68.0 genefilter_1.72.1 brew_1.0-6 survival_3.2-10
[41] VariantAnnotation_1.36.0 glue_1.4.2 gtable_0.3.0 zlibbioc_1.36.0
[45] XVector_0.30.0 DelayedArray_0.16.3 V8_3.4.0 Rgraphviz_2.34.0
[49] scales_1.1.1 pheatmap_1.0.12 mvtnorm_1.1-1 DBI_1.1.1
[53] edgeR_3.32.1 Rcpp_1.0.6 xtable_1.8-4 progress_1.2.2
[57] emdbook_1.3.12 bit_4.0.4 rsvg_2.1 AnnotationForge_1.32.0
[61] truncnorm_1.0-8 httr_1.4.2 gplots_3.1.1 RColorBrewer_1.1-2
[65] ellipsis_0.3.1 pkgconfig_2.0.3 XML_3.99-0.6 dbplyr_2.1.1
[69] locfit_1.5-9.4 utf8_1.2.1 tidyselect_1.1.0 rlang_0.4.10
[73] AnnotationDbi_1.52.0 munsell_0.5.0 tools_4.0.3 cachem_1.0.4
[77] generics_0.1.0 RSQLite_2.2.6 stringr_1.4.0 fastmap_1.1.0
[81] yaml_2.2.1 bit64_4.0.5 caTools_1.18.2 purrr_0.3.4
[85] RBGL_1.66.0 xml2_1.3.2 biomaRt_2.46.3 compiler_4.0.3
[89] rstudioapi_0.13 curl_4.3 png_0.1-8 tibble_3.1.0
[93] stringi_1.5.3 GenomicFeatures_1.42.3 lattice_0.20-41 Matrix_1.3-2
[97] vctrs_0.3.7 pillar_1.6.0 lifecycle_1.0.0 irlba_2.3.3
[101] data.table_1.14.0 bitops_1.0-6 rtracklayer_1.50.0 R6_2.5.0
[105] latticeExtra_0.6-30 hwriter_1.3.2 ShortRead_1.48.0 KernSmooth_2.23-18
[109] MASS_7.3-53.1 gtools_3.8.2 assertthat_0.2.1 openssl_1.4.3
[113] Category_2.56.0 rjson_0.2.20 withr_2.4.1 GenomicAlignments_1.26.0
[117] batchtools_0.9.15 Rsamtools_2.6.0 GenomeInfoDbData_1.2.4 hms_1.0.0
[121] grid_4.0.3 DOT_0.1 coda_0.19-4 GreyListChIP_1.22.0
[125] ashr_2.2-47 mixsqp_0.3-43 bbmle_1.0.23.1 numDeriv_2020.2-1
Hi,
thank you so much for your kind reply.
dba.peakset()
`metadata <- data.frame(SampleID=colnames(Vi.part)[c(4:12)],Tissue=as.factor(c(rep("a",3),rep("b",3), rep("c",3))),Replicate=as.factor(rep(c("1","2","3"),3)),row.names = colnames(Vi.part)[c(4:12)]) # here I create the metadata starting from the columns of the data count matrix
dba.1 <- dba.peakset(peaks = Vi.part.df.list[[1]],sampID=names(Vi.part.df.list)[[1]],tissue = metadata$Tissue[1],replicate = metadata$Replicate[1]) # I create the first peakset using counts related to the first sample and providing metadata
dba.1 <- dba(dba.1) # I create the dba object
dba.2 <- dba.peakset(DBA = dba.1,peaks = Vi.part.df.list[[2]],sampID=names(Vi.part.df.list)[[2]],tissue = metadata$Tissue[2],replicate = metadata$Replicate[2]) # I add the second peakset to the first dba object`
My final goal would be merging two datasets (one processed using DiffBind from raw data , and the second derived from a matrix of public data, usually counts in defined inytervals) into one Diffbind object and process them together
Thank you so much! And sorry if I have added another question!
Michela
I'm having a little bit of trouble following your code, as I'm not sure exactly what, e.g.
Vi.part.df.list
looks like, it I can make some comments:In order to add counts directly, every sample must have exactly the same peak intervals. I notice that in one case you are passing peaks as
peaks=Vi.part.df.list[[1]]
, while in the second case you are passing peaks aspeaks=Vi.part.df.list[[2]]
, which suggests there may be different peaksets for different samples. (The value forpeaks=
should be aGRanges
object or a dataframe with the Chr, Start, and End, specifying only the intervals, not the count values.I don't see where you are passing in the actual counts. You should include a
counts=
parameter set to either a vector of count values (same length as number of intervals in thepeaks
parameter) or a filename where the counts are stored.Call
dba.peakset()
once for every sample in the second dataset. Alternatively, you can usedba.count()
to count the first dataset then successively calldba.peakset()
for each sample in the second dataset, so long as the intervals match exactly.You don't need to call
dba()
after each call;dba.peakset()
returns a DBA object with the new sample incorporated into it.When I do an analysis using pre-computed read counts, I store each of the count vectors in separate files and point to them in the sample sheet (using the
Counts
column) and just load the whole thing usingdba()
with thesamplesheet=
parameter set to the samplesheet.Hi,
I thank you very much, I am going to prepare the samplesheet, because I understand from your explanation that is the most efficient way to read in a pre-computed dataset. I am just adding some information about what my previous objects look like, sorry for I did not in advance and some required explanations: I am starting from a matrix of pre-computed counts on the same intervals
in which each component of the list contains the Chr, Start, End columns and a "sample" count column, names as the name of the sample. Afterwards I prepare a metadata, which looks like that
Thta following part is the problematic as you are pointing:
Actually here I did not explicitly enter the count parameter but the peaks instead, maybe I did not understand correctly from here: https://rdrr.io/bioc/DiffBind/man/dba.peakset.html where I found the following:
and for this reason since I exacly constructed a list of dataframes with Chromosome, Start, End an something as "score" I thoght that could be sufficient
Than I prepared the first dba from the first peakset
on which I wanted to do what you were suggesting, adding the additional samples on that first, by adding one by one a peakset
and so on; however I did not find this an efficient way of upload, I am sure I'd better go and divide the dataset in file and upload the way you suggested
Afterward however I would like to merge this dataset with an experimental one which in the end could be let say summarized into a matrix but with different intervals, beacuse I have processed the bam file and Macs using your DiffBind. Considering I would arrive to have 2 dba objects (the pre-computed matrix) and the "analysed" experimental samples which is the safer way to combine those datasets into a kind of integrative analysis?
Thank you very much!!
Michela