DiffBind
1
0
Entering edit mode
Riba Michela ▴ 80
@riba-michela-6472
Last seen 9 hours ago
Italy

Goodmorning,

I would like to perform Differential Binding Analysis in the following situation:

  • I have a dataset derived from experimental data I have processed on my own and for which I have already done the differential analysis using DiffBind
  • I know would like to integrate a second dataset derived from literature, for that I would like to use pre-calculated peak counts: I have a peak per sample dataframe woth additional colums with definition of the peaks (CHR, Start, End)

I used the function

library(DiffBind)
A=dba(...) # contructed with metadata, bam files, peak files, ...
B=dba.peakset(dba=A,counts=...) #I put in the counts a dataframe of 4 coulms of which: (CHR, Start, End, Score)

However I noticed that in the B$binding I do not have the correct chromosomal position, which are NAs and in. addition to that the score regarding the added sample is taken as character

the same happens if I try to simply construct a dba.peakset without adding to a pre-existing dba object:

I hypothesize that there might be some issues in my column names and classes in the dataframe, affecting specifically the dba object$binding matrix

I found a similar issue for me when I prepared a Granges from the datafraem of dba$binding: in particular I found that in the calculated RPKM score matrix I had chr23 instead of chrX, may be I do not know how to fix

library(DiffBind)
DC.RPKM <- dba.count(DBA = DC,score=DBA_SCORE_RPKM)

rpkm <- as.data.frame(DC.RPKM$binding)

rpkm$CHR <- gsub(pattern = "23",replacement = "X",rpkm.df$CHR)

rpkm.df <- data.frame(rpkm,                   row.names = paste0(rpkm$CHR,"_",rpkm$START,"_",rpkm$END))

library(GenomicRanges)
rpkm.df.GR <- makeGRangesFromDataFrame(df = rpkm.df,seqnames.field = "CHR", start.field = "START",end.field = "END",keep.extra.columns = TRUE)

I would appreciate very much if you could help to fix because I found DiffBind really useful for our analyses!!

I would like to know

  • how to add a pre calculated count matrix for various samples to an existind dba object and add also a metadata related to the newly added samples to go on for a comprehensive differential analysis on two (or more) datasets
  • how to fix an issue on the resulting binding score matrix probably relatede to names of the chromosomes or names of the colums specifying the peak positions

I thank you very much,

Michela

sessionInfo( )

``` R version 4.0.3 (2020-10-10) Platform: x86_64-apple-darwin17.0 (64-bit) Running under: macOS Mojave 10.14.6

Matrix products: default BLAS: /System/Library/Frameworks/Accelerate.framework/Versions/A/Frameworks/vecLib.framework/Versions/A/libBLAS.dylib LAPACK: /Library/Frameworks/R.framework/Versions/4.0/Resources/lib/libRlapack.dylib

locale: [1] en_US.UTF-8/en_US.UTF-8/en_US.UTF-8/C/en_US.UTF-8/en_US.UTF-8

attached base packages: [1] parallel stats4 stats graphics grDevices utils datasets methods base

other attached packages: [1] DiffBind_3.0.15 SummarizedExperiment_1.20.0 Biobase_2.50.0 MatrixGenerics_1.2.1
[5] matrixStats_0.58.0 GenomicRanges_1.42.0 GenomeInfoDb_1.26.7 IRanges_2.24.1
[9] S4Vectors_0.28.1 BiocGenerics_0.36.0

loaded via a namespace (and not attached): [1] backports_1.2.1 GOstats_2.56.0 BiocFileCache_1.14.0 plyr_1.8.6
[5] GSEABase_1.52.1 splines_4.0.3 BiocParallel_1.24.1 ggplot2_3.3.3
[9] amap_0.8-18 digest_0.6.27 invgamma_1.1 GO.db_3.12.1
[13] SQUAREM_2021.1 fansi_0.4.2 magrittr_2.0.1 checkmate_2.0.0
[17] memoise_2.0.0 BSgenome_1.58.0 base64url_1.4 limma_3.46.0
[21] Biostrings_2.58.0 annotate_1.68.0 systemPipeR_1.24.3 askpass_1.1
[25] bdsmatrix_1.3-4 prettyunits_1.1.1 jpeg_0.1-8.1 colorspace_2.0-0
[29] blob_1.2.1 rappdirs_0.3.3 apeglm_1.12.0 ggrepel_0.9.1
[33] dplyr_1.0.5 crayon_1.4.1 RCurl_1.98-1.3 jsonlite_1.7.2
[37] graph_1.68.0 genefilter_1.72.1 brew_1.0-6 survival_3.2-10
[41] VariantAnnotation_1.36.0 glue_1.4.2 gtable_0.3.0 zlibbioc_1.36.0
[45] XVector_0.30.0 DelayedArray_0.16.3 V8_3.4.0 Rgraphviz_2.34.0
[49] scales_1.1.1 pheatmap_1.0.12 mvtnorm_1.1-1 DBI_1.1.1
[53] edgeR_3.32.1 Rcpp_1.0.6 xtable_1.8-4 progress_1.2.2
[57] emdbook_1.3.12 bit_4.0.4 rsvg_2.1 AnnotationForge_1.32.0
[61] truncnorm_1.0-8 httr_1.4.2 gplots_3.1.1 RColorBrewer_1.1-2
[65] ellipsis_0.3.1 pkgconfig_2.0.3 XML_3.99-0.6 dbplyr_2.1.1
[69] locfit_1.5-9.4 utf8_1.2.1 tidyselect_1.1.0 rlang_0.4.10
[73] AnnotationDbi_1.52.0 munsell_0.5.0 tools_4.0.3 cachem_1.0.4
[77] generics_0.1.0 RSQLite_2.2.6 stringr_1.4.0 fastmap_1.1.0
[81] yaml_2.2.1 bit64_4.0.5 caTools_1.18.2 purrr_0.3.4
[85] RBGL_1.66.0 xml2_1.3.2 biomaRt_2.46.3 compiler_4.0.3
[89] rstudioapi_0.13 curl_4.3 png_0.1-8 tibble_3.1.0
[93] stringi_1.5.3 GenomicFeatures_1.42.3 lattice_0.20-41 Matrix_1.3-2
[97] vctrs_0.3.7 pillar_1.6.0 lifecycle_1.0.0 irlba_2.3.3
[101] data.table_1.14.0 bitops_1.0-6 rtracklayer_1.50.0 R6_2.5.0
[105] latticeExtra_0.6-30 hwriter_1.3.2 ShortRead_1.48.0 KernSmooth_2.23-18
[109] MASS_7.3-53.1 gtools_3.8.2 assertthat_0.2.1 openssl_1.4.3
[113] Category_2.56.0 rjson_0.2.20 withr_2.4.1 GenomicAlignments_1.26.0 [117] batchtools_0.9.15 Rsamtools_2.6.0 GenomeInfoDbData_1.2.4 hms_1.0.0
[121] grid_4.0.3 DOT_0.1 coda_0.19-4 GreyListChIP_1.22.0
[125] ashr_2.2-47 mixsqp_0.3-43 bbmle_1.0.23.1 numDeriv_2020.2-1

ATACSeq DiFFBind • 118 views
ADD COMMENT
0
Entering edit mode
Rory Stark ★ 3.9k
@rory-stark-5741
Last seen 7 hours ago
CRUK, Cambridge, UK

I'll answer the second question first. You should not accessing the $binding matrix directly -- the chromosome names are factor numbers, not chromosome numbers. The "correct" way to retrieve the binding matrix is using dba.peakset():

rpkm.df.GR <- dba.peakset(DC.RPKM, bRetrieve=TRUE)

which will return a GRanges object with all the consensus peaks, including the correct chromosome names, as well as the associated RPKM values.

There are a couple of issues regarding adding count data directly to DiffBind. The counts must exactly correspond to the intervals in the binding matrix. When you add them by calling dba.peakset() using the counts= parameter, you have to add them one sample at a time. So instead of a matrix of counts, you make a call for each column in the count matrix, calling dba.peakset() separately for each sample. (This way you can also supply district metadata information appropriate for each sample).

ADD COMMENT
0
Entering edit mode

Hi,

thank you so much for your kind reply.

  • I would first go and check my calculations about the extraction of the "binding" matrix, using the correct procedure you have highlighted, using your function dba.peakset()
  • regarding this second point I have exacly followed that suggestion, i.e. enter one "sample" from a matrix at time,

`metadata <- data.frame(SampleID=colnames(Vi.part)[c(4:12)],Tissue=as.factor(c(rep("a",3),rep("b",3), rep("c",3))),Replicate=as.factor(rep(c("1","2","3"),3)),row.names = colnames(Vi.part)[c(4:12)]) # here I create the metadata starting from the columns of the data count matrix

dba.1 <- dba.peakset(peaks = Vi.part.df.list[[1]],sampID=names(Vi.part.df.list)[[1]],tissue = metadata$Tissue[1],replicate = metadata$Replicate[1]) # I create the first peakset using counts related to the first sample and providing metadata

dba.1 <- dba(dba.1) # I create the dba object

dba.2 <- dba.peakset(DBA = dba.1,peaks = Vi.part.df.list[[2]],sampID=names(Vi.part.df.list)[[2]],tissue = metadata$Tissue[2],replicate = metadata$Replicate[2]) # I add the second peakset to the first dba object`

My final goal would be merging two datasets (one processed using DiffBind from raw data , and the second derived from a matrix of public data, usually counts in defined inytervals) into one Diffbind object and process them together

Thank you so much! And sorry if I have added another question!

Michela

ADD REPLY
0
Entering edit mode

I'm having a little bit of trouble following your code, as I'm not sure exactly what, e.g. Vi.part.df.list looks like, it I can make some comments:

  • In order to add counts directly, every sample must have exactly the same peak intervals. I notice that in one case you are passing peaks as peaks=Vi.part.df.list[[1]], while in the second case you are passing peaks as peaks=Vi.part.df.list[[2]], which suggests there may be different peaksets for different samples. (The value for peaks= should be a GRanges object or a dataframe with the Chr, Start, and End, specifying only the intervals, not the count values.

  • I don't see where you are passing in the actual counts. You should include a counts= parameter set to either a vector of count values (same length as number of intervals in the peaks parameter) or a filename where the counts are stored.

  • Call dba.peakset() once for every sample in the second dataset. Alternatively, you can use dba.count() to count the first dataset then successively call dba.peakset() for each sample in the second dataset, so long as the intervals match exactly.

  • You don't need to call dba() after each call; dba.peakset() returns a DBA object with the new sample incorporated into it.

  • When I do an analysis using pre-computed read counts, I store each of the count vectors in separate files and point to them in the sample sheet (using the Counts column) and just load the whole thing using dba() with the samplesheet= parameter set to the samplesheet.

ADD REPLY

Login before adding your answer.

Traffic: 422 users visited in the last hour
Help About
FAQ
Access RSS
API
Stats

Use of this site constitutes acceptance of our User Agreement and Privacy Policy.

Powered by the version 2.3.6