problem recasting InteractionSet into a DGEList
1
1
Entering edit mode
@marco-blanchette-17000
Last seen 7 months ago
US/Santa Cruz/Dovetail Genomics

Dear community,

We are trying to write wrappers for analysis around the popular cool and mcool data files. For example, I would like to use diffHic to analyze differential interactions between experiments and start directly from the processed interaction pair counts stored in these popular data format. I do not want to re-process the pairs form the alignment files. Following the vignettes for the GenomicInterations and diffHic, I am trying to create an InteractionSet object that would be recast into a DGEList for passing to some of the popular edgeR package functions.

I am attaching a script and a toy example that can be download here (1.8Gb) https://rb.gy/moxn6x. The script uses the cooler CLI found here https://github.com/mirnylab/cooler

The script successfully creates the InteractionSet object but something is not write as it errors when recasting into the DGEList. I can't seem to trace down what is missing...

Any suggestions would be greatly appreciated

library(diffHic)
library(GenomicInteractions)

getGI <- function(mcool.f,res) {
con <- pipe(paste('cooler ls',mcool.f))
close(con)

resolutions <- as.integer(unlist(lapply(strsplit(cool.fs,'/'),function(x) x[length(x)])))

## Should add a tryCatch for testing if resolution exists in file
cool.f <- cool.fs[resolutions == res]
chrs <- read.table(pipe(paste("cooler dump -t chroms",cool.f)))
colnames(chrs) <- c('chr','length')

## TODO: here we will roll over all chromosomes using mclapply
chr <- chrs$chr[21] cmd <- pipe(paste("cooler dump -t pixels -r",chr,"-r2",chr,"--header --join",cool.f)) interaction <- read.table(cmd,header=TRUE,fill=TRUE) gi <- GenomicInteractions(GRanges(interaction$chrom1,IRanges(interaction$start1,interaction$end1)),
GRanges(interaction$chrom2,IRanges(interaction$start2,interaction$end2)), interaction$count)
return(gi)
}

## This will be a series of mcool files
mcool.fs <- c("toy1.mcool","toy2.mcool")

## Get the GenomicInteractions and there counts from mcool files
gi.counts <- lapply(mcool.fs,getGI,res = 32000)

################################################################################
## Reformating the list of GenomicInteractions into an Interactions sets
################################################################################

## First, get the universe of genomic interactions in the list
gi <- unique(do.call(c,gi.counts))

mcols(gi) <- c()

## Get a dataframe of counts for each interaction pairs
counts <- do.call(cbind,lapply(gi.counts,function(x){

## Initialize a vector of counts 0 of the lenght of all interaction
counts <- as.integer(rep(0,length(gi)))

## Find the interactions in the common set intersecting with the current GI
mm <- findMatches(gi,x)

## Replace in the vector of counts for the universe, replace the intersection with the current counts
counts[queryHits(mm)] <- x$counts[subjectHits(mm)] ## Return return(counts) })) ## Asign the name of the Column to the sample colnames(counts) <- mcool.fs ## Get the library size lib.data <- DataFrame(lib.sizes=colSums(counts)) ## Create the interaction set data <- InteractionSet(counts, gi, colData=lib.data) ################################################################################ ## Trying to follow the diffHiC vignette ################################################################################ library(edgeR) ## Casting teh InteractionSet as a DGEList for using countCPM data.DGEL <- asDGEList(data) ## Throwing an error... Can't find why...  diffHic GenomicInteractions edgeR HiC cooler • 459 views ADD COMMENT 0 Entering edit mode here is the full error text and the session info Error:  > data.DGEL <- asDGEList(data) Error in assay(object, i = assay.id, withDimnames = FALSE) : 'assay(<InteractionSet>, i="character", ...)' invalid subscript 'i' 'counts' not in names(assays(<InteractionSet>)) In addition: Warning message: In .local(object, ...) : library sizes not found in 'totals', setting to NULL  sessionInfo(): > sessionInfo() R version 4.0.2 (2020-06-22) Platform: x86_64-pc-linux-gnu (64-bit) Running under: Ubuntu 18.04.5 LTS Matrix products: default BLAS: /usr/lib/x86_64-linux-gnu/blas/libblas.so.3.7.1 LAPACK: /usr/lib/x86_64-linux-gnu/lapack/liblapack.so.3.7.1 locale: [1] LC_CTYPE=C.UTF-8 LC_NUMERIC=C LC_TIME=C.UTF-8 [4] LC_COLLATE=C.UTF-8 LC_MONETARY=C.UTF-8 LC_MESSAGES=C.UTF-8 [7] LC_PAPER=C.UTF-8 LC_NAME=C LC_ADDRESS=C [10] LC_TELEPHONE=C LC_MEASUREMENT=C.UTF-8 LC_IDENTIFICATION=C attached base packages: [1] parallel stats4 stats graphics grDevices utils datasets [8] methods base other attached packages: [1] edgeR_3.30.3 limma_3.44.3 [3] GenomicInteractions_1.22.0 diffHic_1.20.0 [5] InteractionSet_1.16.0 SummarizedExperiment_1.18.2 [7] DelayedArray_0.14.1 matrixStats_0.56.0 [9] Biobase_2.48.0 GenomicRanges_1.40.0 [11] GenomeInfoDb_1.24.2 IRanges_2.22.2 [13] S4Vectors_0.26.1 BiocGenerics_0.34.0 loaded via a namespace (and not attached): [1] ProtGenerics_1.20.0 bitops_1.0-6 bit64_4.0.2 [4] RColorBrewer_1.1-2 progress_1.2.2 httr_1.4.2 [7] backports_1.1.9 tools_4.0.2 R6_2.4.1 [10] rpart_4.1-15 lazyeval_0.2.2 Hmisc_4.4-1 [13] DBI_1.1.0 Gviz_1.32.0 colorspace_1.4-1 [16] nnet_7.3-14 tidyselect_1.1.0 gridExtra_2.3 [19] prettyunits_1.1.1 bit_4.0.4 curl_4.3 [22] compiler_4.0.2 htmlTable_2.0.1 csaw_1.22.1 [25] rtracklayer_1.48.0 checkmate_2.0.0 scales_1.1.1 [28] askpass_1.1 rappdirs_0.3.1 stringr_1.4.0 [31] digest_0.6.25 Rsamtools_2.4.0 foreign_0.8-80 [34] XVector_0.28.0 dichromat_2.0-0 htmltools_0.5.0 [37] base64enc_0.1-3 jpeg_0.1-8.1 pkgconfig_2.0.3 [40] ensembldb_2.12.1 dbplyr_1.4.4 BSgenome_1.56.0 [43] htmlwidgets_1.5.1 rlang_0.4.7 rstudioapi_0.11 [46] RSQLite_2.2.0 generics_0.0.2 BiocParallel_1.22.0 [49] dplyr_1.0.2 VariantAnnotation_1.34.0 RCurl_1.98-1.2 [52] magrittr_1.5 GenomeInfoDbData_1.2.3 Formula_1.2-3 [55] Matrix_1.2-18 Rcpp_1.0.5 munsell_0.5.0 [58] Rhdf5lib_1.10.1 lifecycle_0.2.0 stringi_1.4.6 [61] zlibbioc_1.34.0 rhdf5_2.32.2 BiocFileCache_1.12.1 [64] grid_4.0.2 blob_1.2.1 crayon_1.3.4 [67] lattice_0.20-41 Biostrings_2.56.0 splines_4.0.2 [70] GenomicFeatures_1.40.1 hms_0.5.3 locfit_1.5-9.4 [73] knitr_1.29 pillar_1.4.6 igraph_1.2.5 [76] biomaRt_2.44.1 XML_3.99-0.5 glue_1.4.1 [79] biovizBase_1.36.0 latticeExtra_0.6-29 data.table_1.13.0 [82] png_0.1-7 vctrs_0.3.2 gtable_0.3.0 [85] openssl_1.4.2 purrr_0.3.4 assertthat_0.2.1 [88] ggplot2_3.3.2 xfun_0.16 AnnotationFilter_1.12.0 [91] survival_3.2-3 tibble_3.0.3 GenomicAlignments_1.24.0 [94] AnnotationDbi_1.50.3 memoise_1.1.0 cluster_2.1.0 [97] Rhtslib_1.20.0 ellipsis_0.3.1  ADD REPLY 0 Entering edit mode Aaron Lun ★ 27k @alun Last seen 3 hours ago The city by the bay The error and warning messages are quite informative here.  Error in assay(object, i = assay.id, withDimnames = FALSE) : 'assay(<InteractionSet>, i="character", ...)' invalid subscript 'i' 'counts' not in names(assays(<InteractionSet>))  I would say this is pretty clear; the object doesn't have an assay named "counts". You have a variable named counts, but this is not quite the same thing; the variable name doesn't count for anything when you construct the InteractionSet. The solution is pretty simple, just set the name: assayNames(data) <- "counts"  Or, during InteractionSet construction: data <- InteractionSet(list(counts=counts), gi, colData=lib.data)  Or if you don't want to rename anything, you could just set assay.id=1 in the asDGEList() call. The warning comes from the fact that your library sizes are stored in lib.sizes but asDGEList is looking in totals, probably for various historical reasons that I have long forgotten.  In addition: Warning message: In .local(object, ...) : library sizes not found in 'totals', setting to NULL  You can also tell asDGEList() to pull the library sizes from lib.sizes with: asDGEList(data, lib.sizes=data$lib.sizes)

0
Entering edit mode

Hey Aaron, thanks a lot! Is this something worth adding to your package? Ie getting a InteractionSet straight from .mcool, .cool or .hic?

0
Entering edit mode

I think that would be an excellent idea for a small package on its own. diffHic has a lot of technical debt that I need to eventually wade through and clean up - if you can make your own light-weight package (with a mind towards submitting it to Bioconductor), it would probably be a win-win for both of us. Who knows, if I can rely on your hypothetical package to make the ISets, I might be able to finally get rid of my BAM file-parsing code in diffHic.

0
Entering edit mode

If you don't want to make your own package, this could also fit well in GenomicInteractions - we already have some import functions but not for these newer formats since they didn't exist when we created the package and I've never got round to adding them. (Although, that would not really be 'lightweight', there's definitely a good argument for a standalone package)

0
Entering edit mode

Hi Aaron,

The dovetail team and I have a first draft of a package reading .cool and .mcool HDF5 contact matrix. I wrote it using the native HDF5 R package. if you want to take a look at the coolR package, install it from this repos: https://github.com/dovetail-genomics/coolR. You first need to install the packages InteractionSet and rhdf5 from bioconductor (once it's in BioC, the dependencies will be automatically taken care of):

BiocManager::install("InteractionSet")
BiocManager::install("rhdf5")


Then you can install it from GitHub using devtools:

install.packages('devtools')
devtools::install_github("git@github.com:dovetail-genomics/coolR.git")


library(diffHic)
library(edgeR)
library(coolR)

## You need ~1.25Gb of RAM per thread
cores <- detecCores()

## Reading the top level resolution in 2 multi-dimentional contact matrix
data <- cool2IntSet(c("./toy1.mcool","./toy2.mcool"),
res=1000,
cores=cores)

## Filtering low represted  pairs
keep <- aveLogCPM(asDGEList(data)) > 0

data <- data[keep,]


The plan is to submit it to BioConductor but right now the documentation is quite sparse, open it here, we are working on it but I wanted to share it with you in case you think we should add functionalities or modify something.

Let us know,

Best, Marco

0
Entering edit mode

Excellent, this looks like a good start - I'll take my comments to the GH repo.