Can't normalize samples in diffHic
1
0
Entering edit mode
arielpaulson ▴ 20
@arielpaulson-10099
Last seen 3.4 years ago
United States

Hi,

I'm trying to analyze existing contact matrices generated by the Juicer pipeline. Basically I have sparse matrices in 3-column txt format where rows are { row_idx, col_idx, cell_value }.

I have three replicates each in mut and wt, and wt has more reads than mut (342M reads vs 309M).

I followed the instructions I found in these posts: How to use diffHiC with existing contact matrices? and problem recasting InteractionSet into a DGEList, then had to debug from there, but these are a few years old and I don't know if the finer points of the object structures have changed in this time.

Basically the whole workflow runs without error, but in result$table, all mut/wt LFC< 0 and 99% of p-values are basically 0. It does not appear to normalize for library size differences.

Thanks, Ariel

library(diffHic)
library(Matrix)
library(edgeR)
library(csaw)

bed <- read.delim("bin_1000000.bed", header=FALSE)
colnames(bed) <- c("Chrom","Start","End","Bin")
bed.gr <- makeGRangesFromDataFrame(bed)

samples <- c("mut1","mut2","mut3","wt1","wt2","wt3")
S <- length(samples)
design <- cbind(mut=c(1,1,1,0,0,0), wt=c(0,0,0,1,1,1))

conn <- keep <- setNames(vector("list",S),samples)
for (i in 1:S) {
    d <- read.delim(paste0(samples[i],".matrix.gz"), header=FALSE)
    sm <- sparseMatrix(i=d[,1], j=d[,2], x=d[,3])
    conn[[i]] <- ContactMatrix(sm, bed.gr, bed.gr)
    keep[[i]] <- sm!=0
}

libsizes <- sapply(1:S, function(i) sum(conn[[i]]@matrix) )

libsizes
#      mut1      mut2      mut3       wt1       wt2       wt3
# 104674719 102308339 102586499 119988904  95368796 126548928

to.keep <- Reduce("|", keep)
iset <- lapply(conn, function(x) deflate(x, extract=to.keep) )

data <- do.call(cbind, iset)
interactions(data) <- as(interactions(data), "ReverseStrictGInteractions")
assayNames(data) <- "counts"
colnames(data) <- samples
colnames(attributes(data)$assays@data[[1]]) <- samples
data$totals <- libsizes

data <- normOffsets(data, method="loess", se.out=TRUE)
norm <- estimateDisp(asDGEList(data), design)
result <- glmQLFTest(glmQLFit(norm, design, robust=TRUE))

table(sign(result$table$logFC))
#     -1
# 446035

summary(result$table$PValue)
#      Min.   1st Qu.    Median      Mean   3rd Qu.      Max.
# 0.000e+00 0.000e+00 0.000e+00 1.270e-09 0.000e+00 6.957e-05



sessionInfo( )
R version 4.1.0 (2021-05-18)
Platform: x86_64-pc-linux-gnu (64-bit)
Running under: CentOS Linux 7 (Core)

Matrix products: default
BLAS:   /n/apps/CentOS7/install/r-4.1.0/lib64/R/lib/libRblas.so
LAPACK: /n/apps/CentOS7/install/r-4.1.0/lib64/R/lib/libRlapack.so

locale:
 [1] LC_CTYPE=en_US.UTF-8       LC_NUMERIC=C
 [3] LC_TIME=en_US.UTF-8        LC_COLLATE=en_US.UTF-8
 [5] LC_MONETARY=en_US.UTF-8    LC_MESSAGES=en_US.UTF-8
 [7] LC_PAPER=en_US.UTF-8       LC_NAME=C
 [9] LC_ADDRESS=C               LC_TELEPHONE=C
[11] LC_MEASUREMENT=en_US.UTF-8 LC_IDENTIFICATION=C

attached base packages:
[1] parallel  stats4    stats     graphics  grDevices utils     datasets
[8] methods   base

other attached packages:
 [1] openxlsx_4.2.3              csaw_1.26.0
 [3] edgeR_3.34.0                limma_3.48.0
 [5] Matrix_1.3-4                diffHic_1.24.0
 [7] InteractionSet_1.20.0       SummarizedExperiment_1.22.0
 [9] Biobase_2.52.0              MatrixGenerics_1.4.0
[11] matrixStats_0.59.0          GenomicRanges_1.44.0
[13] GenomeInfoDb_1.28.0         IRanges_2.26.0
[15] S4Vectors_0.30.0            BiocGenerics_0.38.0

loaded via a namespace (and not attached):
 [1] zip_2.2.0                Rcpp_1.0.6               compiler_4.1.0
 [4] restfulr_0.0.13          XVector_0.32.0           bitops_1.0-7
 [7] rhdf5filters_1.4.0       tools_4.1.0              zlibbioc_1.38.0
[10] metapod_1.0.0            rhdf5_2.36.0             lattice_0.20-44
[13] BSgenome_1.60.0          DelayedArray_0.18.0      rstudioapi_0.13
[16] yaml_2.2.1               GenomeInfoDbData_1.2.6   rtracklayer_1.52.0
[19] Biostrings_2.60.1        locfit_1.5-9.4           grid_4.1.0
[22] XML_3.99-0.6             BiocParallel_1.26.0      Rhdf5lib_1.14.1
[25] Rhtslib_1.24.0           Rsamtools_2.8.0          GenomicAlignments_1.28.0
[28] stringi_1.6.2            RCurl_1.98-1.3           crayon_1.4.1
[31] rjson_0.2.20             BiocIO_1.2.0
diffHic • 1.0k views
ADD COMMENT
0
Entering edit mode
Aaron Lun ★ 28k
@alun
Last seen 7 hours ago
The city by the bay

Your comparison isn't set up correctly. You have a cell-means model but you haven't supplied contrast= to glmQLFTest(), which means that you're using the default hypothesis, i.e., that the last coefficient is equal to zero. In your cell-means model, this means that you're actually testing whether the mean of wt is zero. This doesn't make any scientific sense; your contrast should be something like limma::makeContrasts(mut - wt, levels=design).

ADD COMMENT
0
Entering edit mode

I was having such a time trying to get the intermediate steps to work, I completely forgot the contrast object, of course... That fixed it!

ADD REPLY

Login before adding your answer.

Traffic: 318 users visited in the last hour
Help About
FAQ
Access RSS
API
Stats

Use of this site constitutes acceptance of our User Agreement and Privacy Policy.

Powered by the version 2.3.6