CAGEr bug report: wrong consensus cluster results
2
0
Entering edit mode
Meng • 0
@b971d31d
Last seen 11 weeks ago
United States

I found a bug in the CAGEr package when dealing with consensus clusters: the assays and rowRanges are disordered for consensusClusters.

This is caused by the code in CumulativeDistributionFunctions.R from line 93 to 95 and line 99:

setMethod(".getCumsum", c("CTSS", "GRanges"), function(ctss, clusters, useMulticore , nrCores) {
  getCumSumChrStrand <- function(chrom) {
    plus.cumsum  <- .getCumsumChr2(clusters = clusters, ctss = ctss, chrom = chrom, str = "+")
    minus.cumsum <- .getCumsumChr2(clusters = clusters, ctss = ctss, chrom = chrom, str = "-")
    c(plus.cumsum, minus.cumsum)
  }
  clusters.cumsum <- unlist(bplapply( seqlevels(clusters), getCumSumChrStrand
                                    , BPPARAM = CAGEr_Multicore(useMulticore, nrCores)))
    names(clusters.cumsum) <- names(clusters)
    clusters.cumsum
})

This deals with '+' strand data first then '-' and concat the two results for the same chromosome together and assigned the names at line 99. These code assume the order of genome ranges are sorted by strand with all '+' ranges before '-' ranges for the same chromosome. Thus, if the ranges are mixed with disordered '+' and '-', the names(clusters.cumsum) will not correspond to names(clusters) and would generated disordered results.

Here is the situation when calling consensus clusters in the AggregationMethods.R line 163, the rs would be ordered by the as.factor(names), which would reorder the results by alphabet and disrupt the original order that all '+' ranges come before '-' ranges for the same chromosome. Thus, in line 172, the consensus.clusters[rownames(counts)] recorded the consensus clusters by alphabet. This leads to the conflict with the underlying assumption in the above CumulativeDistributionFunctions.R function and cause the bug.

setMethod( ".CCtoSE"
         , c(se = "RangedSummarizedExperiment")
         , function(se, consensus.clusters, tpmThreshold = 1) {
    if (is.null(assays(se)[["normalizedTpmMatrix"]]))
      stop("Needs normalised data; run ", sQuote("normalizeTagCount()"), " first.")
    if (is.null(rowRanges(se)$cluster))
      rowRanges(se)$cluster <- ranges2names(rowRanges(se), consensus.clusters)

    if (tpmThreshold > 0)
      se <- se[rowSums(DelayedArray(assays(se)[["normalizedTpmMatrix"]])) > tpmThreshold,]

    .rowsumAsMatrix <- function(DF, names) {
      rs <- rowsum(as.matrix(DelayedArray(DF)), as.factor(names))
      if (rownames(rs)[1] == "") # If some CTSS were not in clusters
        rs <- rs[-1, , drop = FALSE]
      rs
      }

    counts <- .rowsumAsMatrix(assays(se)[["counts"]], rowRanges(se)$cluster)
    norm   <- .rowsumAsMatrix(assays(se)[["normalizedTpmMatrix"]], rowRanges(se)$cluster)

      SummarizedExperiment( rowRanges = consensus.clusters[rownames(counts)]
                          , assays    = SimpleList( counts     = counts
                                                  , normalized = norm))
})

To solve this, reorder = FALSE should be added in rowsum at line 163. Thus, the code should change from

rs <- rowsum(as.matrix(DelayedArray(DF)), as.factor(names))

to

rs <- rowsum(as.matrix(DelayedArray(DF)), as.factor(names), reorder = FALSE)
CAGEr • 180 views
ADD COMMENT
0
Entering edit mode
Meng • 0
@b971d31d
Last seen 11 weeks ago
United States

Another bug in the start and end coordinations of consensus clusters. For AggregationMethods.R line 132 and 133

end(x)   <- mcols(x)[[paste0("q_", qUp) ]] + start(x)
start(x) <- mcols(x)[[paste0("q_", qLow)]] + start(x)

should be

end(x)   <- decode(mcols(x)[[paste0("q_", qUp) ]]) + start(x) - 1
start(x) <- decode(mcols(x)[[paste0("q_", qLow)]]) + start(x) - 1
ADD COMMENT
0
Entering edit mode
Meng • 0
@b971d31d
Last seen 11 weeks ago
United States

I found another bug that not only consensus cluster results are disordered but also the tag clusters results are disordered.

This is due to that in CumulativeDistributionFunctions.R line 97 deals with each chromosome data in the order or seqlevels(clusters), which is chr1, chr2 ... chr10, chr11, ...

However, when generating tagClusters in ClusteringFunctions.R line 210 seqnames = Rle(factor(clusters$chr)), with default levels, the order of chromosome would become chr1, chr10, chr11 ... chr2 ... after sorting at line 218. This doesn't match the order of results generated in CumulativeDistributionFunctions.R.

To fix this, line 210 of ClusteringFunctions.R:

gr <- GRanges( seqnames = Rle(factor(clusters$chr))

should be changed to:

gr <- GRanges( seqnames = Rle(factor(clusters$chr, levels = unique(clusters$chr)))
ADD COMMENT

Login before adding your answer.

Traffic: 332 users visited in the last hour
Help About
FAQ
Access RSS
API
Stats

Use of this site constitutes acceptance of our User Agreement and Privacy Policy.

Powered by the version 2.3.6