Question

CAGEr bug report: wrong consensus cluster results

1

Entering edit mode

Meng ▴ 30

@b971d31d

Last seen 9 months ago

United States

I found a bug in the CAGEr package when dealing with consensus clusters: the assays and rowRanges are disordered for consensusClusters.

This is caused by the code in CumulativeDistributionFunctions.R from line 93 to 95 and line 99:

setMethod(".getCumsum", c("CTSS", "GRanges"), function(ctss, clusters, useMulticore , nrCores) {
  getCumSumChrStrand <- function(chrom) {
    plus.cumsum  <- .getCumsumChr2(clusters = clusters, ctss = ctss, chrom = chrom, str = "+")
    minus.cumsum <- .getCumsumChr2(clusters = clusters, ctss = ctss, chrom = chrom, str = "-")
    c(plus.cumsum, minus.cumsum)
  }
  clusters.cumsum <- unlist(bplapply( seqlevels(clusters), getCumSumChrStrand
                                    , BPPARAM = CAGEr_Multicore(useMulticore, nrCores)))
    names(clusters.cumsum) <- names(clusters)
    clusters.cumsum
})

This deals with '+' strand data first then '-' and concat the two results for the same chromosome together and assigned the names at line 99. These code assume the order of genome ranges are sorted by strand with all '+' ranges before '-' ranges for the same chromosome. Thus, if the ranges are mixed with disordered '+' and '-', the names(clusters.cumsum) will not correspond to names(clusters) and would generated disordered results.

Here is the situation when calling consensus clusters in the AggregationMethods.R line 163, the rs would be ordered by the as.factor(names), which would reorder the results by alphabet and disrupt the original order that all '+' ranges come before '-' ranges for the same chromosome. Thus, in line 172, the consensus.clusters[rownames(counts)] recorded the consensus clusters by alphabet. This leads to the conflict with the underlying assumption in the above CumulativeDistributionFunctions.R function and cause the bug.

setMethod( ".CCtoSE"
         , c(se = "RangedSummarizedExperiment")
         , function(se, consensus.clusters, tpmThreshold = 1) {
    if (is.null(assays(se)[["normalizedTpmMatrix"]]))
      stop("Needs normalised data; run ", sQuote("normalizeTagCount()"), " first.")
    if (is.null(rowRanges(se)$cluster))
      rowRanges(se)$cluster <- ranges2names(rowRanges(se), consensus.clusters)

    if (tpmThreshold > 0)
      se <- se[rowSums(DelayedArray(assays(se)[["normalizedTpmMatrix"]])) > tpmThreshold,]

    .rowsumAsMatrix <- function(DF, names) {
      rs <- rowsum(as.matrix(DelayedArray(DF)), as.factor(names))
      if (rownames(rs)[1] == "") # If some CTSS were not in clusters
        rs <- rs[-1, , drop = FALSE]
      rs
      }

    counts <- .rowsumAsMatrix(assays(se)[["counts"]], rowRanges(se)$cluster)
    norm   <- .rowsumAsMatrix(assays(se)[["normalizedTpmMatrix"]], rowRanges(se)$cluster)

      SummarizedExperiment( rowRanges = consensus.clusters[rownames(counts)]
                          , assays    = SimpleList( counts     = counts
                                                  , normalized = norm))
})

To solve this, reorder = FALSE should be added in rowsum at line 163. Thus, the code should change from

rs <- rowsum(as.matrix(DelayedArray(DF)), as.factor(names))

to

rs <- rowsum(as.matrix(DelayedArray(DF)), as.factor(names), reorder = FALSE)

CAGEr • 1.2k views

ADD COMMENT • link updated 9 months ago by Charles Plessy ▴ 180 • written 16 months ago by Meng ▴ 30

0

Entering edit mode

Thank you very much, this is very useful! I regret that your post did not catch my attention earlier!

ADD REPLY • link 9 months ago Charles Plessy ▴ 180

score 2 · Accepted Answer · 2022-12-03

2

Entering edit mode

Meng ▴ 30

@b971d31d

Last seen 9 months ago

United States

Another bug in the start and end coordinations of consensus clusters. For AggregationMethods.R line 132 and 133

end(x)   <- mcols(x)[[paste0("q_", qUp) ]] + start(x)
start(x) <- mcols(x)[[paste0("q_", qLow)]] + start(x)

should be

end(x)   <- decode(mcols(x)[[paste0("q_", qUp) ]]) + start(x) - 1
start(x) <- decode(mcols(x)[[paste0("q_", qLow)]]) + start(x) - 1

ADD COMMENT • link 16 months ago Meng ▴ 30

0

Entering edit mode

Thanks again! I found the bug by chance last week. In my fix I did not decode but I had to specify integer type for the addition (+ 1L).

ADD REPLY • link 9 months ago Charles Plessy ▴ 180

score 2 · Accepted Answer · 2022-12-04

2

Entering edit mode

Meng ▴ 30

@b971d31d

Last seen 9 months ago

United States

I found another bug that not only consensus cluster results are disordered but also the tag clusters results are disordered.

This is due to that in CumulativeDistributionFunctions.R line 97 deals with each chromosome data in the order or seqlevels(clusters), which is chr1, chr2 ... chr10, chr11, ...

However, when generating tagClusters in ClusteringFunctions.R line 210 seqnames = Rle(factor(clusters$chr)), with default levels, the order of chromosome would become chr1, chr10, chr11 ... chr2 ... after sorting at line 218. This doesn't match the order of results generated in CumulativeDistributionFunctions.R.

To fix this, line 210 of ClusteringFunctions.R:

gr <- GRanges( seqnames = Rle(factor(clusters$chr))

should be changed to:

gr <- GRanges( seqnames = Rle(factor(clusters$chr, levels = unique(clusters$chr)))

ADD COMMENT • link 16 months ago Meng ▴ 30

0

Entering edit mode

Thanks! I think that I will fix this one using seqinfo= to inherit the seqlevel order from the BSgenome object or the CTSS genomic ranges..

ADD REPLY • link 9 months ago Charles Plessy ▴ 180

0

Entering edit mode

Thanks, my issue has been fixed.

ADD REPLY • link 9 months ago amandajansenofficial • 0

0

Entering edit mode

Cool, can I ask you to mark it as such by clicking the accept button of my last answer about updates? This will remove the issue from the list of unsolved ones.

ADD REPLY • link 9 months ago Charles Plessy ▴ 180

score 1 · Accepted Answer · 2023-07-06

1

Entering edit mode

Charles Plessy ▴ 180

@charles-plessy-7857

Last seen 6 months ago

Japan

Thank you again for your report. I beleive that I have corrected the bugs in the following updates of Bioconductor's release and devel branches:

https://bioconductor.org/packages/3.17/bioc/html/CAGEr.html (CAGEr version 2.6.1)
https://bioconductor.org/packages/3.18/bioc/html/CAGEr.html (CAGEr version 2.7.1)

ADD COMMENT • link 9 months ago Charles Plessy ▴ 180