Question

Subsetting operation on CompressedGRangesList object 'x' produces a result that is too big to be represented as a CompressedList object.

0

Entering edit mode

Jiping Wang ▴ 90

@jiping-wang-6687

Last seen 20 months ago

United States

Enter the body of text here

Users of DegNorm bioconductor package encountered an error when loading in very large bam file to compute the read coverage score on transcript. The errors were shown as in the post title. When the bam file size is relatively smaller, such error doesn't happen. I noticed a similar discussion on this was IRanges::findOverlapPairs cannot take inputs if they are too large. But I couldn't figure out an easy fix in my problem. I would greatly appreciate help.

We have RNA-seq data from bam file, single-end or paired-end. DegNorm tries to calculate the read coverage on the total transcript-- concatenation of all exons from a gene. DegNorm follows the HTSEQ to define a valid read to be one that completely sit "within" the exon regions. In DegNorm package, defined a function to calculate the read coverage curve as follows:

.paired_end_cov_by_ch=function(bam,all_genes,ch){
    ##function to calculate the coverage score and return in Rle objects
    bf = BamFile(bam, asMates = TRUE, qnameSuffixStart = ".")
    tx_compat_cvg=RleList()
    genes=all_genes[unlist(runValue(seqnames(all_genes)))==ch]
    gr = as(seqinfo(bf), "GRanges")
    if(ch %in% runValue(seqnames(gr))){
        param = ScanBamParam(which = gr[ch],flag=scanBamFlag(isPaired=TRUE, 
                            isSecondaryAlignment=FALSE))
        gal2=readGAlignmentPairs(bf, param = param)
        gal2=gal2[strand(gal2)!="*"]
        gal2=gal2[is.na(seqnames(gal2))==FALSE]
        grl <- as(gal2, "GRangesList")
        gal3=reduce(grl)
        #only keep genes on the current chromosome
        results=.IntersectionStrict2(genes,gal2)
        tx2reads <- setNames(as(t(results), "List"), names(genes))
        ## calculate read pair
        read_counts=data.frame(count=lengths(tx2reads))
        rownames(read_counts)=names(tx2reads)

        #relist gal3 based on compatibility
        temp0=data.table(group=as(gal3,"data.frame")$group,
                        index=seq_len(sum(lengths(gal3))))
        #group2indexMap=unique(setDT(temp0))[, .(values = index), by =group ] 

        idmatch=function(tx2reads,group2indexMap){
            return(group2indexMap[group2indexMap$group %in% tx2reads]$index)
        }
        #tx2reads1=lapply(tx2reads,idmatch,group2indexMap=group2indexMap)
        tx2reads1=lapply(tx2reads,idmatch,group2indexMap=temp0)
        gal4=unlist(gal3)
        rm(gal3)
        compat_reads_by_tx <- extractList(gal4, tx2reads1)
        tx_compat_cvg <- pcoverageByTranscript(compat_reads_by_tx,
                                            genes,
                                            ignore.strand=TRUE)
    }else{
        tx2reads1=IntegerList(vector("list", length(genes)))
        left_gal=GAlignments() #empty alignment file
        names(tx2reads1)=names(genes)
        read_counts=data.frame(count=lengths(tx2reads1))
        rownames(read_counts)=names(tx2reads1)
        compat_reads_by_tx <- extractList(left_gal, tx2reads1)
        tx_compat_cvg <- pcoverageByTranscript(compat_reads_by_tx,
                                            genes,
                                            ignore.strand=FALSE)
    }
    return(list(tx_compat_cvg,read_counts))
}

I was able to narrow down that the error was from the pcoverageByTranscript function.

        tx_compat_cvg <- pcoverageByTranscript(compat_reads_by_tx,
                                            genes,
                                            ignore.strand=FALSE)

Thanks for help.

DegNorm GenomicAlignments • 1.5k views

ADD COMMENT • link 2.0 years ago • updated 23 months ago Jiping Wang ▴ 90

0

Entering edit mode

Any help on why pcoverageByTranscript generate such errors? thanks.

ADD REPLY • link 23 months ago Jiping Wang ▴ 90

score 0 · Answer 1 · 2022-04-23

0

Entering edit mode

Hervé Pagès 16k

@herve-pages-1542

Last seen 7 hours ago

Seattle, WA, United States

Hi Jiping,

Unlike findOverlapPairs(), findOverlaps() doesn't try to subset the supplied query or subject so I don't think the error comes from this function. It would help if you could provide a reproducible example.

Thanks,

H.

ADD COMMENT • link 2.0 years ago Hervé Pagès 16k

0

Entering edit mode

Thanks for your comment. The error was encountered by a user, who might not be able to provide the bam file to me. I cannot reproduce the error at this moment.

ADD REPLY • link 24 months ago Jiping Wang ▴ 90

0

Entering edit mode

Hi H., I was able to narrow down to the line of codes that generated the error. It was from the pcoverageByTranscript function.

        tx_compat_cvg <- pcoverageByTranscript(compat_reads_by_tx, genes, ignore.strand=TRUE)

Do you have any advice on a fix? thanks.

Jiping

ADD REPLY • link 23 months ago Jiping Wang ▴ 90

0

Entering edit mode

Great! It would be ideal if you could provide a reproducible example. That would help A LOT.

But if, for whatever reason, that's not possible, I would need more details. The .paired_end_cov_by_ch() function above has 2 calls to pcoverageByTranscript(), one at the end of each branch of the big if statement. Which of the 2 calls is causing problems? What are the classes and lengths of compat_reads_by_tx and genes that are passed to pcoverageByTranscript()?

To obtain this information, you could call debug(DegNorm:::.paired_end_cov_by_ch) before running the analysis on your big BAM file, and then execute the code in the .paired_end_cov_by_ch function line by line by pressing <Enter> for each line. Then right before the pcoverageByTranscript line is about to be executed, do the following:

compat_reads_by_tx
class(compat_reads_by_tx)
genes
class(genes)

and show us the output.

Then press <Enter> one more time. This should trigger the error. Please copy/paste that part here too.

Thanks,

H.

ADD REPLY • link 23 months ago Hervé Pagès 16k

0

Entering edit mode

Thanks so much H.! I ran the codes line by line to find the problem. The error was from the first pcoverageByTranscript call ( i.e. from if, not the else branch).

I also ran the four lines of codes and the output are as follows:

> compat_reads_by_tx
GRangesList object of length 1254:
$ENSG00000017797
GRanges object with 13960 ranges and 0 metadata columns:
          seqnames          ranges strand
             <Rle>       <IRanges>  <Rle>
      [1]       18 9475662-9475691      -
      [2]       18 9512991-9513010      -
      [3]       18 9513033-9513082      -
      [4]       18 9475646-9475691      -
      [5]       18 9512991-9512994      -
      ...      ...             ...    ...
  [13956]       18 9538054-9538103      -
  [13957]       18 9537957-9538006      -
  [13958]       18 9538054-9538103      -
  [13959]       18 9537957-9538006      -
  [13960]       18 9538054-9538103      -
  -------
  seqinfo: 194 sequences from an unspecified genome

...
<1253 more elements>
> class(compat_reads_by_tx)
[1] "CompressedGRangesList"
attr(,"package")
[1] "GenomicRanges"
> genes
GRangesList object of length 1254:
$ENSG00000017797
GRanges object with 21 ranges and 2 metadata columns:
       seqnames          ranges strand |   exon_id       exon_name
          <Rle>       <IRanges>  <Rle> | <integer>     <character>
   [1]       18 9475009-9475176      + |    649239 ENSE00001496923
   [2]       18 9475454-9475691      + |    649240 ENSE00001541208
   [3]       18 9475513-9475691      + |    649241 ENSE00003849898
   [4]       18 9475670-9475691      + |    649242 ENSE00002716345
   [5]       18 9476127-9476286      + |    649243 ENSE00002703651
   ...      ...             ...    ... .       ...             ...
  [17]       18 9525720-9525851      + |    649256 ENSE00000666013
  [18]       18 9530834-9530941      + |    649257 ENSE00001187739
  [19]       18 9533335-9533456      + |    649258 ENSE00001142648
  [20]       18 9533703-9533826      + |    649259 ENSE00001018078
  [21]       18 9535671-9538114      + |    649260 ENSE00001541191
  -------
  seqinfo: 47 sequences (1 circular) from an unspecified genome; no seqlengths

...
<1253 more elements>
> class(genes)
[1] "CompressedGRangesList"
attr(,"package")
[1] "GenomicRanges"

I also did the debug of this function .paired_end_cov_by_ch. Here is the output. This confirms that the error comes from the very last line pcoverageByTranscript(compat_reads_by_tx, genes, ignore.strand = TRUE).

Browse[2]> 
debug at #31: tx2reads1 = lapply(tx2reads, idmatch, group2indexMap = temp0)
Browse[2]> 
debug at #32: gal4 = unlist(gal3)
Browse[2]> 
debug at #33: rm(gal3)
Browse[2]> 
debug at #34: compat_reads_by_tx <- extractList(gal4, tx2reads1)
Browse[2]> 
debug at #35: tx_compat_cvg <- pcoverageByTranscript(compat_reads_by_tx, genes, 
    ignore.strand = TRUE)
Browse[2]> 
Error in METHOD(x, i) : 
  Subsetting operation on CompressedGRangesList object 'x' produces a result that is too big
  to be represented as a CompressedList object. Please try to coerce 'x' to a SimpleList
  object first (with 'as(x, "SimpleList")').

Does this help? thanks so much! If not, I could provide R objects (compat_reads_by_tx and genes), but probably not the bam file due to some reasons. There were 21 million reads for this chromosome. I don't know whether that causes the real issue.

ADD REPLY • link 23 months ago Jiping Wang ▴ 90

0

Entering edit mode

Hi Hervé,

Do you have any clues what the bug or problem could be? thanks.

Jiping

ADD REPLY • link 23 months ago Jiping Wang ▴ 90

0

Entering edit mode

Hi Jiping,

Sorry for the slow response and thanks for providing those details. They allowed me to reproduce the problem and to come up with a fix. The fix is in GenomicFeatures 1.48.2 (BioC 3.15) and 1.49.5 (BioC devel). Both should become available via BiocManager::install() in the next couple of days.

Note that pcoverageByTranscript() was designed and intended to compute the coverage on transcripts, not on genes. More precisely, the 2nd argument of the function is expected to be:

  transcripts: A GRangesList object representing the exons of each
          transcript for which to compute coverage. For each
          transcript, the exons must be ordered by _ascending rank_,
          that is, by their position in the transcript.
          ...

You can't have this if the GRangesList object contains exons grouped by gene and the results produced by the function in that case are not meaningful.

Best,

H.

ADD REPLY • link 23 months ago Hervé Pagès 16k

0

Entering edit mode

that's terrific. thanks.

thanks for your comment on pcoverageByTranscript function as well. yes, we understand it's for transcript. that's why we constructed a total transcript by concatenating all exons to define a "total" transcript.

ADD REPLY • link 23 months ago Jiping Wang ▴ 90