Question

Speeding up or replacing Loops to process Bam Reads

0

Entering edit mode

gen ▴ 30

@gen-7383

Last seen 10.8 years ago

United States

I was using a couple while loops to process bam read data and I was wondering if it was the wrong way to do things

In the first instance I'm looping through a GRanges object and taking each range and determining the start and end of it.

GRange Object

GRangesList object of length 1:
$uc001aaa.3 
GRanges object with 3 ranges and 2 metadata columns:
      seqnames         ranges strand |   exon_id   exon_name
         <Rle>      <IRanges>  <Rle> | <integer> <character>
  [1]     chr1 [11874, 12227]      + |         1        <NA>
  [2]     chr1 [12613, 12721]      + |         3        <NA>
  [3]     chr1 [13221, 14409]      + |         5        <NA>

Here is the code

while (i <= length(GRangesList))#loop through each range in GRangesList[i]
{
    ## Get start of range in GRangesList[i]
    generangestartvalues<-start(GRangesList[i])
    generangestart<-unlist(generangestartvalues, use.names=FALSE)
   

    ## Get end of range in GRangesList[i]
    generangeendvalues<-end(GRangesList[i])#take the position value of the end of the range
    generangeend<-unlist(generangeendvalues, use.names=FALSE)

    i++ #increment loop
}

In the second instance I have a GRangesList and I am going through them one by one removing or shifting the entries which overlap the boundries of the list's range.

Here is a part of the GRangesList

GRangesList object of length 3374:
[[1]]
GRanges object with 1 range and 0 metadata columns:
      seqnames         ranges strand
         <Rle>      <IRanges>  <Rle>
  [1]     chr1 [10587, 10620]      -

[[2]]
GRanges object with 1 range and 0 metadata columns:
      seqnames         ranges strand
  [1]     chr1 [11392, 11425]      -

And here is a simplified overview of what the loop does.

#GRangeList=List of Reads
while (i <= length(GRangeList))
{

#Detect through if statements if a read overlaps beginning or end of GRangeList


if (Overlap==TRUE)
{

#if detected, removes reads by subsetting it away
GRangeList<-c(GRangeList[1:(i-1)],GRangeList[(i+1):SizeOfGRangeList])

#or shifts it completely into GRangeList range
GRangeList<-shift(GRangeList[i],marginOverhang)
}

i++
}

Is this the best way to go about doing these things? I was told things like lapply are better than loops but do not really see how they could be easily implemented in this situation since I'd have to break off the code into separate functions and figure out a way to input my data as a matrix or dataframe.

genomicranges reads genomicfeatures Bam • 2.4k views

ADD COMMENT • link 10.8 years ago gen ▴ 30

score 0 · Answer 1 · 2015-03-23

0

Entering edit mode

Martin Morgan 25k

@martin-morgan-1513

Last seen 11 months ago

United States

Generally using loops is not the right thing to do; instead use vectorized operations. Your code is not complete enough to know what to suggest; can you spend a few minutes updating it so that the structure of each object, e.g., eByg, seqinputbegin, GRanges, is clear, and that the formatting and errors are corrected, e.g., geneRangeEnd refers to generangestartvalues2, but should I guess refer to the end values? At first pass I guess you're maybe looking for

j <- 1:length(GRanges)
starts <- unlist(start(eByg)[seqinputbegin][j], use.names=FALSE)

ADD COMMENT • link 10.8 years ago Martin Morgan 25k

0

Entering edit mode

Hi, thanks for the heads up I edited the code to hopefully make it more sensible. If you want I can make additional edits or post some more code for any clarification you might need.

Thanks

ADD REPLY • link 10.8 years ago gen ▴ 30

0

Entering edit mode

Just to clarify: it's a common misconception that "lapply are better than loops". An lapply is a loop. A more elegant form of loop than a for, while, or repeat loop, but nevertheless a loop.

Furthermore, and at the risk to add confusion, most vectorized operations are also performing loops internally (e.g. 1:100 + 8), but these loops are often implemented in a very efficient way (typically in native code e.g. C, C++, or Fortran). So loops per se are not the problem. Inefficient loops are the problem. And the R versions of for, while, and repeat loops + lapply and family are generally (but not always) inefficient ways of looping.

I said not always because lapply(1:1000000, identity) (or the equivalent form using for) is actually very fast (takes less than 1 sec on my machine). But there are 2 things that are notorious for slowing down a loop a lot:

Growing an object at each iteration. This is what you're doing with the generangestart and geneRangeEnd vectors. The reasons why this is bad are well known and have been discussed many times in many places. Note that it's not an R issue per se. Copying and growing a vector at each iteration is fundamentally inefficient, whatever the language is.
The code inside the loop needs to create or extract an S4 object at each iteration. This slows down the loop a lot. That's because S4 instanciation itself is unfortunately very expensive (it has a huge overhead). This is a well known issue of the S4 class system itself.

It seems to me that your while loop does both (1. and 2.). That's probably why it's so slow (of course, 1. and 2. are not issues if the number of iteration is small e.g. < 100).

H.

ADD REPLY • link 10.8 years ago Hervé Pagès 16k

0

Entering edit mode

I still have trouble with your code. You say that you have a "GRange Object", but you show a GRangesList. When you unlist() the start coordinates, you then index the ith element. This doesn't make sense if you originally had a GRangesList. Please clarify your code some more.

ADD REPLY • link 10.8 years ago Martin Morgan 25k

0

Entering edit mode

Hi, yes GRanges is technically an element of a GRangesList and part of a nested loop but I thought it would look more confusing. Its been changed to reflect that I realized that I probably don't need the original inner loop and can just use the outer one. So its a loop incrementing through a GRangesList rather than a nested loop like it was originally.

ADD REPLY • link 10.8 years ago gen ▴ 30

0

Entering edit mode

Here's a specific example, exons grouped by transcript

library(TxDb.Hsapiens.UCSC.hg19.knownGene)
exByTx = exonsBy(TxDb.Hsapiens.UCSC.hg19.knownGene, "tx", use.names=TRUE)

I'm not really sure what you're after; maybe the minimum and maximum coordinate of each transcript.

rng <- range(exByTx)

The result of range() is a GRangesList, to allow for the possibility of trans-splicing. But there is none of that in the current annotation (since each range is represented by a single element)

table(elementLengths(rng))

so we have

> unlist(rng)
GRanges object with 82960 ranges and 0 metadata columns:
                   seqnames           ranges strand
                      <Rle>        <IRanges>  <Rle>
  uc001aaa.3           chr1 [ 11874,  14409]      +
  uc010nxq.1           chr1 [ 11874,  14409]      +
  uc010nxr.1           chr1 [ 11874,  14409]      +
  uc001aal.1           chr1 [ 69091,  70008]      +
  uc001aaq.2           chr1 [321084, 321115]      +
         ...            ...              ...    ...
  uc011mgu.1 chrUn_gl000237   [    1,  2686]      -
  uc011mgv.2 chrUn_gl000241   [20433, 36875]      -
  uc011mgw.1 chrUn_gl000243   [11501, 11530]      +
  uc022brq.1 chrUn_gl000243   [13608, 13637]      +
  uc022brr.1 chrUn_gl000247   [ 5787,  5816]      -
  -------
  seqinfo: 93 sequences (1 circular) from hg19 genome

Maybe you're looking for the start and end coordinate of each exon. Then you might be interested simply in the start()s (as an IntegerList, i.e., a list where each element of the list is an integer vector, with length corresponding to number of exons in the transcript) or the unlisted version

> start(exByTx)
IntegerList of length 82960
[["uc001aaa.3"]] 11874 12613 13221
[["uc010nxq.1"]] 11874 12595 13403
[["uc010nxr.1"]] 11874 12646 13221
[["uc001aal.1"]] 69091
[["uc001aaq.2"]] 321084
[["uc001aar.2"]] 321146
[["uc009vjk.2"]] 322037 324288 324439
[["uc001aau.3"]] 323892 324288 324439
[["uc021oeh.1"]] 324288 324439 324719 325383
[["uc021oei.1"]] 327546
...
<82950 more elements>
> head(unlist(start(exByTx)))
uc001aaa.3 uc001aaa.3 uc001aaa.3 uc010nxq.1 uc010nxq.1 uc010nxq.1 
     11874      12613      13221      11874      12595      13403

These operations are fast.

For your second question, suppose we had two 'reads'

gr <- GRanges("chr1", IRanges(c(11873, 11875), width=5), "+")

and we wanted to test whether they were entirely within the ranges defined by exByTx

> gr %within% exByTx
[1] FALSE  TRUE

to keep just those within the gene model we could do

> gr[gr %within% exByTx]
GRanges object with 1 range and 0 metadata columns:
      seqnames         ranges strand
         <Rle>      <IRanges>  <Rle>
  [1]     chr1 [11875, 11879]      +
  -------
  seqinfo: 1 sequence from an unspecified genome; no seqlengths

A looser criterion might by %over%, or more generally the subsetByOverlaps() function; you could do this on millions of reads at a time, without a loop. Trimming is a bit harder, but you could reduce your GRangesList to non-overlapping regions

regions <- reduce(unlist(exByTx))

find which of your reads overlap these regions

olaps <- findOverlaps(gr, regions)

and finally calculate the 'parallel' intersection between the reads and the regions that they overlap

> p = pintersect(gr[queryHits(olaps)], regions[subjectHits(olaps)])
> p
GRanges object with 2 ranges and 0 metadata columns:
      seqnames         ranges strand
         <Rle>      <IRanges>  <Rle>
  [1]     chr1 [11874, 11877]      +
  [2]     chr1 [11875, 11879]      +
  -------
  seqinfo: 93 sequences (1 circular) from hg19 genome

Some further processing (e.g., splitAsList(p, queryHits(olaps))) might be necessary if your reads overlap more than one range.

Hope that helps, and sorry if I have still misunderstood your questions.

ADD REPLY • link 10.8 years ago Martin Morgan 25k

0

Entering edit mode

Those are some good ideas thanks

ADD REPLY • link 10.8 years ago gen ▴ 30