I was using a couple while loops to process bam read data and I was wondering if it was the wrong way to do things
In the first instance I'm looping through a GRanges object and taking each range and determining the start and end of it.
GRange Object
GRangesList object of length 1:
$uc001aaa.3
GRanges object with 3 ranges and 2 metadata columns:
seqnames ranges strand | exon_id exon_name
<Rle> <IRanges> <Rle> | <integer> <character>
[1] chr1 [11874, 12227] + | 1 <NA>
[2] chr1 [12613, 12721] + | 3 <NA>
[3] chr1 [13221, 14409] + | 5 <NA>
Here is the code
while (i <= length(GRangesList))#loop through each range in GRangesList[i]
{
## Get start of range in GRangesList[i]
generangestartvalues<-start(GRangesList[i])
generangestart<-unlist(generangestartvalues, use.names=FALSE)
## Get end of range in GRangesList[i]
generangeendvalues<-end(GRangesList[i])#take the position value of the end of the range
generangeend<-unlist(generangeendvalues, use.names=FALSE)
i++ #increment loop
}
In the second instance I have a GRangesList and I am going through them one by one removing or shifting the entries which overlap the boundries of the list's range.
Here is a part of the GRangesList
GRangesList object of length 3374:
[[1]]
GRanges object with 1 range and 0 metadata columns:
seqnames ranges strand
<Rle> <IRanges> <Rle>
[1] chr1 [10587, 10620] -
[[2]]
GRanges object with 1 range and 0 metadata columns:
seqnames ranges strand
[1] chr1 [11392, 11425] -
And here is a simplified overview of what the loop does.
#GRangeList=List of Reads
while (i <= length(GRangeList))
{
#Detect through if statements if a read overlaps beginning or end of GRangeList
if (Overlap==TRUE)
{
#if detected, removes reads by subsetting it away GRangeList<-c(GRangeList[1:(i-1)],GRangeList[(i+1):SizeOfGRangeList]) #or shifts it completely into GRangeList range GRangeList<-shift(GRangeList[i],marginOverhang) } i++ }
Is this the best way to go about doing these things? I was told things like lapply are better than loops but do not really see how they could be easily implemented in this situation since I'd have to break off the code into separate functions and figure out a way to input my data as a matrix or dataframe.

Hi, thanks for the heads up I edited the code to hopefully make it more sensible. If you want I can make additional edits or post some more code for any clarification you might need.
Thanks
Just to clarify: it's a common misconception that "lapply are better than loops". An
lapplyis a loop. A more elegant form of loop than afor,while, orrepeatloop, but nevertheless a loop.Furthermore, and at the risk to add confusion, most vectorized operations are also performing loops internally (e.g. 1:100 + 8), but these loops are often implemented in a very efficient way (typically in native code e.g. C, C++, or Fortran). So loops per se are not the problem. Inefficient loops are the problem. And the R versions of
for,while, andrepeatloops +lapplyand family are generally (but not always) inefficient ways of looping.I said not always because
lapply(1:1000000, identity)(or the equivalent form usingfor) is actually very fast (takes less than 1 sec on my machine). But there are 2 things that are notorious for slowing down a loop a lot:generangestartandgeneRangeEndvectors. The reasons why this is bad are well known and have been discussed many times in many places. Note that it's not an R issue per se. Copying and growing a vector at each iteration is fundamentally inefficient, whatever the language is.It seems to me that your
whileloop does both (1. and 2.). That's probably why it's so slow (of course, 1. and 2. are not issues if the number of iteration is small e.g. < 100).H.
I still have trouble with your code. You say that you have a "GRange Object", but you show a GRangesList. When you unlist() the start coordinates, you then index the ith element. This doesn't make sense if you originally had a GRangesList. Please clarify your code some more.
Hi, yes GRanges is technically an element of a GRangesList and part of a nested loop but I thought it would look more confusing. Its been changed to reflect that I realized that I probably don't need the original inner loop and can just use the outer one. So its a loop incrementing through a GRangesList rather than a nested loop like it was originally.
Here's a specific example, exons grouped by transcript
I'm not really sure what you're after; maybe the minimum and maximum coordinate of each transcript.
The result of range() is a GRangesList, to allow for the possibility of trans-splicing. But there is none of that in the current annotation (since each range is represented by a single element)
table(elementLengths(rng))
so we have
> unlist(rng) GRanges object with 82960 ranges and 0 metadata columns: seqnames ranges strand <Rle> <IRanges> <Rle> uc001aaa.3 chr1 [ 11874, 14409] + uc010nxq.1 chr1 [ 11874, 14409] + uc010nxr.1 chr1 [ 11874, 14409] + uc001aal.1 chr1 [ 69091, 70008] + uc001aaq.2 chr1 [321084, 321115] + ... ... ... ... uc011mgu.1 chrUn_gl000237 [ 1, 2686] - uc011mgv.2 chrUn_gl000241 [20433, 36875] - uc011mgw.1 chrUn_gl000243 [11501, 11530] + uc022brq.1 chrUn_gl000243 [13608, 13637] + uc022brr.1 chrUn_gl000247 [ 5787, 5816] - ------- seqinfo: 93 sequences (1 circular) from hg19 genomeMaybe you're looking for the start and end coordinate of each exon. Then you might be interested simply in the start()s (as an IntegerList, i.e., a list where each element of the list is an integer vector, with length corresponding to number of exons in the transcript) or the unlisted version
> start(exByTx) IntegerList of length 82960 [["uc001aaa.3"]] 11874 12613 13221 [["uc010nxq.1"]] 11874 12595 13403 [["uc010nxr.1"]] 11874 12646 13221 [["uc001aal.1"]] 69091 [["uc001aaq.2"]] 321084 [["uc001aar.2"]] 321146 [["uc009vjk.2"]] 322037 324288 324439 [["uc001aau.3"]] 323892 324288 324439 [["uc021oeh.1"]] 324288 324439 324719 325383 [["uc021oei.1"]] 327546 ... <82950 more elements> > head(unlist(start(exByTx))) uc001aaa.3 uc001aaa.3 uc001aaa.3 uc010nxq.1 uc010nxq.1 uc010nxq.1 11874 12613 13221 11874 12595 13403These operations are fast.
For your second question, suppose we had two 'reads'
gr <- GRanges("chr1", IRanges(c(11873, 11875), width=5), "+")and we wanted to test whether they were entirely within the ranges defined by exByTx
to keep just those within the gene model we could do
> gr[gr %within% exByTx] GRanges object with 1 range and 0 metadata columns: seqnames ranges strand <Rle> <IRanges> <Rle> [1] chr1 [11875, 11879] + ------- seqinfo: 1 sequence from an unspecified genome; no seqlengthsA looser criterion might by
%over%, or more generally thesubsetByOverlaps()function; you could do this on millions of reads at a time, without a loop. Trimming is a bit harder, but you could reduce your GRangesList to non-overlapping regionsfind which of your reads overlap these regions
and finally calculate the 'parallel' intersection between the reads and the regions that they overlap
> p = pintersect(gr[queryHits(olaps)], regions[subjectHits(olaps)]) > p GRanges object with 2 ranges and 0 metadata columns: seqnames ranges strand <Rle> <IRanges> <Rle> [1] chr1 [11874, 11877] + [2] chr1 [11875, 11879] + ------- seqinfo: 93 sequences (1 circular) from hg19 genomeSome further processing (e.g.,
splitAsList(p, queryHits(olaps))) might be necessary if your reads overlap more than one range.Hope that helps, and sorry if I have still misunderstood your questions.
Those are some good ideas thanks