I was using a couple while loops to process bam read data and I was wondering if it was the wrong way to do things
In the first instance I'm looping through a GRanges object and taking each range and determining the start and end of it.
GRange Object
GRangesList object of length 1:
$uc001aaa.3
GRanges object with 3 ranges and 2 metadata columns:
seqnames ranges strand | exon_id exon_name
<Rle> <IRanges> <Rle> | <integer> <character>
[1] chr1 [11874, 12227] + | 1 <NA>
[2] chr1 [12613, 12721] + | 3 <NA>
[3] chr1 [13221, 14409] + | 5 <NA>
Here is the code
while (i <= length(GRangesList))#loop through each range in GRangesList[i] { ## Get start of range in GRangesList[i] generangestartvalues<-start(GRangesList[i]) generangestart<-unlist(generangestartvalues, use.names=FALSE) ## Get end of range in GRangesList[i] generangeendvalues<-end(GRangesList[i])#take the position value of the end of the range generangeend<-unlist(generangeendvalues, use.names=FALSE) i++ #increment loop }
In the second instance I have a GRangesList and I am going through them one by one removing or shifting the entries which overlap the boundries of the list's range.
Here is a part of the GRangesList
GRangesList object of length 3374: [[1]] GRanges object with 1 range and 0 metadata columns: seqnames ranges strand <Rle> <IRanges> <Rle> [1] chr1 [10587, 10620] - [[2]] GRanges object with 1 range and 0 metadata columns: seqnames ranges strand [1] chr1 [11392, 11425] -
And here is a simplified overview of what the loop does.
#GRangeList=List of Reads while (i <= length(GRangeList)) { #Detect through if statements if a read overlaps beginning or end of GRangeList if (Overlap==TRUE) {
#if detected, removes reads by subsetting it away GRangeList<-c(GRangeList[1:(i-1)],GRangeList[(i+1):SizeOfGRangeList]) #or shifts it completely into GRangeList range GRangeList<-shift(GRangeList[i],marginOverhang) } i++ }
Is this the best way to go about doing these things? I was told things like lapply are better than loops but do not really see how they could be easily implemented in this situation since I'd have to break off the code into separate functions and figure out a way to input my data as a matrix or dataframe.
Hi, thanks for the heads up I edited the code to hopefully make it more sensible. If you want I can make additional edits or post some more code for any clarification you might need.
Thanks
Just to clarify: it's a common misconception that "lapply are better than loops". An
lapply
is a loop. A more elegant form of loop than afor
,while
, orrepeat
loop, but nevertheless a loop.Furthermore, and at the risk to add confusion, most vectorized operations are also performing loops internally (e.g. 1:100 + 8), but these loops are often implemented in a very efficient way (typically in native code e.g. C, C++, or Fortran). So loops per se are not the problem. Inefficient loops are the problem. And the R versions of
for
,while
, andrepeat
loops +lapply
and family are generally (but not always) inefficient ways of looping.I said not always because
lapply(1:1000000, identity)
(or the equivalent form usingfor
) is actually very fast (takes less than 1 sec on my machine). But there are 2 things that are notorious for slowing down a loop a lot:generangestart
andgeneRangeEnd
vectors. The reasons why this is bad are well known and have been discussed many times in many places. Note that it's not an R issue per se. Copying and growing a vector at each iteration is fundamentally inefficient, whatever the language is.It seems to me that your
while
loop does both (1. and 2.). That's probably why it's so slow (of course, 1. and 2. are not issues if the number of iteration is small e.g. < 100).H.
I still have trouble with your code. You say that you have a "GRange Object", but you show a GRangesList. When you unlist() the start coordinates, you then index the ith element. This doesn't make sense if you originally had a GRangesList. Please clarify your code some more.
Hi, yes GRanges is technically an element of a GRangesList and part of a nested loop but I thought it would look more confusing. Its been changed to reflect that I realized that I probably don't need the original inner loop and can just use the outer one. So its a loop incrementing through a GRangesList rather than a nested loop like it was originally.
Here's a specific example, exons grouped by transcript
I'm not really sure what you're after; maybe the minimum and maximum coordinate of each transcript.
The result of range() is a GRangesList, to allow for the possibility of trans-splicing. But there is none of that in the current annotation (since each range is represented by a single element)
table(elementLengths(rng))
so we have
Maybe you're looking for the start and end coordinate of each exon. Then you might be interested simply in the start()s (as an IntegerList, i.e., a list where each element of the list is an integer vector, with length corresponding to number of exons in the transcript) or the unlisted version
These operations are fast.
For your second question, suppose we had two 'reads'
and we wanted to test whether they were entirely within the ranges defined by exByTx
to keep just those within the gene model we could do
A looser criterion might by
%over%
, or more generally thesubsetByOverlaps()
function; you could do this on millions of reads at a time, without a loop. Trimming is a bit harder, but you could reduce your GRangesList to non-overlapping regionsfind which of your reads overlap these regions
and finally calculate the 'parallel' intersection between the reads and the regions that they overlap
Some further processing (e.g.,
splitAsList(p, queryHits(olaps))
) might be necessary if your reads overlap more than one range.Hope that helps, and sorry if I have still misunderstood your questions.
Those are some good ideas thanks