Apologies if this has been asked before, I've found several similar questions but nothing quite what I want...
I am analysing some data from a 3'UTR-RNA-seq protocol. In a 150bp NextSeq read, what often happens is that you get something like:
12bpAdapter---100bpRNA--15bpPolyA--20bpAdapter
The 12bpAdapter is fixed, but the size of the Poly and 3' Adapter is not fixed.
Having got my data as a ShortReadQ object, I can then use narrow(fq, start=13)
to get rid of the 12bp adapter, but I'm not sure how to get rid of the PolyA onwards. Intuitively I want to do something like grep('AAAAAAAA', string)
to get the position of the start of the polyA and then use narrow(fq, end=pos)
again where necessary.
The trimTailw function is nice for filtering on quality scores, a windowing alphabetScore
function would be another way of doing it. There is also trimLRPatterns
but this will only find a pattern at the end of the sequence, not in the middle (I think). Finally, using vmatchPattern('AAAAAAAA',sread(fq))
I can sort of work out the position of the polyA but this seems a bit clumsy.
Any guidance to point me in the right direction would be appreciated - I'm sure someone must have done this before!. Would rather do this in R as I'm going to use Rsubread for the alignment step.
Many thanks,
Phil
Thanks Martin. I tried your method but although it worked fine on the example when I applied it to larger datasets (1M reads) it was slowed right down.
What I ended up with was the below converting the reads to characters and then just doing a regexpr:
match_pos
is a vector the same length asrfq
with -1 where no match was found. So replacing -1 with the width of the read yields a vector which can be fed intonarrow
.Oddly whilst the regexpr part was slower than
vMatchPattern
, it was running start on the ByPos_MIndex object that really snarled things up even on small subsets of my data, not sure what was going on here but I'll post back if I figure it out - might be to do with the fact that searching for poly-A's means I get many hits and regexp just returns the first one whilst vMatchPattern returns all of them.Thanks again for the help
Hi Martin,
Just an example with timings to show what happens when the ByPos_MIndex start method gets called. Again the start point is example(trimEnds). I can see that the start method for a ByPos_MIndex object is derived from that from a RangesList object which in turn uses an lapply so presumably this is the slow bit?
Yes, I guess the MIndex objects are not using the latest tricks. This
gets us the list-of-starts quickly, or more directly for the purpose at hand
Also,
Probably there is an underlying reason why the return value of VRangesList is not compressed by default...
I would have thought that working with
vmatchPattern()
would be more memory efficient thanregexpr()
(not duplicating the read data as character vectors), but I haven't had too much luck in the speculation department on this thread.Since regexpr is vectorized I think you could
regexpr("AATC", sread(rfq))
and this would be much faster than using sapply? I think also using thefixed=TRUE
argument to regexpr will speed things up. Sorry that vmatchPattern() turns out to be not as speedy in this case.Here are some timings
Fantastic thanks Martin, I had forgotten about that. I get a x10 speed boost on 1 milion reads
For completeness this is what I get with 1000 reads: