Apologies if this has been asked before, I've found several similar questions but nothing quite what I want...
I am analysing some data from a 3'UTR-RNA-seq protocol. In a 150bp NextSeq read, what often happens is that you get something like:
12bpAdapter---100bpRNA--15bpPolyA--20bpAdapter
The 12bpAdapter is fixed, but the size of the Poly and 3' Adapter is not fixed.
Having got my data as a ShortReadQ object, I can then use narrow(fq, start=13) to get rid of the 12bp adapter, but I'm not sure how to get rid of the PolyA onwards. Intuitively I want to do something like grep('AAAAAAAA', string) to get the position of the start of the polyA and then use narrow(fq, end=pos) again where necessary.
The trimTailw function is nice for filtering on quality scores, a windowing alphabetScore function would be another way of doing it. There is also trimLRPatterns but this will only find a pattern at the end of the sequence, not in the middle (I think). Finally, using vmatchPattern('AAAAAAAA',sread(fq)) I can sort of work out the position of the polyA but this seems a bit clumsy.
Any guidance to point me in the right direction would be appreciated - I'm sure someone must have done this before!. Would rather do this in R as I'm going to use Rsubread for the alignment step.
Many thanks,
Phil

Thanks Martin. I tried your method but although it worked fine on the example when I applied it to larger datasets (1M reads) it was slowed right down.
What I ended up with was the below converting the reads to characters and then just doing a regexpr:
match_posis a vector the same length asrfqwith -1 where no match was found. So replacing -1 with the width of the read yields a vector which can be fed intonarrow.Oddly whilst the regexpr part was slower than
vMatchPattern, it was running start on the ByPos_MIndex object that really snarled things up even on small subsets of my data, not sure what was going on here but I'll post back if I figure it out - might be to do with the fact that searching for poly-A's means I get many hits and regexp just returns the first one whilst vMatchPattern returns all of them.Thanks again for the help
Hi Martin,
Just an example with timings to show what happens when the ByPos_MIndex start method gets called. Again the start point is example(trimEnds). I can see that the start method for a ByPos_MIndex object is derived from that from a RangesList object which in turn uses an lapply so presumably this is the slow bit?
> rfq = rfq[rep(1:256, 100)] > system.time(match_pos <- regexpr("AATC", sread(rfq), fixed=TRUE)) user system elapsed 0.012 0.000 0.012 > system.time(vmp_out <- vmatchPattern("AATC", sread(rfq), fixed=TRUE)) user system elapsed 0.030 0.000 0.031 > system.time(starts <- start(vmp_out)) user system elapsed 10.635 0.027 10.660 > getMethod('start', 'RangesList') Method Definition: function (x, ...) { .local <- function (x) S4Vectors:::new_SimpleList_from_list("SimpleIntegerList", lapply(x, start)) .local(x, ...) } <environment: namespace:IRanges> Signatures: x target "RangesList" defined "RangesList" > system.time(lapply(vmp_out, start)) user system elapsed 10.347 0.028 10.374 > class(vmp_out[[8]]) [1] "IRanges" attr(,"package") [1] "IRanges" > system.time(start(vmp_out[[8]])) user system elapsed 0.000 0.000 0.001 > sessionInfo() R version 3.2.2 (2015-08-14) Platform: x86_64-apple-darwin13.4.0 (64-bit) Running under: OS X 10.9.5 (Mavericks) locale: [1] en_GB.UTF-8/en_GB.UTF-8/en_GB.UTF-8/C/en_GB.UTF-8/en_GB.UTF-8 attached base packages: [1] stats4 parallel stats graphics grDevices utils datasets methods base other attached packages: [1] ShortRead_1.28.0 GenomicAlignments_1.6.3 SummarizedExperiment_1.0.2 Biobase_2.30.0 [5] Rsamtools_1.22.0 GenomicRanges_1.22.3 GenomeInfoDb_1.6.1 Biostrings_2.38.3 [9] XVector_0.10.0 IRanges_2.4.6 S4Vectors_0.8.7 BiocParallel_1.4.3 [13] BiocGenerics_0.16.1 loaded via a namespace (and not attached): [1] lattice_0.20-33 bitops_1.0-6 grid_3.2.2 futile.options_1.0.0 zlibbioc_1.16.0 [6] hwriter_1.3.2 latticeExtra_0.6-26 futile.logger_1.4.1 RColorBrewer_1.1-2 lambda.r_1.1.7 [11] tools_3.2.2Yes, I guess the MIndex objects are not using the latest tricks. This
gets us the list-of-starts quickly, or more directly for the purpose at hand
> system.time({ + idx = !duplicated(rep(seq_along(vmp_out), lengths(vmp_out))) + start(unlist(vmp_out))[idx] + }) user system elapsed 0.006 0.000 0.006Also,
Probably there is an underlying reason why the return value of VRangesList is not compressed by default...
I would have thought that working with
vmatchPattern()would be more memory efficient thanregexpr()(not duplicating the read data as character vectors), but I haven't had too much luck in the speculation department on this thread.Since regexpr is vectorized I think you could
regexpr("AATC", sread(rfq))and this would be much faster than using sapply? I think also using thefixed=TRUEargument to regexpr will speed things up. Sorry that vmatchPattern() turns out to be not as speedy in this case.Here are some timings
> rfq = rfq[rep(1:256, 4096)] > system.time(regexpr("AATC", sread(rfq), fixed=TRUE)) user system elapsed 0.817 0.000 0.817 > system.time(regexpr("AATC", sread(rfq), fixed=FALSE)) user system elapsed 1.848 0.000 1.847 > system.time(vmatchPattern("AATC", sread(rfq), fixed=TRUE)) user system elapsed 4.830 0.008 4.834 > system.time(sapply(as.character(sread(rfq)), regexpr, pattern="AATC", fixed=TRUE)) user system elapsed 7.735 0.044 7.778Fantastic thanks Martin, I had forgotten about that. I get a x10 speed boost on 1 milion reads
> system.time(polyApos <- sapply(as.character(sread(fq3)), regexpr, pattern= paste0(rep('A',10), collapse=''), USE.NAMES = FALSE) %>% unlist()) user system elapsed 11.738 0.042 11.778 > system.time(polyApos <- regexpr(pattern= paste0(rep('A',10), collapse=''), sread(fq3), fixed=TRUE) %>% unlist()) user system elapsed 0.986 0.004 0.991For completeness this is what I get with 1000 reads:
system.time(polyApos_vmp <- vmatchPattern(paste0(rep('A',10), collapse=''),sread(fq4), fixed=TRUE)) user system elapsed 0.02 0.00 0.02 system.time(starts <- start(polyApos_vmp)) ### SLOW!!! user system elapsed 3.845 0.030 3.874 system.time(polyApos <- regexpr(pattern= paste0(rep('A',10), collapse=''), sread(fq3), fixed=TRUE) %>% unlist()) user system elapsed 0.008 0.000 0.008 > polyApos_vmp MIndex object of length 9492 [[1]] IRanges of length 0 [[2]] IRanges of length 10 start end width [1] 93 102 10 [2] 94 103 10 [3] 95 104 10 [4] 96 105 10 [5] 97 106 10 [6] 98 107 10 [7] 99 108 10 [8] 100 109 10 [9] 101 110 10 [10] 102 111 10 [[3]] IRanges of length 0 ... <9489 more elements> > starts IntegerList of length 9492 [[1]] integer(0) [[2]] 93 94 95 96 97 98 99 100 101 102 [[3]] integer(0) [[4]] integer(0) [[5]] 106 107 108 109 110 111 112 113 [[6]] integer(0) [[7]] integer(0) [[8]] integer(0) [[9]] integer(0) [[10]] 60 61 62 63 64 65 66 67 ... <9482 more elements>