Using ShortRead (or other) to trim from pattern match onwards
Entering edit mode
chapmandu2 ▴ 20
Last seen 4.2 years ago

Apologies if this has been asked before, I've found several similar questions but nothing quite what I want...

I am analysing some data from a 3'UTR-RNA-seq protocol.  In a 150bp NextSeq read, what often happens is that you get something like:


The 12bpAdapter is fixed, but the size of the Poly and 3' Adapter is not fixed.

Having got my data as a ShortReadQ object, I can then use narrow(fq, start=13) to get rid of the 12bp adapter, but I'm not sure how to get rid of the PolyA onwards.  Intuitively I want to do something like grep('AAAAAAAA', string) to get the position of the start of the polyA and then use narrow(fq, end=pos) again where necessary. 

The trimTailw function is nice for filtering on quality scores, a windowing alphabetScore function would be another way of doing it.  There is also trimLRPatterns but this will only find a pattern at the end of the sequence, not in the middle (I think).  Finally, using vmatchPattern('AAAAAAAA',sread(fq)) I can sort of work out the position of the polyA but this seems a bit clumsy.

Any guidance to point me in the right direction would be appreciated - I'm sure someone must have done this before!.  Would rather do this in R as I'm going to use Rsubread for the alignment step.

Many thanks,





shortread • 1.5k views
Entering edit mode
Last seen 13 days ago
United States

I'm not sure that there's anything magical. After example(trimEnds) I did

starts = start(vmatchPattern("AATC", sread(rfq), fixed=TRUE))
idx = !duplicated(rep(seq_along(starts), lengths(starts)))
end[lengths(starts) != 0] = unlist(start(match))[idx]
narrow(rfq, end=end)

which chose to trim from the first match; fromLast=TRUE in duplicated() would trim from the last match.

Entering edit mode

Thanks Martin.  I tried your method but although it worked fine on the example when I applied it to larger datasets (1M reads) it was slowed right down.

What I ended up with was the below converting the reads to characters and then just doing a regexpr:

match_pos <- unlist(sapply(as.character(sread(rfq)), regexpr, pattern= 'AATC', USE.NAMES = FALSE))
match_pos_idx <- which(match_pos >= 0)
match_clip <- width(rfq)
match_clip[match_pos_idx] <- match_pos[match_pos_idx]
rfq <- narrow(rfq, end=match_clip)

match_pos is a vector the same length as rfq with -1 where no match was found.  So replacing -1 with the width of the read yields a vector which can be fed into narrow.

Oddly whilst the regexpr part was slower than vMatchPattern, it was running start on the ByPos_MIndex object that really snarled things up even on small subsets of my data, not sure what was going on here but I'll post back if I figure it out - might be to do with the fact that searching for poly-A's means I get many hits and regexp just returns the first one whilst vMatchPattern returns all of them.

Thanks again for the help



Entering edit mode

Hi Martin,

Just an example with timings to show what happens when the ByPos_MIndex start method gets called.  Again the start point is example(trimEnds).  I can see that the start method for a ByPos_MIndex object is derived from that from a RangesList object which in turn uses an lapply so presumably this is the slow bit?

> rfq = rfq[rep(1:256, 100)]
> system.time(match_pos <- regexpr("AATC", sread(rfq), fixed=TRUE))
   user  system elapsed
  0.012   0.000   0.012
> system.time(vmp_out <- vmatchPattern("AATC", sread(rfq), fixed=TRUE))
   user  system elapsed
  0.030   0.000   0.031
> system.time(starts <- start(vmp_out))
   user  system elapsed
 10.635   0.027  10.660
> getMethod('start', 'RangesList')
Method Definition:

function (x, ...)
    .local <- function (x)
        lapply(x, start))
    .local(x, ...)
<environment: namespace:IRanges>

target  "RangesList"
defined "RangesList"
> system.time(lapply(vmp_out, start))
   user  system elapsed
 10.347   0.028  10.374
> class(vmp_out[[8]])
[1] "IRanges"
[1] "IRanges"
> system.time(start(vmp_out[[8]]))
   user  system elapsed
  0.000   0.000   0.001
> sessionInfo()
R version 3.2.2 (2015-08-14)
Platform: x86_64-apple-darwin13.4.0 (64-bit)
Running under: OS X 10.9.5 (Mavericks)

[1] en_GB.UTF-8/en_GB.UTF-8/en_GB.UTF-8/C/en_GB.UTF-8/en_GB.UTF-8

attached base packages:
[1] stats4    parallel  stats     graphics  grDevices utils     datasets  methods   base     

other attached packages:
 [1] ShortRead_1.28.0           GenomicAlignments_1.6.3    SummarizedExperiment_1.0.2 Biobase_2.30.0            
 [5] Rsamtools_1.22.0           GenomicRanges_1.22.3       GenomeInfoDb_1.6.1         Biostrings_2.38.3         
 [9] XVector_0.10.0             IRanges_2.4.6              S4Vectors_0.8.7            BiocParallel_1.4.3        
[13] BiocGenerics_0.16.1       

loaded via a namespace (and not attached):
 [1] lattice_0.20-33      bitops_1.0-6         grid_3.2.2           futile.options_1.0.0 zlibbioc_1.16.0     
 [6] hwriter_1.3.2        latticeExtra_0.6-26  futile.logger_1.4.1  RColorBrewer_1.1-2   lambda.r_1.1.7      
[11] tools_3.2.2
Entering edit mode

Yes, I guess the MIndex objects are not using the latest tricks. This

> system.time(s1 <- relist(start(unlist(vmp_out)), vmp_out))
   user  system elapsed 
  0.009   0.000   0.008 

gets us the list-of-starts quickly, or more directly for the purpose at hand

> system.time({ 
+     idx = !duplicated(rep(seq_along(vmp_out), lengths(vmp_out)))
+     start(unlist(vmp_out))[idx]
+ })
   user  system elapsed 
  0.006   0.000   0.006 


> system.time(start(as(vmp_out, "CompressedIRangesList")))
   user  system elapsed 
  0.009   0.000   0.010 

Probably there is an underlying reason why the return value of VRangesList is not compressed by default...

I would have thought that working with vmatchPattern() would be more memory efficient than regexpr() (not duplicating the read data as character vectors), but I haven't had too much luck in the speculation department on this thread.

Entering edit mode

Since regexpr is vectorized I think you could regexpr("AATC", sread(rfq)) and this would be much faster than using sapply? I think also using the fixed=TRUE argument to regexpr will speed things up. Sorry that vmatchPattern() turns out to be not as speedy in this case.

Here are some timings

> rfq = rfq[rep(1:256, 4096)]
> system.time(regexpr("AATC", sread(rfq), fixed=TRUE))
   user  system elapsed 
  0.817   0.000   0.817 
> system.time(regexpr("AATC", sread(rfq), fixed=FALSE))
   user  system elapsed 
  1.848   0.000   1.847 
> system.time(vmatchPattern("AATC", sread(rfq), fixed=TRUE))   user  system elapsed 
  4.830   0.008   4.834
> system.time(sapply(as.character(sread(rfq)), regexpr, pattern="AATC", fixed=TRUE))
   user  system elapsed 
  7.735   0.044   7.778 


Entering edit mode

Fantastic thanks Martin, I had forgotten about that.  I get a x10 speed boost on 1 milion reads

> system.time(polyApos <- sapply(as.character(sread(fq3)), regexpr, pattern= paste0(rep('A',10), collapse=''), USE.NAMES = FALSE) %>% unlist())
   user  system elapsed
 11.738   0.042  11.778
> system.time(polyApos <- regexpr(pattern= paste0(rep('A',10), collapse=''), sread(fq3), fixed=TRUE) %>% unlist())
   user  system elapsed
  0.986   0.004   0.991
Entering edit mode

For completeness this is what I get with 1000 reads:

system.time(polyApos_vmp <- vmatchPattern(paste0(rep('A',10), collapse=''),sread(fq4), fixed=TRUE))
   user  system elapsed
   0.02    0.00    0.02

system.time(starts <- start(polyApos_vmp))  ### SLOW!!!
   user  system elapsed
  3.845   0.030   3.874

system.time(polyApos <- regexpr(pattern= paste0(rep('A',10), collapse=''), sread(fq3), fixed=TRUE) %>% unlist())
   user  system elapsed
  0.008   0.000   0.008

> polyApos_vmp
MIndex object of length 9492
IRanges of length 0

IRanges of length 10
     start end width
[1]     93 102    10
[2]     94 103    10
[3]     95 104    10
[4]     96 105    10
[5]     97 106    10
[6]     98 107    10
[7]     99 108    10
[8]    100 109    10
[9]    101 110    10
[10]   102 111    10

IRanges of length 0

<9489 more elements>

> starts
IntegerList of length 9492
[[1]] integer(0)
[[2]] 93 94 95 96 97 98 99 100 101 102
[[3]] integer(0)
[[4]] integer(0)
[[5]] 106 107 108 109 110 111 112 113
[[6]] integer(0)
[[7]] integer(0)
[[8]] integer(0)
[[9]] integer(0)
[[10]] 60 61 62 63 64 65 66 67
<9482 more elements>

Login before adding your answer.

Traffic: 520 users visited in the last hour
Help About
Access RSS

Use of this site constitutes acceptance of our User Agreement and Privacy Policy.

Powered by the version 2.3.6