Question

Using ShortRead (or other) to trim from pattern match onwards

1

Entering edit mode

chapmandu2 ▴ 20

@chapmandu2-10053

Last seen 5.8 years ago

Apologies if this has been asked before, I've found several similar questions but nothing quite what I want...

I am analysing some data from a 3'UTR-RNA-seq protocol. In a 150bp NextSeq read, what often happens is that you get something like:

12bpAdapter---100bpRNA--15bpPolyA--20bpAdapter

The 12bpAdapter is fixed, but the size of the Poly and 3' Adapter is not fixed.

Having got my data as a ShortReadQ object, I can then use narrow(fq, start=13) to get rid of the 12bp adapter, but I'm not sure how to get rid of the PolyA onwards. Intuitively I want to do something like grep('AAAAAAAA', string) to get the position of the start of the polyA and then use narrow(fq, end=pos) again where necessary.

The trimTailw function is nice for filtering on quality scores, a windowing alphabetScore function would be another way of doing it. There is also trimLRPatterns but this will only find a pattern at the end of the sequence, not in the middle (I think). Finally, using vmatchPattern('AAAAAAAA',sread(fq)) I can sort of work out the position of the polyA but this seems a bit clumsy.

Any guidance to point me in the right direction would be appreciated - I'm sure someone must have done this before!. Would rather do this in R as I'm going to use Rsubread for the alignment step.

Many thanks,

Phil

shortread • 2.3k views

ADD COMMENT • link updated 8.1 years ago by Martin Morgan 25k • written 8.1 years ago by chapmandu2 ▴ 20

score 1 · Answer 1 · 2016-04-06

1

Entering edit mode

Martin Morgan 25k

@martin-morgan-1513

Last seen 12 days ago

United States

I'm not sure that there's anything magical. After example(trimEnds) I did

starts = start(vmatchPattern("AATC", sread(rfq), fixed=TRUE))
idx = !duplicated(rep(seq_along(starts), lengths(starts)))
end[lengths(starts) != 0] = unlist(start(match))[idx]
narrow(rfq, end=end)

which chose to trim from the first match; fromLast=TRUE in duplicated() would trim from the last match.

ADD COMMENT • link 8.1 years ago Martin Morgan 25k

0

Entering edit mode

Thanks Martin. I tried your method but although it worked fine on the example when I applied it to larger datasets (1M reads) it was slowed right down.

What I ended up with was the below converting the reads to characters and then just doing a regexpr:

match_pos <- unlist(sapply(as.character(sread(rfq)), regexpr, pattern= 'AATC', USE.NAMES = FALSE))
match_pos_idx <- which(match_pos >= 0)
match_clip <- width(rfq)
match_clip[match_pos_idx] <- match_pos[match_pos_idx]
rfq <- narrow(rfq, end=match_clip)

match_pos is a vector the same length as rfq with -1 where no match was found. So replacing -1 with the width of the read yields a vector which can be fed into narrow.

Oddly whilst the regexpr part was slower than vMatchPattern, it was running start on the ByPos_MIndex object that really snarled things up even on small subsets of my data, not sure what was going on here but I'll post back if I figure it out - might be to do with the fact that searching for poly-A's means I get many hits and regexp just returns the first one whilst vMatchPattern returns all of them.

Thanks again for the help

ADD REPLY • link 8.1 years ago chapmandu2 ▴ 20

1

Entering edit mode

Hi Martin,

Just an example with timings to show what happens when the ByPos_MIndex start method gets called. Again the start point is example(trimEnds). I can see that the start method for a ByPos_MIndex object is derived from that from a RangesList object which in turn uses an lapply so presumably this is the slow bit?

> rfq = rfq[rep(1:256, 100)]
> system.time(match_pos <- regexpr("AATC", sread(rfq), fixed=TRUE))
   user  system elapsed
  0.012   0.000   0.012
> system.time(vmp_out <- vmatchPattern("AATC", sread(rfq), fixed=TRUE))
   user  system elapsed
  0.030   0.000   0.031
> system.time(starts <- start(vmp_out))
   user  system elapsed
 10.635   0.027  10.660
> getMethod('start', 'RangesList')
Method Definition:

function (x, ...)
{
    .local <- function (x)
    S4Vectors:::new_SimpleList_from_list("SimpleIntegerList",
        lapply(x, start))
    .local(x, ...)
}
<environment: namespace:IRanges>

Signatures:
        x           
target  "RangesList"
defined "RangesList"
> system.time(lapply(vmp_out, start))
   user  system elapsed
 10.347   0.028  10.374
> class(vmp_out[[8]])
[1] "IRanges"
attr(,"package")
[1] "IRanges"
> system.time(start(vmp_out[[8]]))
   user  system elapsed
  0.000   0.000   0.001
> sessionInfo()
R version 3.2.2 (2015-08-14)
Platform: x86_64-apple-darwin13.4.0 (64-bit)
Running under: OS X 10.9.5 (Mavericks)

locale:
[1] en_GB.UTF-8/en_GB.UTF-8/en_GB.UTF-8/C/en_GB.UTF-8/en_GB.UTF-8

attached base packages:
[1] stats4    parallel  stats     graphics  grDevices utils     datasets  methods   base     

other attached packages:
 [1] ShortRead_1.28.0           GenomicAlignments_1.6.3    SummarizedExperiment_1.0.2 Biobase_2.30.0            
 [5] Rsamtools_1.22.0           GenomicRanges_1.22.3       GenomeInfoDb_1.6.1         Biostrings_2.38.3         
 [9] XVector_0.10.0             IRanges_2.4.6              S4Vectors_0.8.7            BiocParallel_1.4.3        
[13] BiocGenerics_0.16.1       

loaded via a namespace (and not attached):
 [1] lattice_0.20-33      bitops_1.0-6         grid_3.2.2           futile.options_1.0.0 zlibbioc_1.16.0     
 [6] hwriter_1.3.2        latticeExtra_0.6-26  futile.logger_1.4.1  RColorBrewer_1.1-2   lambda.r_1.1.7      
[11] tools_3.2.2

ADD REPLY • link 8.1 years ago phil.chapman ▴ 150

1

Entering edit mode

Yes, I guess the MIndex objects are not using the latest tricks. This

> system.time(s1 <- relist(start(unlist(vmp_out)), vmp_out))
   user  system elapsed 
  0.009   0.000   0.008

gets us the list-of-starts quickly, or more directly for the purpose at hand

> system.time({ 
+     idx = !duplicated(rep(seq_along(vmp_out), lengths(vmp_out)))
+     start(unlist(vmp_out))[idx]
+ })
   user  system elapsed 
  0.006   0.000   0.006

Also,

> system.time(start(as(vmp_out, "CompressedIRangesList")))
   user  system elapsed 
  0.009   0.000   0.010

Probably there is an underlying reason why the return value of VRangesList is not compressed by default...

I would have thought that working with vmatchPattern() would be more memory efficient than regexpr() (not duplicating the read data as character vectors), but I haven't had too much luck in the speculation department on this thread.

ADD REPLY • link 8.1 years ago Martin Morgan 25k

0

Entering edit mode

Since regexpr is vectorized I think you could regexpr("AATC", sread(rfq)) and this would be much faster than using sapply? I think also using the fixed=TRUE argument to regexpr will speed things up. Sorry that vmatchPattern() turns out to be not as speedy in this case.

Here are some timings

> rfq = rfq[rep(1:256, 4096)]
> system.time(regexpr("AATC", sread(rfq), fixed=TRUE))
   user  system elapsed 
  0.817   0.000   0.817 
> system.time(regexpr("AATC", sread(rfq), fixed=FALSE))
   user  system elapsed 
  1.848   0.000   1.847 
> system.time(vmatchPattern("AATC", sread(rfq), fixed=TRUE))   user  system elapsed 
  4.830   0.008   4.834
> system.time(sapply(as.character(sread(rfq)), regexpr, pattern="AATC", fixed=TRUE))
   user  system elapsed 
  7.735   0.044   7.778

ADD REPLY • link 8.1 years ago Martin Morgan 25k

1

Entering edit mode

Fantastic thanks Martin, I had forgotten about that. I get a x10 speed boost on 1 milion reads

> system.time(polyApos <- sapply(as.character(sread(fq3)), regexpr, pattern= paste0(rep('A',10), collapse=''), USE.NAMES = FALSE) %>% unlist())
   user  system elapsed
 11.738   0.042  11.778
> system.time(polyApos <- regexpr(pattern= paste0(rep('A',10), collapse=''), sread(fq3), fixed=TRUE) %>% unlist())
   user  system elapsed
  0.986   0.004   0.991

ADD REPLY • link 8.1 years ago chapmandu2 ▴ 20

0

Entering edit mode

For completeness this is what I get with 1000 reads:

system.time(polyApos_vmp <- vmatchPattern(paste0(rep('A',10), collapse=''),sread(fq4), fixed=TRUE))
   user  system elapsed
   0.02    0.00    0.02

system.time(starts <- start(polyApos_vmp))  ### SLOW!!!
   user  system elapsed
  3.845   0.030   3.874

system.time(polyApos <- regexpr(pattern= paste0(rep('A',10), collapse=''), sread(fq3), fixed=TRUE) %>% unlist())
   user  system elapsed
  0.008   0.000   0.008

> polyApos_vmp
MIndex object of length 9492
[[1]]
IRanges of length 0

[[2]]
IRanges of length 10
     start end width
[1]     93 102    10
[2]     94 103    10
[3]     95 104    10
[4]     96 105    10
[5]     97 106    10
[6]     98 107    10
[7]     99 108    10
[8]    100 109    10
[9]    101 110    10
[10]   102 111    10

[[3]]
IRanges of length 0

...
<9489 more elements>

> starts
IntegerList of length 9492
[[1]] integer(0)
[[2]] 93 94 95 96 97 98 99 100 101 102
[[3]] integer(0)
[[4]] integer(0)
[[5]] 106 107 108 109 110 111 112 113
[[6]] integer(0)
[[7]] integer(0)
[[8]] integer(0)
[[9]] integer(0)
[[10]] 60 61 62 63 64 65 66 67
...
<9482 more elements>

ADD REPLY • link 8.1 years ago chapmandu2 ▴ 20