Using ShortRead (or other) to trim from pattern match onwards
1
1
Entering edit mode
chapmandu2 ▴ 20
@chapmandu2-10053
Last seen 5.8 years ago

Apologies if this has been asked before, I've found several similar questions but nothing quite what I want...

I am analysing some data from a 3'UTR-RNA-seq protocol.  In a 150bp NextSeq read, what often happens is that you get something like:

12bpAdapter---100bpRNA--15bpPolyA--20bpAdapter

The 12bpAdapter is fixed, but the size of the Poly and 3' Adapter is not fixed.

Having got my data as a ShortReadQ object, I can then use narrow(fq, start=13) to get rid of the 12bp adapter, but I'm not sure how to get rid of the PolyA onwards.  Intuitively I want to do something like grep('AAAAAAAA', string) to get the position of the start of the polyA and then use narrow(fq, end=pos) again where necessary. 

The trimTailw function is nice for filtering on quality scores, a windowing alphabetScore function would be another way of doing it.  There is also trimLRPatterns but this will only find a pattern at the end of the sequence, not in the middle (I think).  Finally, using vmatchPattern('AAAAAAAA',sread(fq)) I can sort of work out the position of the polyA but this seems a bit clumsy.

Any guidance to point me in the right direction would be appreciated - I'm sure someone must have done this before!.  Would rather do this in R as I'm going to use Rsubread for the alignment step.

Many thanks,

Phil

 

 

 

shortread • 2.3k views
ADD COMMENT
1
Entering edit mode
@martin-morgan-1513
Last seen 12 days ago
United States

I'm not sure that there's anything magical. After example(trimEnds) I did

starts = start(vmatchPattern("AATC", sread(rfq), fixed=TRUE))
idx = !duplicated(rep(seq_along(starts), lengths(starts)))
end[lengths(starts) != 0] = unlist(start(match))[idx]
narrow(rfq, end=end)

which chose to trim from the first match; fromLast=TRUE in duplicated() would trim from the last match.

ADD COMMENT
0
Entering edit mode

Thanks Martin.  I tried your method but although it worked fine on the example when I applied it to larger datasets (1M reads) it was slowed right down.

What I ended up with was the below converting the reads to characters and then just doing a regexpr:

match_pos <- unlist(sapply(as.character(sread(rfq)), regexpr, pattern= 'AATC', USE.NAMES = FALSE))
match_pos_idx <- which(match_pos >= 0)
match_clip <- width(rfq)
match_clip[match_pos_idx] <- match_pos[match_pos_idx]
rfq <- narrow(rfq, end=match_clip)

match_pos is a vector the same length as rfq with -1 where no match was found.  So replacing -1 with the width of the read yields a vector which can be fed into narrow.

Oddly whilst the regexpr part was slower than vMatchPattern, it was running start on the ByPos_MIndex object that really snarled things up even on small subsets of my data, not sure what was going on here but I'll post back if I figure it out - might be to do with the fact that searching for poly-A's means I get many hits and regexp just returns the first one whilst vMatchPattern returns all of them.

Thanks again for the help

 

 

ADD REPLY
1
Entering edit mode

Hi Martin,

Just an example with timings to show what happens when the ByPos_MIndex start method gets called.  Again the start point is example(trimEnds).  I can see that the start method for a ByPos_MIndex object is derived from that from a RangesList object which in turn uses an lapply so presumably this is the slow bit?

> rfq = rfq[rep(1:256, 100)]
> system.time(match_pos <- regexpr("AATC", sread(rfq), fixed=TRUE))
   user  system elapsed
  0.012   0.000   0.012
> system.time(vmp_out <- vmatchPattern("AATC", sread(rfq), fixed=TRUE))
   user  system elapsed
  0.030   0.000   0.031
> system.time(starts <- start(vmp_out))
   user  system elapsed
 10.635   0.027  10.660
> getMethod('start', 'RangesList')
Method Definition:

function (x, ...)
{
    .local <- function (x)
    S4Vectors:::new_SimpleList_from_list("SimpleIntegerList",
        lapply(x, start))
    .local(x, ...)
}
<environment: namespace:IRanges>

Signatures:
        x           
target  "RangesList"
defined "RangesList"
> system.time(lapply(vmp_out, start))
   user  system elapsed
 10.347   0.028  10.374
> class(vmp_out[[8]])
[1] "IRanges"
attr(,"package")
[1] "IRanges"
> system.time(start(vmp_out[[8]]))
   user  system elapsed
  0.000   0.000   0.001
> sessionInfo()
R version 3.2.2 (2015-08-14)
Platform: x86_64-apple-darwin13.4.0 (64-bit)
Running under: OS X 10.9.5 (Mavericks)

locale:
[1] en_GB.UTF-8/en_GB.UTF-8/en_GB.UTF-8/C/en_GB.UTF-8/en_GB.UTF-8

attached base packages:
[1] stats4    parallel  stats     graphics  grDevices utils     datasets  methods   base     

other attached packages:
 [1] ShortRead_1.28.0           GenomicAlignments_1.6.3    SummarizedExperiment_1.0.2 Biobase_2.30.0            
 [5] Rsamtools_1.22.0           GenomicRanges_1.22.3       GenomeInfoDb_1.6.1         Biostrings_2.38.3         
 [9] XVector_0.10.0             IRanges_2.4.6              S4Vectors_0.8.7            BiocParallel_1.4.3        
[13] BiocGenerics_0.16.1       

loaded via a namespace (and not attached):
 [1] lattice_0.20-33      bitops_1.0-6         grid_3.2.2           futile.options_1.0.0 zlibbioc_1.16.0     
 [6] hwriter_1.3.2        latticeExtra_0.6-26  futile.logger_1.4.1  RColorBrewer_1.1-2   lambda.r_1.1.7      
[11] tools_3.2.2
ADD REPLY
1
Entering edit mode

Yes, I guess the MIndex objects are not using the latest tricks. This

> system.time(s1 <- relist(start(unlist(vmp_out)), vmp_out))
   user  system elapsed 
  0.009   0.000   0.008 

gets us the list-of-starts quickly, or more directly for the purpose at hand

> system.time({ 
+     idx = !duplicated(rep(seq_along(vmp_out), lengths(vmp_out)))
+     start(unlist(vmp_out))[idx]
+ })
   user  system elapsed 
  0.006   0.000   0.006 

Also,

> system.time(start(as(vmp_out, "CompressedIRangesList")))
   user  system elapsed 
  0.009   0.000   0.010 

Probably there is an underlying reason why the return value of VRangesList is not compressed by default...

I would have thought that working with vmatchPattern() would be more memory efficient than regexpr() (not duplicating the read data as character vectors), but I haven't had too much luck in the speculation department on this thread.

ADD REPLY
0
Entering edit mode

Since regexpr is vectorized I think you could regexpr("AATC", sread(rfq)) and this would be much faster than using sapply? I think also using the fixed=TRUE argument to regexpr will speed things up. Sorry that vmatchPattern() turns out to be not as speedy in this case.

Here are some timings

> rfq = rfq[rep(1:256, 4096)]
> system.time(regexpr("AATC", sread(rfq), fixed=TRUE))
   user  system elapsed 
  0.817   0.000   0.817 
> system.time(regexpr("AATC", sread(rfq), fixed=FALSE))
   user  system elapsed 
  1.848   0.000   1.847 
> system.time(vmatchPattern("AATC", sread(rfq), fixed=TRUE))   user  system elapsed 
  4.830   0.008   4.834
> system.time(sapply(as.character(sread(rfq)), regexpr, pattern="AATC", fixed=TRUE))
   user  system elapsed 
  7.735   0.044   7.778 

 

ADD REPLY
1
Entering edit mode

Fantastic thanks Martin, I had forgotten about that.  I get a x10 speed boost on 1 milion reads

> system.time(polyApos <- sapply(as.character(sread(fq3)), regexpr, pattern= paste0(rep('A',10), collapse=''), USE.NAMES = FALSE) %>% unlist())
   user  system elapsed
 11.738   0.042  11.778
> system.time(polyApos <- regexpr(pattern= paste0(rep('A',10), collapse=''), sread(fq3), fixed=TRUE) %>% unlist())
   user  system elapsed
  0.986   0.004   0.991
ADD REPLY
0
Entering edit mode

For completeness this is what I get with 1000 reads:

system.time(polyApos_vmp <- vmatchPattern(paste0(rep('A',10), collapse=''),sread(fq4), fixed=TRUE))
   user  system elapsed
   0.02    0.00    0.02

system.time(starts <- start(polyApos_vmp))  ### SLOW!!!
   user  system elapsed
  3.845   0.030   3.874

system.time(polyApos <- regexpr(pattern= paste0(rep('A',10), collapse=''), sread(fq3), fixed=TRUE) %>% unlist())
   user  system elapsed
  0.008   0.000   0.008

> polyApos_vmp
MIndex object of length 9492
[[1]]
IRanges of length 0

[[2]]
IRanges of length 10
     start end width
[1]     93 102    10
[2]     94 103    10
[3]     95 104    10
[4]     96 105    10
[5]     97 106    10
[6]     98 107    10
[7]     99 108    10
[8]    100 109    10
[9]    101 110    10
[10]   102 111    10

[[3]]
IRanges of length 0

...
<9489 more elements>

> starts
IntegerList of length 9492
[[1]] integer(0)
[[2]] 93 94 95 96 97 98 99 100 101 102
[[3]] integer(0)
[[4]] integer(0)
[[5]] 106 107 108 109 110 111 112 113
[[6]] integer(0)
[[7]] integer(0)
[[8]] integer(0)
[[9]] integer(0)
[[10]] 60 61 62 63 64 65 66 67
...
<9482 more elements>
ADD REPLY

Login before adding your answer.

Traffic: 755 users visited in the last hour
Help About
FAQ
Access RSS
API
Stats

Use of this site constitutes acceptance of our User Agreement and Privacy Policy.

Powered by the version 2.3.6