Using ShortRead (or other) to trim from pattern match onwards
1
1
Entering edit mode
chapmandu2 ▴ 20
@chapmandu2-10053
Last seen 4.2 years ago

Apologies if this has been asked before, I've found several similar questions but nothing quite what I want...

I am analysing some data from a 3'UTR-RNA-seq protocol.  In a 150bp NextSeq read, what often happens is that you get something like:

The 12bpAdapter is fixed, but the size of the Poly and 3' Adapter is not fixed.

Having got my data as a ShortReadQ object, I can then use narrow(fq, start=13) to get rid of the 12bp adapter, but I'm not sure how to get rid of the PolyA onwards.  Intuitively I want to do something like grep('AAAAAAAA', string) to get the position of the start of the polyA and then use narrow(fq, end=pos) again where necessary.

The trimTailw function is nice for filtering on quality scores, a windowing alphabetScore function would be another way of doing it.  There is also trimLRPatterns but this will only find a pattern at the end of the sequence, not in the middle (I think).  Finally, using vmatchPattern('AAAAAAAA',sread(fq)) I can sort of work out the position of the polyA but this seems a bit clumsy.

Any guidance to point me in the right direction would be appreciated - I'm sure someone must have done this before!.  Would rather do this in R as I'm going to use Rsubread for the alignment step.

Many thanks,

Phil

1
Entering edit mode
@martin-morgan-1513
Last seen 13 days ago
United States

I'm not sure that there's anything magical. After example(trimEnds) I did

starts = start(vmatchPattern("AATC", sread(rfq), fixed=TRUE))
idx = !duplicated(rep(seq_along(starts), lengths(starts)))
end[lengths(starts) != 0] = unlist(start(match))[idx]
narrow(rfq, end=end)

which chose to trim from the first match; fromLast=TRUE in duplicated() would trim from the last match.

0
Entering edit mode

Thanks Martin.  I tried your method but although it worked fine on the example when I applied it to larger datasets (1M reads) it was slowed right down.

What I ended up with was the below converting the reads to characters and then just doing a regexpr:

match_pos <- unlist(sapply(as.character(sread(rfq)), regexpr, pattern= 'AATC', USE.NAMES = FALSE))
match_pos_idx <- which(match_pos >= 0)
match_clip <- width(rfq)
match_clip[match_pos_idx] <- match_pos[match_pos_idx]
rfq <- narrow(rfq, end=match_clip)

match_pos is a vector the same length as rfq with -1 where no match was found.  So replacing -1 with the width of the read yields a vector which can be fed into narrow.

Oddly whilst the regexpr part was slower than vMatchPattern, it was running start on the ByPos_MIndex object that really snarled things up even on small subsets of my data, not sure what was going on here but I'll post back if I figure it out - might be to do with the fact that searching for poly-A's means I get many hits and regexp just returns the first one whilst vMatchPattern returns all of them.

Thanks again for the help

1
Entering edit mode

Hi Martin,

Just an example with timings to show what happens when the ByPos_MIndex start method gets called.  Again the start point is example(trimEnds).  I can see that the start method for a ByPos_MIndex object is derived from that from a RangesList object which in turn uses an lapply so presumably this is the slow bit?

> rfq = rfq[rep(1:256, 100)]
> system.time(match_pos <- regexpr("AATC", sread(rfq), fixed=TRUE))
user  system elapsed
0.012   0.000   0.012
> system.time(vmp_out <- vmatchPattern("AATC", sread(rfq), fixed=TRUE))
user  system elapsed
0.030   0.000   0.031
> system.time(starts <- start(vmp_out))
user  system elapsed
10.635   0.027  10.660
> getMethod('start', 'RangesList')
Method Definition:

function (x, ...)
{
.local <- function (x)
S4Vectors:::new_SimpleList_from_list("SimpleIntegerList",
lapply(x, start))
.local(x, ...)
}
<environment: namespace:IRanges>

Signatures:
x
target  "RangesList"
defined "RangesList"
> system.time(lapply(vmp_out, start))
user  system elapsed
10.347   0.028  10.374
> class(vmp_out[[8]])
[1] "IRanges"
attr(,"package")
[1] "IRanges"
> system.time(start(vmp_out[[8]]))
user  system elapsed
0.000   0.000   0.001
> sessionInfo()
R version 3.2.2 (2015-08-14)
Platform: x86_64-apple-darwin13.4.0 (64-bit)
Running under: OS X 10.9.5 (Mavericks)

locale:
[1] en_GB.UTF-8/en_GB.UTF-8/en_GB.UTF-8/C/en_GB.UTF-8/en_GB.UTF-8

attached base packages:
[1] stats4    parallel  stats     graphics  grDevices utils     datasets  methods   base

other attached packages:
[5] Rsamtools_1.22.0           GenomicRanges_1.22.3       GenomeInfoDb_1.6.1         Biostrings_2.38.3
[9] XVector_0.10.0             IRanges_2.4.6              S4Vectors_0.8.7            BiocParallel_1.4.3
[13] BiocGenerics_0.16.1

loaded via a namespace (and not attached):
[1] lattice_0.20-33      bitops_1.0-6         grid_3.2.2           futile.options_1.0.0 zlibbioc_1.16.0
[6] hwriter_1.3.2        latticeExtra_0.6-26  futile.logger_1.4.1  RColorBrewer_1.1-2   lambda.r_1.1.7
[11] tools_3.2.2
1
Entering edit mode

Yes, I guess the MIndex objects are not using the latest tricks. This

> system.time(s1 <- relist(start(unlist(vmp_out)), vmp_out))
user  system elapsed
0.009   0.000   0.008 

gets us the list-of-starts quickly, or more directly for the purpose at hand

> system.time({
+     idx = !duplicated(rep(seq_along(vmp_out), lengths(vmp_out)))
+     start(unlist(vmp_out))[idx]
+ })
user  system elapsed
0.006   0.000   0.006


Also,

> system.time(start(as(vmp_out, "CompressedIRangesList")))
user  system elapsed
0.009   0.000   0.010 

Probably there is an underlying reason why the return value of VRangesList is not compressed by default...

I would have thought that working with vmatchPattern() would be more memory efficient than regexpr() (not duplicating the read data as character vectors), but I haven't had too much luck in the speculation department on this thread.

0
Entering edit mode

Since regexpr is vectorized I think you could regexpr("AATC", sread(rfq)) and this would be much faster than using sapply? I think also using the fixed=TRUE argument to regexpr will speed things up. Sorry that vmatchPattern() turns out to be not as speedy in this case.

Here are some timings

> rfq = rfq[rep(1:256, 4096)]
user  system elapsed
0.817   0.000   0.817
user  system elapsed
1.848   0.000   1.847
> system.time(vmatchPattern("AATC", sread(rfq), fixed=TRUE))   user  system elapsed
4.830   0.008   4.834
user  system elapsed
7.735   0.044   7.778


1
Entering edit mode

Fantastic thanks Martin, I had forgotten about that.  I get a x10 speed boost on 1 milion reads

> system.time(polyApos <- sapply(as.character(sread(fq3)), regexpr, pattern= paste0(rep('A',10), collapse=''), USE.NAMES = FALSE) %>% unlist())
user  system elapsed
11.738   0.042  11.778
> system.time(polyApos <- regexpr(pattern= paste0(rep('A',10), collapse=''), sread(fq3), fixed=TRUE) %>% unlist())
user  system elapsed
0.986   0.004   0.991
0
Entering edit mode

For completeness this is what I get with 1000 reads:

system.time(polyApos_vmp <- vmatchPattern(paste0(rep('A',10), collapse=''),sread(fq4), fixed=TRUE))
user  system elapsed
0.02    0.00    0.02

system.time(starts <- start(polyApos_vmp))  ### SLOW!!!
user  system elapsed
3.845   0.030   3.874

system.time(polyApos <- regexpr(pattern= paste0(rep('A',10), collapse=''), sread(fq3), fixed=TRUE) %>% unlist())
user  system elapsed
0.008   0.000   0.008

> polyApos_vmp
MIndex object of length 9492
[[1]]
IRanges of length 0

[[2]]
IRanges of length 10
start end width
[1]     93 102    10
[2]     94 103    10
[3]     95 104    10
[4]     96 105    10
[5]     97 106    10
[6]     98 107    10
[7]     99 108    10
[8]    100 109    10
[9]    101 110    10
[10]   102 111    10

[[3]]
IRanges of length 0

...
<9489 more elements>

> starts
IntegerList of length 9492
[[1]] integer(0)
[[2]] 93 94 95 96 97 98 99 100 101 102
[[3]] integer(0)
[[4]] integer(0)
[[5]] 106 107 108 109 110 111 112 113
[[6]] integer(0)
[[7]] integer(0)
[[8]] integer(0)
[[9]] integer(0)
[[10]] 60 61 62 63 64 65 66 67
...
<9482 more elements>