Trim DNAStringSets using ShortRead while keeping the primer sequences?
1
1
Entering edit mode
@steveneverman-9218
Last seen 9.0 years ago
United States

I am editing/filtering NGS data using ShortRead and I want to trim the reads outside of my primer sequences. The reads are not the same length though. I know I could trim the primers and what flanks them but including the primer sequences is important to me.

Here is an example of what I'm looking for with 3 sequences in a DNAStringSet:

start with:

5' forwardprimerseqAGATCGAGTAG*reverseprimerseq*ATGCGTATA 3'
5' forwardprimerseqAGTTCAAG*reverseprimerseq*ATGCTAGATGAT 3
5' forwardprimerseqAGTTC*reverseprimerseq*ATG 3'

get in return:

5' *forwardprimerseq*AGATCGAGTAG*reverseprimerseq *3'
5' *forwardprimerseq*AGCAAG*reverseprimerseq* 3'
5' forwardprimerseqAGTTC*reverseprimerseq* 3'

Steven 

sequencing biostrings shortread • 2.6k views
ADD COMMENT
0
Entering edit mode
@martin-morgan-1513
Last seen 3 months ago
United States

Here's some sample data

library(ShortRead)
example(readFastq)

resulting in

> sread(rfq)
  A DNAStringSet instance of length 256
      width seq
  [1]    36 GGACTTTGTAGGATACCCTCGCTTTCCTTCTCCTGT
  [2]    36 GATTTCTTACCTATTAGTGGTTGAACAGCATCGGAC
  [3]    36 GCGGTGGTCTATAGTGTTATTAATATCAATTTGGGT
  [4]    36 GTTACCATGATGTTATTTCTTCATTTGGAGGTAAAA
  [5]    36 GTATGTTTCTCCTGCTTATCACCTTCTTGAAGGCTT
  ...   ... ...
[252]    36 GTTTAGATATGAGTCACATTTTGTTCATGGTAGAGT
[253]    36 GTTTTACAGACACCTAAAGCTACATCGTCAACGTTA
[254]    36 GATGAACTAAGTCAACCTCAGCACTAACCTTGCGAG
[255]    36 GTTTGGTTCGCTTTGAGTCTTCTTCGGTTCCGACTA
[256]    36 GCAATCTGCCGACCACTCGCGATTCAATCATGACTT

Supposing that my reverse primer sequence is 'TTTCC', that the primer sequence occurs uniquely, and that I'm interested in exact matches. Here I find matches

primer <- "TTTCC"
matches <- vmatchPattern(primer, sread(rfq), fixed=TRUE)

For simplicity I only look at those with a single match

idx <- lengths(matches) == 1L

I narrow the reads and sequences to the corresponding end() of the match

rfq[idx] <- narrow(rfq[idx], 1, unlist(end(matches)[idx]))

And verify that I've done a good job

> sread(rfq)[idx]
  A DNAStringSet instance of length 22
     width seq
 [1]    27 GGACTTTGTAGGATACCCTCGCTTTCC
 [2]    34 GTACGCTGGACTTTGTAGGATACCCTCGCTTTCC
 [3]    27 GGACTTTGTAGGATACCCTCGCTTTCC
 [4]    27 GCCACCATGATTATGACCAGTGTTTCC
 [5]    30 GGGAGGGTGTCAATCCTGACGGTTATTTCC
 ...   ... ...
[18]    30 GGACGCTCGACGCCATTAATAATGTTTTCC
[19]     7 GTTTTCC
[20]     7 GGTTTCC
[21]    19 GGAAAACACCAATCTTTCC
[22]    19 GGAAAGATTGGTGTTTTCC
> quality(rfq)[idx]
class: SFastqQuality
quality:
  A BStringSet instance of length 22
     width seq
 [1]    27 ]]]]]]]]]]]]Y]Y]]]]]]]]]]]]
 [2]    34 ]]]]]]]]]]]]]]]]Y]]P]RRTYYV][VZXHF
 [3]    27 ]]]]]]]]]]]]R]]]]]YYY]VT]RY
 [4]    27 ]]]]]]V]]]]]]]]YV]]]T]]]]]]
 [5]    30 ]]]]]]]]]]]VY]]]]]R]]]T]Y]Y]RV
 ...   ... ...
[18]    30 ]]]]]]]]]O]]]]O]]R]]V]]]Y]]]WZ
[19]     7 ]]]]]]]
[20]     7 ]]]]]]]
[21]    19 ]]]]]]]]]]]Y]]]]]]]
[22]    19 ]]]]]]]]]]]]]]]]]]]

 

 

ADD COMMENT

Login before adding your answer.

Traffic: 733 users visited in the last hour
Help About
FAQ
Access RSS
API
Stats

Use of this site constitutes acceptance of our User Agreement and Privacy Policy.

Powered by the version 2.3.6