I am analyzing deep sequencing data and would like to manipulate these large datasets (>100,000 reads, in either FASTA or BAM format) to do the following:
#1 - Exclude primer sequences (short strings of 25-30 nt)
e.g. I want to exclude all matches to 'CAAACTCAAATCTAATCTAACCAAAAAAAC' and 'CAACCTTTTAATCTAACCAAAAAAAC'
#2 - Filter out short reads (< 100 bp)
#3 - And finally, exclude reverse-oriented sequences
I am currently using tools outside R (samtools), but it would be great to have everything running in R.
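To make the three steps concrete, here is a minimal base-R sketch of the logic I have in mind, run on toy data rather than real reads. It rests on several assumptions that may not match the real data: the reads are already in a plain character vector, primers are removed by exact matching (trimmed out of the read, not the whole read discarded), the length filter is applied after trimming, and the orientation of each read is already known as a logical vector. The names `filter_reads`, `is_reverse` and `min_len` are made up for illustration.

```r
# The two primer sequences from the question.
primers <- c("CAAACTCAAATCTAATCTAACCAAAAAAAC",
             "CAACCTTTTAATCTAACCAAAAAAAC")

# Hypothetical helper: applies the three filters to a character vector of reads.
# is_reverse is an assumed logical vector, TRUE for reverse-oriented reads.
filter_reads <- function(reads, is_reverse, primers, min_len = 100) {
  # Step 1: remove every exact occurrence of each primer (no mismatches allowed).
  for (p in primers) {
    reads <- gsub(p, "", reads, fixed = TRUE)
  }
  # Step 2: flag reads that are still at least min_len nt long after trimming.
  long_enough <- nchar(reads) >= min_len
  # Step 3: keep only long, forward-oriented reads.
  reads[long_enough & !is_reverse]
}

# Toy data standing in for real reads.
toy_reads <- c(
  paste0(primers[1], strrep("ACGT", 30)),  # forward, long enough after trimming
  paste0(primers[2], strrep("ACGT", 10)),  # forward, too short after trimming
  strrep("ACGT", 30)                       # long enough, but reverse-oriented
)
toy_reverse <- c(FALSE, FALSE, TRUE)

kept <- filter_reads(toy_reads, toy_reverse, primers)
```

With the toy data only the first read should survive, with its primer removed. What I am looking for is the equivalent that reads FASTA or BAM directly and copes with >100,000 reads.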
Thanks in advance!