Question

Retrieving specific sequences from a fastq

0

Entering edit mode

ferbecneu • 0

@ferbecneu-16188

Last seen 5.3 years ago

Hi, Im working with a 4c-seq experiment and one of the issues with this technique is that you have to delete reads that originated adjacent from the sequence of interest, I have the sequence corresponding to these reads, I need to delete these reads, till now I only found the reads that have this sequence using vmatchpattern but how can i delete these reads and create a new fastq archive without them?

Thank you very much for your help!

shortread biostrings • 5.7k views

ADD COMMENT • link 7.3 years ago ferbecneu • 0

score 0 · Answer 1 · 2018-10-05

0

Entering edit mode

Martin Morgan 25k

@martin-morgan-1513

Last seen 4 days ago

United States

See ShortRead::fastqFilter() including the example.

ADD COMMENT • link 7.3 years ago Martin Morgan 25k

score 0 · Answer 2 · 2018-10-05

0

Entering edit mode

ferbecneu • 0

@ferbecneu-16188

Last seen 5.3 years ago

Thank you Martin, I read the function you said: (filterFastq(files, destinations, ..., filter = FilterRules(), compress=TRUE, yieldSize = 1000000L). However Im starting with bioinformatics and I dont know how to write filter/filterRules to remove the reads that contain the sequence: GCAGTGGGGCGGGTCTAAGCAGA.

Can you help me doing so please? Thank you very much!

ADD COMMENT • link 7.3 years ago ferbecneu • 0

0

Entering edit mode

Looks like this is a COMMENT rather than Answer; please use the 'Add comment' button for comments.

To get some example data, I entered the following into my R session

library(ShortRead)
example(fastqFilter)

This creates a variable that points to a small fastq file; I read it in and extracted the short read sequences, just to get a look

> sr = readFastq(fl)
> sread(sr)
  A DNAStringSet instance of length 256
      width seq
  [1]    36 GGACTTTGTAGGATACCCTCGCTTTCCTTCTCCTGT
  [2]    36 GATTTCTTACCTATTAGTGGTTGAACAGCATCGGAC
  [3]    36 GCGGTGGTCTATAGTGTTATTAATATCAATTTGGGT
  [4]    36 GTTACCATGATGTTATTTCTTCATTTGGAGGTAAAA
  [5]    36 GTATGTTTCTCCTGCTTATCACCTTCTTGAAGGCTT
  ...   ... ...
[252]    36 GTTTAGATATGAGTCACATTTTGTTCATGGTAGAGT
[253]    36 GTTTTACAGACACCTAAAGCTACATCGTCAACGTTA
[254]    36 GATGAACTAAGTCAACCTCAGCACTAACCTTGCGAG
[255]    36 GTTTGGTTCGCTTTGAGTCTTCTTCGGTTCCGACTA
[256]    36 GCAATCTGCCGACCACTCGCGATTCAATCATGACTT

Suppose I wanted to get rid of sequences with "TTACC". I could use the grepl() function to find the reads that do not contain this pattern, and subset the original reads to get those that I want to keep

> sr[!grepl("TTACC", sread(sr))]
class: ShortReadQ
length: 244 reads; width: 36 cycles

So it looks like there are 244 reads that satisfy my criterion. Now write a function that does this, and test...

fun <- function(x)
    x[ !grepl("TTACC", sread(x)) ]

Verify that it works

> fun(sr)
class: ShortReadQ
length: 244 reads; width: 36 cycles

Then use it in the filterFastq function, creating a new fastq file with the filtered results

> out <- tempfile()
> filterFastq(fl, out, filter=fun)
[1] "/var/folders/yn/gmsh_22s2c55v816r6d51fx1tnyl61/T//Rtmpnfw9lM/file172496fd16a17"
attr(,"filter")
                 Reads KeptReads Nucl KeptNucl
s_1_sequence.txt   256       244 9216     8784

verify that the output file contains the correct number of sequences

> readFastq(out)
class: ShortReadQ
length: 244 reads; width: 36 cycles