Question: Retrieving specific sequences from a fastq
0
gravatar for ferbecneu
12 months ago by
ferbecneu0
ferbecneu0 wrote:

Hi, Im working with a 4c-seq experiment and one of the issues with this technique is that you have to delete reads that originated adjacent from the sequence of interest, I have the sequence corresponding to these reads, I need to delete these reads, till now I only found the reads that have this sequence using vmatchpattern but how can i delete these reads and create a new fastq archive without them?

Thank you very much for your help!

biostrings shortread • 303 views
ADD COMMENTlink modified 12 months ago • written 12 months ago by ferbecneu0
Answer: Retrieving specific sequences from a fastq
0
gravatar for Martin Morgan
12 months ago by
Martin Morgan ♦♦ 23k
United States
Martin Morgan ♦♦ 23k wrote:

See ShortRead::fastqFilter() including the example.

ADD COMMENTlink written 12 months ago by Martin Morgan ♦♦ 23k
Answer: Retrieving specific sequences from a fastq
0
gravatar for ferbecneu
12 months ago by
ferbecneu0
ferbecneu0 wrote:

Thank you Martin, I read the function you said: (filterFastq(files, destinations, ..., filter = FilterRules(), compress=TRUE, yieldSize = 1000000L). However Im starting with bioinformatics and I dont know how to write filter/filterRules to remove the reads that contain the sequence: GCAGTGGGGCGGGTCTAAGCAGA.

Can you help me doing so please? Thank you very much!

ADD COMMENTlink written 12 months ago by ferbecneu0

Looks like this is a COMMENT rather than Answer; please use the 'Add comment' button for comments.

To get some example data, I entered the following into my R session

library(ShortRead)
example(fastqFilter)

This creates a variable that points to a small fastq file; I read it in and extracted the short read sequences, just to get a look

> sr = readFastq(fl)
> sread(sr)
  A DNAStringSet instance of length 256
      width seq
  [1]    36 GGACTTTGTAGGATACCCTCGCTTTCCTTCTCCTGT
  [2]    36 GATTTCTTACCTATTAGTGGTTGAACAGCATCGGAC
  [3]    36 GCGGTGGTCTATAGTGTTATTAATATCAATTTGGGT
  [4]    36 GTTACCATGATGTTATTTCTTCATTTGGAGGTAAAA
  [5]    36 GTATGTTTCTCCTGCTTATCACCTTCTTGAAGGCTT
  ...   ... ...
[252]    36 GTTTAGATATGAGTCACATTTTGTTCATGGTAGAGT
[253]    36 GTTTTACAGACACCTAAAGCTACATCGTCAACGTTA
[254]    36 GATGAACTAAGTCAACCTCAGCACTAACCTTGCGAG
[255]    36 GTTTGGTTCGCTTTGAGTCTTCTTCGGTTCCGACTA
[256]    36 GCAATCTGCCGACCACTCGCGATTCAATCATGACTT

Suppose I wanted to get rid of sequences with "TTACC". I could use the grepl() function to find the reads that do not contain this pattern, and subset the original reads to get those that I want to keep

> sr[!grepl("TTACC", sread(sr))]
class: ShortReadQ
length: 244 reads; width: 36 cycles

So it looks like there are 244 reads that satisfy my criterion. Now write a function that does this, and test...

fun <- function(x)
    x[ !grepl("TTACC", sread(x)) ]

Verify that it works

> fun(sr)
class: ShortReadQ
length: 244 reads; width: 36 cycles

Then use it in the filterFastq function, creating a new fastq file with the filtered results

> out <- tempfile()
> filterFastq(fl, out, filter=fun)
[1] "/var/folders/yn/gmsh_22s2c55v816r6d51fx1tnyl61/T//Rtmpnfw9lM/file172496fd16a17"
attr(,"filter")
                 Reads KeptReads Nucl KeptNucl
s_1_sequence.txt   256       244 9216     8784

verify that the output file contains the correct number of sequences

> readFastq(out)
class: ShortReadQ
length: 244 reads; width: 36 cycles
ADD REPLYlink written 12 months ago by Martin Morgan ♦♦ 23k

Thank you very much! This was super useful for me :D

ADD REPLYlink written 12 months ago by ferbecneu0
Please log in to add an answer.

Help
Access

Use of this site constitutes acceptance of our User Agreement and Privacy Policy.
Powered by Biostar version 16.09
Traffic: 175 users visited in the last hour