Search
Question: Trimming with ShortRead
2
gravatar for cookie
3.6 years ago by
cookie20
USA
cookie20 wrote:

Hi,


I am using the ShortRead package to analyze my fastq files. Could someone explain to me how does the function trimtails in this package function?

My plan is find certain positions with a lower quality score and to trim these positions from my fastq file. How can one do this? I am rather confused with the function trimtails because the explanation seems rather vague. If this is not the function for trimming certain positions, how can I achive this?

 

 

Thanks a lot!

cookie

ADD COMMENTlink modified 3.5 years ago by Martin Morgan ♦♦ 22k • written 3.6 years ago by cookie20
3
gravatar for Martin Morgan
3.5 years ago by
Martin Morgan ♦♦ 22k
United States
Martin Morgan ♦♦ 22k wrote:

The trim* functions remove nucleotides from the beginning or ending of reads only, not from the middle. The easiest to understand is trimEnds(), which might be used to remove low-quality bases at the beginning and / or end of the reads.

sp <- SolexaPath(system.file('extdata', package='ShortRead'))   # path to a fastq
rfq <- readFastq(analysisPath(sp), pattern="s_1_sequence.txt")  # input data

This is what the quality scores look like

> quality(rfq)
class: SFastqQuality
quality:
  A BStringSet instance of length 256
      width seq
  [1]    36 ]]]]]]]]]]]]Y]Y]]]]]]]]]]]]VCHVMPLAS
  [2]    36 ]]]]]]]]]]]]Y]]]]]]]]]YPV]T][PZPICCK
  [3]    36 ]]]]Y]]]]]V]T]]]]]T]]]]]V]TMJEUXEFLA
  [4]    36 ]]]]]]]]]]]]]]]]]]]]]]T]]]]RJRZTQLOA
  [5]    36 ]]]]]]]]]]]]]]]]]T]]]]]]]]]]MJUJVLSS
  ...   ... ...
[252]    36 ]]]]]]]]]]]Y]]]]]NY]]]]Y]VR]MJQNSAOC
[253]    36 ]]]]]]]]]]]]]]]]]]]]]]]Y]]]VTVVRVMSM
[254]    36 ]]]]Y]Y]]]]]]]OYY]]]Y]]]YYVVTSZUOOHH
[255]    36 ]]]]]]]]]]]]]]]T]]]]]]]]]V]T[OVXEJSJ
[256]    36 ]]]]]]]]]]]Y]]T]T]]]]TRYVMEVVRSRHHNH

and the available 'alphabet' representing the qualities, from low quality to high quaility

> alphabet(quality(rfq))
 [1] " "  "!"  "\"" "#"  "$"  "%"  "&"  "'"  "("  ")"  "*"  "+"  ","  "-"  "." 
[16] "/"  "0"  "1"  "2"  "3"  "4"  "5"  "6"  "7"  "8"  "9"  ":"  ";"  "<"  "=" 
[31] ">"  "?"  "@"  "A"  "B"  "C"  "D"  "E"  "F"  "G"  "H"  "I"  "J"  "K"  "L" 
[46] "M"  "N"  "O"  "P"  "Q"  "R"  "S"  "T"  "U"  "V"  "W"  "X"  "Y"  "Z"  "[" 
[61] "\\" "]"  "^"  "_"  "`"  "a"  "b"  "c"  "d"  "e"  "f"  "g"  "h"  "i"  "j" 
[76] "k"  "l"  "m"  "n"  "o"  "p"  "q"  "r"  "s"  "t"  "u"  "v"  "w"  "x"  "y" 
[91] "z"  "{"  "|"  "}" 

See this wiki entry for description of how these codes translate into probabilities. So to trim tails with quality less than the letter 'L', for instance,

> rfq1 = trimEnds(rfq, "L")
> quality(rfq1)
class: SFastqQuality
quality:
  A BStringSet instance of length 256
      width seq
  [1]    36 ]]]]]]]]]]]]Y]Y]]]]]]]]]]]]VCHVMPLAS
  [2]    32 ]]]]]]]]]]]]Y]]]]]]]]]YPV]T][PZP
  [3]    32 ]]]]Y]]]]]V]T]]]]]T]]]]]V]TMJEUX
  [4]    35 ]]]]]]]]]]]]]]]]]]]]]]T]]]]RJRZTQLO
  [5]    36 ]]]]]]]]]]]]]]]]]T]]]]]]]]]]MJUJVLSS
  ...   ... ...
[252]    35 ]]]]]]]]]]]Y]]]]]NY]]]]Y]VR]MJQNSAO
[253]    36 ]]]]]]]]]]]]]]]]]]]]]]]Y]]]VTVVRVMSM
[254]    34 ]]]]Y]Y]]]]]]]OYY]]]Y]]]YYVVTSZUOO
[255]    35 ]]]]]]]]]]]]]]]T]]]]]]]]]V]T[OVXEJS
[256]    35 ]]]]]]]]]]]Y]]T]T]]]]TRYVMEVVRSRHHN

The first record is not trimmed, because the last nucleotide has quality 'S' , which is higher than the threshold 'L'. The second record removes ICCK, all of which are less than L, etc.

The other trim* functions scan from left to right along the read, calculating a score for each nucleotide in the read. The read is trimmed at the first location where the score falls below a threshold. trimTailw() function calculates the score based on a sliding window, whereas trimTails() accumulates a score for all preceding nucleotides. 

These functions act on objects; if you'd like to go from an untrimmed fastq file to a trimmed fastq file you'd follow the instructions section 3.2 of the vignette.

 

 

 

 

ADD COMMENTlink modified 3.5 years ago • written 3.5 years ago by Martin Morgan ♦♦ 22k

Hi, I would like to know if I can use this package to trim reads that have a primer sequence that  will be in all my reads due to the methodology followed for the generation of the sequencing library. Its a 4c-seq experiment and the primer sequence won´t allow me to align my reads to the reference genome. Greetings!!

ADD REPLYlink written 4 months ago by ferbecneu0

If the primers are always at the same position, e.g., the first 10 nucleotides, use, e.g., narrow(srq, start = 11). Likely you would iterate through files, reading a chunk, trimming, writing the chunk, and repeating, as demonstrated in section 3.2 of the vignette.

ADD REPLYlink written 4 months ago by Martin Morgan ♦♦ 22k

Thanks Martin, It was very useful!

ADD REPLYlink written 15 days ago by ferbecneu0
Please log in to add an answer.

Help
Access

Use of this site constitutes acceptance of our User Agreement and Privacy Policy.
Powered by Biostar version 2.2.0
Traffic: 164 users visited in the last hour