Question

Trimming with ShortRead

2

Entering edit mode

cookie ▴ 20

@cookie-7508

Last seen 10.9 years ago

USA

Hi,

I am using the ShortRead package to analyze my fastq files. Could someone explain to me how does the function trimtails in this package function?

My plan is find certain positions with a lower quality score and to trim these positions from my fastq file. How can one do this? I am rather confused with the function trimtails because the explanation seems rather vague. If this is not the function for trimming certain positions, how can I achive this?

Thanks a lot!

cookie

shortread • 4.3k views

ADD COMMENT • link updated 10.8 years ago by Martin Morgan 25k • written 10.8 years ago by cookie ▴ 20

score 3 · Accepted Answer · 2015-04-03

The trim* functions remove nucleotides from the beginning or ending of reads only, not from the middle. The easiest to understand is trimEnds(), which might be used to remove low-quality bases at the beginning and / or end of the reads.

sp <- SolexaPath(system.file('extdata', package='ShortRead'))   # path to a fastq
rfq <- readFastq(analysisPath(sp), pattern="s_1_sequence.txt")  # input data

This is what the quality scores look like

> quality(rfq)
class: SFastqQuality
quality:
  A BStringSet instance of length 256
      width seq
  [1]    36 ]]]]]]]]]]]]Y]Y]]]]]]]]]]]]VCHVMPLAS
  [2]    36 ]]]]]]]]]]]]Y]]]]]]]]]YPV]T][PZPICCK
  [3]    36 ]]]]Y]]]]]V]T]]]]]T]]]]]V]TMJEUXEFLA
  [4]    36 ]]]]]]]]]]]]]]]]]]]]]]T]]]]RJRZTQLOA
  [5]    36 ]]]]]]]]]]]]]]]]]T]]]]]]]]]]MJUJVLSS
  ...   ... ...
[252]    36 ]]]]]]]]]]]Y]]]]]NY]]]]Y]VR]MJQNSAOC
[253]    36 ]]]]]]]]]]]]]]]]]]]]]]]Y]]]VTVVRVMSM
[254]    36 ]]]]Y]Y]]]]]]]OYY]]]Y]]]YYVVTSZUOOHH
[255]    36 ]]]]]]]]]]]]]]]T]]]]]]]]]V]T[OVXEJSJ
[256]    36 ]]]]]]]]]]]Y]]T]T]]]]TRYVMEVVRSRHHNH

and the available 'alphabet' representing the qualities, from low quality to high quaility

> alphabet(quality(rfq))
 [1] " "  "!"  "\"" "#"  "$"  "%"  "&"  "'"  "("  ")"  "*"  "+"  ","  "-"  "." 
[16] "/"  "0"  "1"  "2"  "3"  "4"  "5"  "6"  "7"  "8"  "9"  ":"  ";"  "<"  "=" 
[31] ">"  "?"  "@"  "A"  "B"  "C"  "D"  "E"  "F"  "G"  "H"  "I"  "J"  "K"  "L" 
[46] "M"  "N"  "O"  "P"  "Q"  "R"  "S"  "T"  "U"  "V"  "W"  "X"  "Y"  "Z"  "[" 
[61] "\\" "]"  "^"  "_"  "`"  "a"  "b"  "c"  "d"  "e"  "f"  "g"  "h"  "i"  "j" 
[76] "k"  "l"  "m"  "n"  "o"  "p"  "q"  "r"  "s"  "t"  "u"  "v"  "w"  "x"  "y" 
[91] "z"  "{"  "|"  "}"

See this wiki entry for description of how these codes translate into probabilities. So to trim tails with quality less than the letter 'L', for instance,

> rfq1 = trimEnds(rfq, "L")
> quality(rfq1)
class: SFastqQuality
quality:
  A BStringSet instance of length 256
      width seq
  [1]    36 ]]]]]]]]]]]]Y]Y]]]]]]]]]]]]VCHVMPLAS
  [2]    32 ]]]]]]]]]]]]Y]]]]]]]]]YPV]T][PZP
  [3]    32 ]]]]Y]]]]]V]T]]]]]T]]]]]V]TMJEUX
  [4]    35 ]]]]]]]]]]]]]]]]]]]]]]T]]]]RJRZTQLO
  [5]    36 ]]]]]]]]]]]]]]]]]T]]]]]]]]]]MJUJVLSS
  ...   ... ...
[252]    35 ]]]]]]]]]]]Y]]]]]NY]]]]Y]VR]MJQNSAO
[253]    36 ]]]]]]]]]]]]]]]]]]]]]]]Y]]]VTVVRVMSM
[254]    34 ]]]]Y]Y]]]]]]]OYY]]]Y]]]YYVVTSZUOO
[255]    35 ]]]]]]]]]]]]]]]T]]]]]]]]]V]T[OVXEJS
[256]    35 ]]]]]]]]]]]Y]]T]T]]]]TRYVMEVVRSRHHN

The first record is not trimmed, because the last nucleotide has quality 'S' , which is higher than the threshold 'L'. The second record removes ICCK, all of which are less than L, etc.

The other trim* functions scan from left to right along the read, calculating a score for each nucleotide in the read. The read is trimmed at the first location where the score falls below a threshold. trimTailw() function calculates the score based on a sliding window, whereas trimTails() accumulates a score for all preceding nucleotides.

These functions act on objects; if you'd like to go from an untrimmed fastq file to a trimmed fastq file you'd follow the instructions section 3.2 of the vignette.