Trimming with ShortRead
1
2
Entering edit mode
cookie ▴ 20
@cookie-7508
Last seen 7.5 years ago
USA

Hi,


I am using the ShortRead package to analyze my fastq files. Could someone explain to me how does the function trimtails in this package function?

My plan is find certain positions with a lower quality score and to trim these positions from my fastq file. How can one do this? I am rather confused with the function trimtails because the explanation seems rather vague. If this is not the function for trimming certain positions, how can I achive this?

 

 

Thanks a lot!

cookie

shortread • 2.3k views
ADD COMMENT
3
Entering edit mode
@martin-morgan-1513
Last seen 13 days ago
United States

The trim* functions remove nucleotides from the beginning or ending of reads only, not from the middle. The easiest to understand is trimEnds(), which might be used to remove low-quality bases at the beginning and / or end of the reads.

sp <- SolexaPath(system.file('extdata', package='ShortRead'))   # path to a fastq
rfq <- readFastq(analysisPath(sp), pattern="s_1_sequence.txt")  # input data

This is what the quality scores look like

> quality(rfq)
class: SFastqQuality
quality:
  A BStringSet instance of length 256
      width seq
  [1]    36 ]]]]]]]]]]]]Y]Y]]]]]]]]]]]]VCHVMPLAS
  [2]    36 ]]]]]]]]]]]]Y]]]]]]]]]YPV]T][PZPICCK
  [3]    36 ]]]]Y]]]]]V]T]]]]]T]]]]]V]TMJEUXEFLA
  [4]    36 ]]]]]]]]]]]]]]]]]]]]]]T]]]]RJRZTQLOA
  [5]    36 ]]]]]]]]]]]]]]]]]T]]]]]]]]]]MJUJVLSS
  ...   ... ...
[252]    36 ]]]]]]]]]]]Y]]]]]NY]]]]Y]VR]MJQNSAOC
[253]    36 ]]]]]]]]]]]]]]]]]]]]]]]Y]]]VTVVRVMSM
[254]    36 ]]]]Y]Y]]]]]]]OYY]]]Y]]]YYVVTSZUOOHH
[255]    36 ]]]]]]]]]]]]]]]T]]]]]]]]]V]T[OVXEJSJ
[256]    36 ]]]]]]]]]]]Y]]T]T]]]]TRYVMEVVRSRHHNH

and the available 'alphabet' representing the qualities, from low quality to high quaility

> alphabet(quality(rfq))
 [1] " "  "!"  "\"" "#"  "$"  "%"  "&"  "'"  "("  ")"  "*"  "+"  ","  "-"  "." 
[16] "/"  "0"  "1"  "2"  "3"  "4"  "5"  "6"  "7"  "8"  "9"  ":"  ";"  "<"  "=" 
[31] ">"  "?"  "@"  "A"  "B"  "C"  "D"  "E"  "F"  "G"  "H"  "I"  "J"  "K"  "L" 
[46] "M"  "N"  "O"  "P"  "Q"  "R"  "S"  "T"  "U"  "V"  "W"  "X"  "Y"  "Z"  "[" 
[61] "\\" "]"  "^"  "_"  "`"  "a"  "b"  "c"  "d"  "e"  "f"  "g"  "h"  "i"  "j" 
[76] "k"  "l"  "m"  "n"  "o"  "p"  "q"  "r"  "s"  "t"  "u"  "v"  "w"  "x"  "y" 
[91] "z"  "{"  "|"  "}" 

See this wiki entry for description of how these codes translate into probabilities. So to trim tails with quality less than the letter 'L', for instance,

> rfq1 = trimEnds(rfq, "L")
> quality(rfq1)
class: SFastqQuality
quality:
  A BStringSet instance of length 256
      width seq
  [1]    36 ]]]]]]]]]]]]Y]Y]]]]]]]]]]]]VCHVMPLAS
  [2]    32 ]]]]]]]]]]]]Y]]]]]]]]]YPV]T][PZP
  [3]    32 ]]]]Y]]]]]V]T]]]]]T]]]]]V]TMJEUX
  [4]    35 ]]]]]]]]]]]]]]]]]]]]]]T]]]]RJRZTQLO
  [5]    36 ]]]]]]]]]]]]]]]]]T]]]]]]]]]]MJUJVLSS
  ...   ... ...
[252]    35 ]]]]]]]]]]]Y]]]]]NY]]]]Y]VR]MJQNSAO
[253]    36 ]]]]]]]]]]]]]]]]]]]]]]]Y]]]VTVVRVMSM
[254]    34 ]]]]Y]Y]]]]]]]OYY]]]Y]]]YYVVTSZUOO
[255]    35 ]]]]]]]]]]]]]]]T]]]]]]]]]V]T[OVXEJS
[256]    35 ]]]]]]]]]]]Y]]T]T]]]]TRYVMEVVRSRHHN

The first record is not trimmed, because the last nucleotide has quality 'S' , which is higher than the threshold 'L'. The second record removes ICCK, all of which are less than L, etc.

The other trim* functions scan from left to right along the read, calculating a score for each nucleotide in the read. The read is trimmed at the first location where the score falls below a threshold. trimTailw() function calculates the score based on a sliding window, whereas trimTails() accumulates a score for all preceding nucleotides. 

These functions act on objects; if you'd like to go from an untrimmed fastq file to a trimmed fastq file you'd follow the instructions section 3.2 of the vignette.

 

 

 

 

ADD COMMENT
0
Entering edit mode

Hi, I would like to know if I can use this package to trim reads that have a primer sequence that  will be in all my reads due to the methodology followed for the generation of the sequencing library. Its a 4c-seq experiment and the primer sequence won´t allow me to align my reads to the reference genome. Greetings!!

ADD REPLY
0
Entering edit mode

If the primers are always at the same position, e.g., the first 10 nucleotides, use, e.g., narrow(srq, start = 11). Likely you would iterate through files, reading a chunk, trimming, writing the chunk, and repeating, as demonstrated in section 3.2 of the vignette.

ADD REPLY
0
Entering edit mode

Thanks Martin, It was very useful!

ADD REPLY

Login before adding your answer.

Traffic: 435 users visited in the last hour
Help About
FAQ
Access RSS
API
Stats

Use of this site constitutes acceptance of our User Agreement and Privacy Policy.

Powered by the version 2.3.6