Search
2
3.6 years ago by
USA

Hi,

I am using the ShortRead package to analyze my fastq files. Could someone explain to me how does the function trimtails in this package function?

My plan is find certain positions with a lower quality score and to trim these positions from my fastq file. How can one do this? I am rather confused with the function trimtails because the explanation seems rather vague. If this is not the function for trimming certain positions, how can I achive this?

Thanks a lot!

modified 3.5 years ago by Martin Morgan ♦♦ 22k • written 3.6 years ago by cookie20
3
3.5 years ago by
Martin Morgan ♦♦ 22k
United States
Martin Morgan ♦♦ 22k wrote:

The trim* functions remove nucleotides from the beginning or ending of reads only, not from the middle. The easiest to understand is trimEnds(), which might be used to remove low-quality bases at the beginning and / or end of the reads.

sp <- SolexaPath(system.file('extdata', package='ShortRead'))   # path to a fastq
rfq <- readFastq(analysisPath(sp), pattern="s_1_sequence.txt")  # input data

This is what the quality scores look like

> quality(rfq)
class: SFastqQuality
quality:
A BStringSet instance of length 256
width seq
[1]    36 ]]]]]]]]]]]]Y]Y]]]]]]]]]]]]VCHVMPLAS
[2]    36 ]]]]]]]]]]]]Y]]]]]]]]]YPV]T][PZPICCK
[3]    36 ]]]]Y]]]]]V]T]]]]]T]]]]]V]TMJEUXEFLA
[4]    36 ]]]]]]]]]]]]]]]]]]]]]]T]]]]RJRZTQLOA
[5]    36 ]]]]]]]]]]]]]]]]]T]]]]]]]]]]MJUJVLSS
...   ... ...
[252]    36 ]]]]]]]]]]]Y]]]]]NY]]]]Y]VR]MJQNSAOC
[253]    36 ]]]]]]]]]]]]]]]]]]]]]]]Y]]]VTVVRVMSM
[254]    36 ]]]]Y]Y]]]]]]]OYY]]]Y]]]YYVVTSZUOOHH
[255]    36 ]]]]]]]]]]]]]]]T]]]]]]]]]V]T[OVXEJSJ
[256]    36 ]]]]]]]]]]]Y]]T]T]]]]TRYVMEVVRSRHHNH

and the available 'alphabet' representing the qualities, from low quality to high quaility

> alphabet(quality(rfq))
[1] " "  "!"  "\"" "#"  "\$"  "%"  "&"  "'"  "("  ")"  "*"  "+"  ","  "-"  "."
[16] "/"  "0"  "1"  "2"  "3"  "4"  "5"  "6"  "7"  "8"  "9"  ":"  ";"  "<"  "="
[31] ">"  "?"  "@"  "A"  "B"  "C"  "D"  "E"  "F"  "G"  "H"  "I"  "J"  "K"  "L"
[46] "M"  "N"  "O"  "P"  "Q"  "R"  "S"  "T"  "U"  "V"  "W"  "X"  "Y"  "Z"  "["
[61] "\\" "]"  "^"  "_"  ""  "a"  "b"  "c"  "d"  "e"  "f"  "g"  "h"  "i"  "j"
[76] "k"  "l"  "m"  "n"  "o"  "p"  "q"  "r"  "s"  "t"  "u"  "v"  "w"  "x"  "y"
[91] "z"  "{"  "|"  "}"


See this wiki entry for description of how these codes translate into probabilities. So to trim tails with quality less than the letter 'L', for instance,

> rfq1 = trimEnds(rfq, "L")
> quality(rfq1)
class: SFastqQuality
quality:
A BStringSet instance of length 256
width seq
[1]    36 ]]]]]]]]]]]]Y]Y]]]]]]]]]]]]VCHVMPLAS
[2]    32 ]]]]]]]]]]]]Y]]]]]]]]]YPV]T][PZP
[3]    32 ]]]]Y]]]]]V]T]]]]]T]]]]]V]TMJEUX
[4]    35 ]]]]]]]]]]]]]]]]]]]]]]T]]]]RJRZTQLO
[5]    36 ]]]]]]]]]]]]]]]]]T]]]]]]]]]]MJUJVLSS
...   ... ...
[252]    35 ]]]]]]]]]]]Y]]]]]NY]]]]Y]VR]MJQNSAO
[253]    36 ]]]]]]]]]]]]]]]]]]]]]]]Y]]]VTVVRVMSM
[254]    34 ]]]]Y]Y]]]]]]]OYY]]]Y]]]YYVVTSZUOO
[255]    35 ]]]]]]]]]]]]]]]T]]]]]]]]]V]T[OVXEJS
[256]    35 ]]]]]]]]]]]Y]]T]T]]]]TRYVMEVVRSRHHN

The first record is not trimmed, because the last nucleotide has quality 'S' , which is higher than the threshold 'L'. The second record removes ICCK, all of which are less than L, etc.

The other trim* functions scan from left to right along the read, calculating a score for each nucleotide in the read. The read is trimmed at the first location where the score falls below a threshold. trimTailw() function calculates the score based on a sliding window, whereas trimTails() accumulates a score for all preceding nucleotides.

These functions act on objects; if you'd like to go from an untrimmed fastq file to a trimmed fastq file you'd follow the instructions section 3.2 of the vignette.

Hi, I would like to know if I can use this package to trim reads that have a primer sequence that  will be in all my reads due to the methodology followed for the generation of the sequencing library. Its a 4c-seq experiment and the primer sequence won´t allow me to align my reads to the reference genome. Greetings!!

If the primers are always at the same position, e.g., the first 10 nucleotides, use, e.g., narrow(srq, start = 11)`. Likely you would iterate through files, reading a chunk, trimming, writing the chunk, and repeating, as demonstrated in section 3.2 of the vignette.