I have been using Rsubread to perform alignment for a large number of paired-end gzFASTQ files using the align() function, and have been careful to log all of the output to keep track of summary information and any error/warning messages that may have been produced. Of the alignments which reported a possibly incorrect Phred Offset (using a default of +33), only one was actually incorrect; the remaining alignments, which there were ~4-5 of, (involving ~8-10 fq.gz files), all reported a perceived range of [35,84], which is not a valid score range. I decided to zcat the fq.gz files and use AWK extract every 4th line to a text file, then wrote and ran a script to keep track of the running values for minimum and maximum observed ascii values. In all cases, the range was actually 35-74. My hypothesis (for which I have no evidence) is that the quality scores are being read from a line containing read information, as 'T' is ASCII 84 and is also the highest-ASCII-value nucleotide abbreviation. I have the quality scores for the paired-end reads for one sample, as well as the R console output from a successful reproduction of this possible bug. The former are 333MB each, so I cannot provide them in their entirety; however, I can provide the full output. The files in question were obtained from the PCGC data hub, so anyone with access to said data hub may find them there. I am including the two remote filepaths (the names are slightly different from the ones I have, but they are the same files) so that anyone with an account may retrieve them on their own with ExpeDat (https://pcgc.research.chop.edu/data_expedition), as long as they have a PCGC account. Here are the remote paths:
resrhdxp01.research.chop.edu:/pcgc/public/Other/transcriptome/fastq/PCGC0065312_HS_TX__1-02059__v1_FC427_L8_p3of6_P2.fastq.gz resrhdxp01.research.chop.edu:/pcgc/public/Other/transcriptome/fastq/PCGC0065308_HS_TX__1-02059__v1_FC427_L8_p3of6_P1.fastq.gz
The command I ran to extract every fourth line is
zcat $1 | awk '!(NR%4)'
The script I used to calculate the minimum and maximum values was written in haskell, and the source is short enough that I will include it here:
import System.IO
import Control.Monad
import Data.Char
import Data.List
main = getLine >>= l.minmax
l t = do
let (m,m') = t in putStrLn (show (ord m) ++ "-" ++ show (ord m'))
foe <- isEOF
when (not foe) (getLine >>= l.(mim t).minmax)
minmax (h:x) = foldl' mm (h,h) x
where
mm (a,b) c | c < a = (c,b) | c > b = (a,c) | otherwise = (a,b)
mim (a,b) (c,d) = (min a c, max b d)
If you save this to "phred.hs", then compile it with "ghc phred.hs", you can run
zcat $1 | awk '!(NR%4)' | ./phred | uniq
to view the running cumulative minimum and maximum ascii values. If any information I provided here is insufficiently descriptive or ambiguously worded, or if more information is necessary, I will be happy to provide whatever information I can to help resolve this bug.
Here is the output of my R session:
R version 3.1.1 (2014-07-10) -- "Sock it to Me"
Copyright (C) 2014 The R Foundation for Statistical Computing
Platform: x86_64-unknown-linux-gnu (64-bit)
R is free software and comes with ABSOLUTELY NO WARRANTY.
You are welcome to redistribute it under certain conditions.
Type 'license()' or 'licence()' for distribution details.
R is a collaborative project with many contributors.
Type 'contributors()' for more information and
'citation()' on how to cite R or R packages in publications.
Type 'demo()' for some demos, 'help()' for on-line help, or
'help.start()' for an HTML browser interface to help.
Type 'q()' to quit R.
> require(Rsubread)
Loading required package: Rsubread
> align(index = "genome",
+ readfile1 = "120521--FC427--L8--3of6--1-02059--RA--SeqInst3--R1.fq.gz",
+ readfile2 = "120521--FC427--L8--3of6--1-02059--RA--SeqInst3--R2.fq.gz",
+ input_format = "gzFASTQ",
+ output_format = "BAM",
+ output_file = "test.bam",
+ tieBreakHamming = TRUE,
+ unique = TRUE,
+ indels = 5,
+ nthreads = 1,
+ PE_orientation = "fr")
========== _____ _ _ ____ _____ ______ _____
===== / ____| | | | _ \| __ \| ____| /\ | __ \
===== | (___ | | | | |_) | |__) | |__ / \ | | | |
==== \___ \| | | | _ <| _ /| __| / /\ \ | | | |
==== ____) | |__| | |_) | | \ \| |____ / ____ \| |__| |
========== |_____/ \____/|____/|_| \_\______/_/ \_\_____/
Rsubread 1.15.9
//========================== subread-align setting ===========================\\
|| ||
|| Function : Read alignment ||
|| Threads : 1 ||
|| Input file 1 : 120521--FC427--L8--3of6--1-02059--RA--SeqInst3--R ... ||
|| Input file 2 : 120521--FC427--L8--3of6--1-02059--RA--SeqInst3--R ... ||
|| Output file : test.bam (BAM) ||
|| Index name : genome ||
|| Phred offset : 33 ||
|| ||
|| Min read1 votes : 3 ||
|| Min read2 votes : 1 ||
|| Max fragment size : 600 ||
|| Min fragment size : 50 ||
|| ||
|| Max indels : 5 ||
|| # of Best mapping : 1 ||
|| Unique mapping : yes ||
|| Hamming distance : yes ||
|| Quality scores : no ||
|| ||
\\===================== http://subread.sourceforge.net/ ======================//
//====================== Running (11-Nov-2014 14:44:53) ======================\\
|| ||
|| Decompress 120521--FC427--L8--3of6--1-02059--RA--SeqInst3--R1.fq.gz... ||
|| The input file contains base space reads. ||
|| Decompress 120521--FC427--L8--3of6--1-02059--RA--SeqInst3--R2.fq.gz... ||
|| WARNING The specified phred-score offset (33) seems to be incorrect. ||
|| The observed phred-score range is [35,84]. ||
|| ||
-------
ABORTED
-------
> require(Rsubread)
> sessionInfo()
R version 3.1.1 (2014-07-10)
Platform: x86_64-unknown-linux-gnu (64-bit)
locale:
[1] C
attached base packages:
[1] stats graphics grDevices utils datasets methods base
other attached packages:
[1] Rsubread_1.15.9
