error in report(qa) from pkg ShortRead
@timotheeflutre-6727
Last seen 5.0 years ago
France
Hello,

I have a fastq file compressed with gzip in a directory named test/. I would like to assess its quality. Here is what I do:

    $ R
    > library(ShortRead)
    > qa <- qa("~/test", "fastq.gz")
    > report(qa, dest="~/test")

And I get the following error message:

    Error in as.data.frame(lapply(df, sprintf, fmt = fmt)) :
      error in evaluating the argument 'x' in selecting a method for
      function 'as.data.frame': Error in FUN(X[[1L]], ...) :
      invalid format '%.3g'; use format %s for character objects

Here are more details:

    > traceback()
    7: as.data.frame(lapply(df, sprintf, fmt = fmt))
    6: .df2a(qa[["adapterContamination"]])
    5: hwrite(.df2a(qa[["adapterContamination"]]), border = 0)
    4: func(x, dest, type, ...)
    3: func(x, dest, type, ...)
    2: report(qa, dest = "~/test")
    1: report(qa, dest = "~/test")

    > sessionInfo()
    R version 3.1.0 (2014-04-10)
    Platform: x86_64-unknown-linux-gnu (64-bit)

    locale:
     [1] LC_CTYPE=en_US.UTF-8       LC_NUMERIC=C
     [3] LC_TIME=en_US.UTF-8        LC_COLLATE=en_US.UTF-8
     [5] LC_MONETARY=en_US.UTF-8    LC_MESSAGES=en_US.UTF-8
     [7] LC_PAPER=en_US.UTF-8       LC_NAME=C
     [9] LC_ADDRESS=C               LC_TELEPHONE=C
    [11] LC_MEASUREMENT=en_US.UTF-8 LC_IDENTIFICATION=C

    attached base packages:
    [1] parallel  stats     graphics  grDevices utils     datasets  methods
    [8] base

    other attached packages:
     [1] ShortRead_1.22.0        GenomicAlignments_1.0.2 BSgenome_1.32.0
     [4] Rsamtools_1.16.1        GenomicRanges_1.16.3    GenomeInfoDb_1.0.2
     [7] Biostrings_2.32.1       XVector_0.4.0           IRanges_1.22.9
    [10] BiocParallel_0.6.1      BiocGenerics_0.10.0

    loaded via a namespace (and not attached):
     [1] BatchJobs_1.3       BBmisc_1.7          Biobase_2.24.0
     [4] bitops_1.0-6        brew_1.0-6          checkmate_1.1
     [7] codetools_0.2-8     compiler_3.1.0      DBI_0.2-7
    [10] digest_0.6.4        fail_1.2            foreach_1.4.2
    [13] grid_3.1.0          hwriter_1.3.1       iterators_1.0.7
    [16] lattice_0.20-29     latticeExtra_0.6-26 RColorBrewer_1.0-5
    [19] RSQLite_0.11.4      sendmailR_1.1-2     stats4_3.1.0
    [22] stringr_0.6.2       tools_3.1.0         zlibbioc_1.10.0

Timothée Flutre
Chargé de Recherche / Research Scientist
INRA - Centre de Montpellier
http://umr-agap.cirad.fr/en
http://openwetware.org/wiki/User:Timothee_Flutre
@martin-morgan-1513
Last seen 16 days ago
United States
On 09/11/2014 08:38 AM, Timothée Flutre wrote:
> Hello,
>
> I have a fastq file compressed with gzip in a directory named test/.
> I would like to assess its quality. Here is what I do:
>
> $ R
>> library(ShortRead)
>> qa <- qa("~/test", "fastq.gz")
>> report(qa, dest="~/test")
>
> And I get the following error message:
> Error in as.data.frame(lapply(df, sprintf, fmt = fmt)) :
>   error in evaluating the argument 'x' in selecting a method for
>   function 'as.data.frame': Error in FUN(X[[1L]], ...) :
>   invalid format '%.3g'; use format %s for character objects
>
> Here are more details:
>> traceback()
> 7: as.data.frame(lapply(df, sprintf, fmt = fmt))
> 6: .df2a(qa[["adapterContamination"]])
> 5: hwrite(.df2a(qa[["adapterContamination"]]), border = 0)

I see you are in the half of the R users who dislike factors! I think

    qa[["adapterContamination"]]

should be a data.frame with a single column 'contamination', and that the single column should be a factor or numeric; I think you have set

    options(stringsAsFactors=FALSE)

and so instead of a factor or numeric it is character.

The workaround is to set options(stringsAsFactors=TRUE) (or not set this option at all!). This will be fixed in the next release of ShortRead.

Thanks for the report, and sorry for the inconvenience.

Martin

> 4: func(x, dest, type, ...)
> 3: func(x, dest, type, ...)
> 2: report(qa, dest = "~/test")
> 1: report(qa, dest = "~/test")
>
> _______________________________________________
> Bioconductor mailing list
> Bioconductor at r-project.org
> https://stat.ethz.ch/mailman/listinfo/bioconductor
> Search the archives:
> http://news.gmane.org/gmane.science.biology.informatics.conductor

--
Computational Biology / Fred Hutchinson Cancer Research Center
1100 Fairview Ave. N.
PO Box 19024 Seattle, WA 98109

Location: Arnold Building M1 B861
Phone: (206) 667-2793
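
The failure described in this reply can be reproduced in base R without ShortRead. The sketch below is only an illustration: the one-row data frames are hypothetical stand-ins for qa[["adapterContamination"]], and the stringsAsFactors argument of data.frame() is used in place of the global option so that the example behaves the same way across R versions.

```r
# Hypothetical stand-ins for qa[["adapterContamination"]]; the real table is
# built by ShortRead and its contents will differ.
as_factor <- data.frame(contamination = "0.01", stringsAsFactors = TRUE)
as_char   <- data.frame(contamination = "0.01", stringsAsFactors = FALSE)

class(as_factor$contamination)  # "factor"
class(as_char$contamination)    # "character"

# Applying a numeric format to a character column raises the error seen in
# the traceback; the message is captured here instead of stopping the script.
msg <- tryCatch(
    sprintf("%.3g", as_char$contamination),
    error = function(e) conditionMessage(e)
)
msg

# The same format works once the column is numeric.
formatted <- sprintf("%.3g", as.numeric(as_char$contamination))
formatted  # "0.01"
```

The point is only that the column type depends on how strings were handled when the data frame was built, which is why a setting in ~/.Rprofile can change the outcome of report().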
Thanks, I removed stringsAsFactors=TRUE from my ~/.Rprofile!

When I have several files, I encountered the following error:

    > files <- dir("~/test", "*.fastq.gz$", full=TRUE)
    > qas <- qaSummary(files, type="fastq.gz")
    Error: could not find function "qaSummary"

Even though it is present in the Overview vignette (http://www.bioconductor.org/packages/release/bioc/vignettes/ShortRead/inst/doc/Overview.pdf). I guess that qaSummary() is in fact deprecated in favor of qa(), right?

Moreover, when fed with several large fastq files, qa() seems much slower than FastQC. Would it be possible to add a progress bar to qa()? For instance, via this function (http://stat.ethz.ch/R-manual/R-patched/library/utils/html/txtProgressBar.html) or this package (http://cran.r-project.org/web/packages/pbapply/)? I had a quick look at the ShortRead pkg source code, but couldn't find easily where to add this.

I also tried to get a sense of the time it takes to run a single file, but encountered the following error:

    > system.time(qa <- qa(dirPath="~/test",
    +     pattern="RPI2_S1_L001_R1_001.fastq.gz", type="fastq", sample=TRUE))
       user  system elapsed
     26.719   0.490  22.565
    > system.time(qa <- qa(dirPath="~/test",
    +     pattern="RPI2_S1_L001_R1_001.fastq.gz", type="fastq", sample=FALSE))
    Error: 1 errors; first error:
      Error: UserArgumentMismatch
      'pattern' must be 'character(0) or character(1)'
    For more information, use bplasterror(). To resume calculation, re-call
      the function and set the argument 'BPRESUME' to TRUE or wrap the
      previous call in bpresume().
    First traceback:
      28: system.time(qa <- qa(dirPath = "~/test",
              pattern = "RPI2_S1_L001_R1_001.fastq.gz", type = "fastq",
              sample = FALSE))
      27: qa(dirPath = "~/test",
              pattern = "RPI2_S1_L001_R1_001.fastq.gz", type = "fastq",
              sample = FALSE)
      26: qa(dirPath = "~/test",
              pattern = "RPI2_S1_L001_R1_001.fastq.gz", type = "fastq",
              sample = FALSE)
      25: .local(dirPath, ...)
      24: .qa_fastq(dirPath, pattern, type = type, ...)
      23: bplapply(fls, .qa_fastq_lane, type = type, ..., verbose = verbose)
      22: bplapply(fls
    Timing stopped at: 0.013 0 0.013
    > bplasterror()
    0 / 1 partial results stored. First 1 error messages:
    [1]: Error: UserArgumentMismatch
      'pattern' must be 'character(0) or character(1)'

I don't understand why the same command works with sample=TRUE, but doesn't with sample=FALSE.

Timothée Flutre
@martin-morgan-1513
Last seen 16 days ago
United States

I answered Timothée via email, and am including the answer here for reference.

On 09/12/2014 05:40 AM, Timothée Flutre wrote:
> Thanks, I removed stringsAsFactors=TRUE from my ~/.Rprofile!
>
> When I have several files, I encountered the following error:
>> files <- dir("~/test", "*.fastq.gz$", full=TRUE)
>> qas <- qaSummary(files, type="fastq.gz")
> Error: could not find function "qaSummary"
>
> Even though it is present in the Overview vignette
> (http://www.bioconductor.org/packages/release/bioc/vignettes/ShortRead/inst/doc/Overview.pdf).
>
> I guess that qaSummary() is in fact deprecated in favor of qa(), right?

I think in that document it's used as a variable name, not a function? I don't think there's ever been a qaSummary function.
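
For collecting several files before handing them to qa(), a minimal base-R sketch follows. The temporary directory and file names are made up for illustration. One detail worth noting: the pattern argument of dir() / list.files() is a regular expression rather than a shell glob, and utils::glob2rx() converts one into the other. The commented qa() call at the end is an assumption about passing a vector of file paths and is not run here; check ?qa for the installed version.

```r
# Build a throwaway directory with hypothetical file names.
tmp <- tempfile("fastq-demo-")
dir.create(tmp)
invisible(file.create(file.path(tmp, c("a.fastq.gz", "b.fastq.gz", "notes.txt"))))

# Convert the shell-style glob into the equivalent regular expression.
rx <- utils::glob2rx("*.fastq.gz")
rx  # "^.*\\.fastq\\.gz$"

# Only the two .fastq.gz files match; notes.txt is skipped.
files <- list.files(tmp, pattern = rx, full.names = TRUE)
basename(files)

# With ShortRead installed, the result would be stored in a variable
# (here called qas) rather than obtained from a qaSummary() function:
# library(ShortRead)
# qas <- qa(files, type = "fastq")
# report(qas, dest = file.path(tmp, "qa-report"))
```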

>
> Moreover, when fed with several large fastq files, qa() seems much slower than
> FastQC. Would it be possible to add a progress bar to qa()? For instance, via
> this function
> (http://stat.ethz.ch/R-manual/R-patched/library/utils/html/txtProgressBar.html)
> or this package (http://cran.r-project.org/web/packages/pbapply/)? I had a quick
> look at the ShortRead pkg source code, but couldn't find easily where to add this.

I'm not sure a progress bar would make it go faster but yes, that's a good idea; it should be added to BiocParallel. Adding verbose=TRUE will report as each file is being processed.
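
Until something like this exists in the packages themselves, a progress bar can be wrapped around a per-file loop using utils::txtProgressBar(), the function linked in the question. This is a sketch under stated assumptions, not ShortRead's or BiocParallel's API: with_progress() is a hypothetical helper, it processes files serially (so it gives up parallel evaluation), and nchar() stands in for the real per-file work.

```r
# Hypothetical helper: apply `fun` to each file in turn and advance a text
# progress bar after each file finishes. Requires at least one file, because
# txtProgressBar() needs max > min.
with_progress <- function(files, fun) {
    pb <- utils::txtProgressBar(min = 0, max = length(files), style = 3)
    on.exit(close(pb))
    results <- vector("list", length(files))
    for (i in seq_along(files)) {
        results[[i]] <- fun(files[[i]])
        utils::setTxtProgressBar(pb, i)
    }
    names(results) <- basename(files)
    results
}

# nchar() is only a placeholder for something slow, such as a per-file
# quality assessment.
demo <- with_progress(c("a.fastq.gz", "b.fastq.gz"), nchar)
```

In real use, `fun` could be something like function(f) qa(f, type = "fastq"), at the cost of running one file at a time.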

> I also tried to get a sense of the time it takes to run a single file, but
> encountered the following error:
>> system.time(qa <- qa(dirPath="~/test", pattern="RPI2_S1_L001_R1_001.fastq.gz",
>> type="fastq", sample=TRUE))
>     user  system elapsed
>   26.719   0.490  22.565
>> system.time(qa <- qa(dirPath="~/test", pattern="RPI2_S1_L001_R1_001.fastq.gz",
>> type="fastq", sample=FALSE))
> Error: 1 errors; first error:
>    Error: UserArgumentMismatch
>    'pattern' must be 'character(0) or character(1)'
> For more information, use bplasterror(). To resume calculation, re-call
>    the function and set the argument 'BPRESUME' to TRUE or wrap the
>    previous call in bpresume().
> First traceback:
>    28: system.time(qa <- qa(dirPath = "~/test",
>            pattern = "RPI2_S1_L001_R1_001.fastq.gz", type = "fastq",
>            sample = FALSE))
>    27: qa(dirPath = "~/test",
>            pattern = "RPI2_S1_L001_R1_001.fastq.gz", type = "fastq",
>            sample = FALSE)
>    26: qa(dirPath = "~/test",
>            pattern = "RPI2_S1_L001_R1_001.fastq.gz", type = "fastq",
>            sample = FALSE)
>    25: .local(dirPath, ...)
>    24: .qa_fastq(dirPath, pattern, type = type, ...)
>    23: bplapply(fls, .qa_fastq_lane, type = type, ..., verbose = verbose)
>    22: bplapply(fls
> Timing stopped at: 0.013 0 0.013
>> bplasterror()
> 0 / 1 partial results stored. First 1 error messages:
> [1]: Error: UserArgumentMismatch
>    'pattern' must be 'character(0) or character(1)'
>
> I don't understand why the same command works with sample=TRUE, but doesn't with
> sample=FALSE.

That was a bug introduced fairly recently; it's been corrected in the 'devel' version 1.23.17, which should be available, all being well, about this time tomorrow. Instructions for using the devel version are at

    http://bioconductor.org/developers/how-to/useDevel/

I don't expect reading the full file to be anywhere near competitive with fastqc, but since we're interested in summary statistics it doesn't have to be!

For timing I see

$ time ./fastqc ~/benchmark/E-MTAB-1147/fastq/ERR127302_1.fastq.gz
[...]
real    2m37.193s
user    2m36.463s
sys    0m2.186s

versus

$ cat qa-test.R
suppressPackageStartupMessages(library(ShortRead))
fl <- "~/benchmark/E-MTAB-1147/fastq/ERR127302_1.fastq.gz"
rpt <- report(qa(fl))
~/benchmark/ShortRead-qa$ time R --silent --vanilla -f qa-test.R
> suppressPackageStartupMessages(library(ShortRead))
> fl <- "~/benchmark/E-MTAB-1147/fastq/ERR127302_1.fastq.gz"
> rpt <- report(qa(fl))
>

real    1m47.893s
user    1m44.692s
sys    0m3.068s

This uses the default sampling of 1M reads, using about 4G of RAM (which seems a little excessive...). ShortRead will run in parallel if fed several files; see ?qa and the BPPARAM argument to control how parallel evaluation works. In particular, you might want to limit the number of workers (e.g., using options(mc.cores=2)) so that you don't consume all memory, as that will cause swapping and serious performance degradation.
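
As a rough way to pick the number of workers from a memory budget, the arithmetic can be sketched as below. The 4 GB per worker figure is just the single observation quoted above and will vary with the data; workers_for_budget() is a hypothetical helper, and the commented BiocParallel call is not run here.

```r
# Hypothetical helper: how many workers fit in a memory budget, assuming
# each one needs roughly `per_worker_gb` (about 4 GB in the run above).
workers_for_budget <- function(budget_gb, per_worker_gb = 4) {
    max(1L, as.integer(floor(budget_gb / per_worker_gb)))
}

# With 10 GB to spare, floor(10 / 4) = 2 workers.
n_workers <- workers_for_budget(budget_gb = 10)
n_workers

# Cap the number of cores used by multicore back ends for this session.
options(mc.cores = n_workers)
getOption("mc.cores")

# With BiocParallel and ShortRead installed, the limit could instead be
# passed explicitly through BPPARAM:
# library(BiocParallel)
# qas <- qa(files, type = "fastq", BPPARAM = MulticoreParam(workers = n_workers))
```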
