> library(systemPipeR)
Loading required package: Rsamtools
Loading required package: GenomeInfoDb
Loading required package: BiocGenerics
Loading required package: parallel

Attaching package: ‘BiocGenerics’

The following objects are masked from ‘package:parallel’:

    clusterApply, clusterApplyLB, clusterCall, clusterEvalQ, clusterExport,
    clusterMap, parApply, parCapply, parLapply, parLapplyLB, parRapply,
    parSapply, parSapplyLB

The following objects are masked from ‘package:stats’:

    IQR, mad, sd, var, xtabs

The following objects are masked from ‘package:base’:

    anyDuplicated, append, as.data.frame, basename, cbind, colMeans,
    colnames, colSums, dirname, do.call, duplicated, eval, evalq, Filter,
    Find, get, grep, grepl, intersect, is.unsorted, lapply, lengths, Map,
    mapply, match, mget, order, paste, pmax, pmax.int, pmin, pmin.int,
    Position, rank, rbind, Reduce, rowMeans, rownames, rowSums, sapply,
    setdiff, sort, table, tapply, union, unique, unsplit, which, which.max,
    which.min

Loading required package: S4Vectors
Loading required package: stats4

Attaching package: ‘S4Vectors’

The following object is masked from ‘package:base’:

    expand.grid

Loading required package: IRanges
Loading required package: GenomicRanges
Loading required package: Biostrings
Loading required package: XVector

Attaching package: ‘Biostrings’

The following object is masked from ‘package:base’:

    strsplit

Loading required package: ShortRead
Loading required package: BiocParallel
Loading required package: GenomicAlignments
Loading required package: SummarizedExperiment
Loading required package: Biobase
Welcome to Bioconductor

    Vignettes contain introductory material; view with 'browseVignettes()'.
    To cite Bioconductor, see 'citation("Biobase")', and for packages
    'citation("pkgname")'.

Attaching package: ‘Biobase’

The following object is masked from ‘package:BiocGenerics’:

    dims

Loading required package: DelayedArray
Loading required package: matrixStats

Attaching package: ‘matrixStats’

The following objects are masked from ‘package:Biobase’:

    anyMissing, rowMedians

Attaching package: ‘DelayedArray’

The following objects are masked from ‘package:matrixStats’:

    colMaxs, colMins, colRanges, rowMaxs, rowMins, rowRanges

The following object is masked from ‘package:Biostrings’:

    type

The following objects are masked from ‘package:base’:

    aperm, apply

Warning messages:
1: no function found corresponding to methods exports from ‘XVector’ for: ‘concatenateObjects’
2: replacing previous import ‘BiocGenerics::dims’ by ‘Biobase::dims’ when loading ‘SummarizedExperiment’
3: replacing previous import ‘Biobase::dims’ by ‘DelayedArray::dims’ when loading ‘SummarizedExperiment’
4: replacing previous import ‘BiocGenerics::dims’ by ‘Biobase::dims’ when loading ‘ShortRead’
5: replacing previous import ‘BiocGenerics::dims’ by ‘Biobase::dims’ when loading ‘AnnotationDbi’
6: replacing previous import ‘Biobase::dims’ by ‘BiocGenerics::dims’ when loading ‘AnnotationForge’

> targets <- read.delim("targetsPE_chip.txt")
> targets[1:4, -c(5,6)]
            FileName1           FileName2 SampleName Factor
1  BN.H4.1_1.fastq.gz  BN.H4.1_2.fastq.gz    BN.H4.1   BNH4
2  BN.H4.2_1.fastq.gz  BN.H4.2_2.fastq.gz    BN.H4.2   BNH4
3 BN.INPUT_1.fastq.gz BN.INPUT_2.fastq.gz       BN.I    BNI
4  AN.H4.1_1.fastq.gz  AN.H4.1_2.fastq.gz    AN.H4.1   ANH4
> args <- systemArgs(sysma="param/trimPE.param", mytargets="targetsPE_chip.txt")
> filterFct <- function(fq, cutoff=20, Nexceptions=0) {
+     qcount <- rowSums(as(quality(fq), "matrix") <= cutoff)
+     fq[qcount <= Nexceptions] # Retains reads where Phred scores are >= cutoff with N exceptions
+ }
> preprocessReads(args=args, Fct="filterFct(fq, cutoff=20, Nexceptions=0)", batchsize=100000)
Error: logical subscript contains NAs
> writeTargetsout(x=args, file="targets_chip_trim.txt", overwrite=TRUE)
Written content of 'targetsout(x)' to file: targets_chip_trim.txt
> q()

[user@login chipseq]$ cat param/trimPE.param
## Parameter file for read preprocessing
## This is a dummy file to support within the systemPipeR workflow
## R-based functions that read from input files and write to output files
## Values in <...> reference column titles in targets file
## PairSet 'type' row specifies whether software is commandline (system=TRUE) or R (system=FALSE). If line is omitted 'commandline' will be used
PairSet      Name           Value
type         system         FALSE
modules
software     NA             trim
cores        -p             40
other        NA
outfile1     -o             <FileName1>
outfile1     path           ./results/
outfile1     remove         .gz
outfile1     append         _trim.gz
outfile1     outextension   _trim.gz
outfile2     -o             <FileName2>
outfile2     path           ./results/
outfile2     remove         .gz
outfile2     append         _trim.gz
outfile2     outextension   _trim.gz
reference    NA             NA
infile1      NA             <FileName1>
infile1      path           NA
infile2      NA             <FileName2>
infile2      path           NA

[user@login chipseq]$ cat targetsPE_chip.txt
FileName1            FileName2            SampleName  Factor
BN.H4.1_1.fastq.gz   BN.H4.1_2.fastq.gz   BN.H4.1     BH4
BN.H4.2_1.fastq.gz   BN.H4.2_2.fastq.gz   BN.H4.2     BH4
BN.INPUT_1.fastq.gz  BN.INPUT_2.fastq.gz  BN.I        BI
AN.H4.1_1.fastq.gz   AN.H4.1_2.fastq.gz   AN.H4.1     ANH4
AN.H4.2_1.fastq.gz   AN.H4.2_2.fastq.gz   AN.H4.2     ANH4
AN.INPUT_1.fastq.gz  AN.INPUT_2.fastq.gz  AN.I        ANI
I've tried everything that I can think of but I can't figure out why I keep getting this error.
Thanks a million
Hi @gocougs,
Could you please provide the output of sessionInfo(), run in the R session right after the error occurs? In addition, did you try the same command with the demo data? If not, could you please try it and see whether the same error occurs?
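For example, something like this (just a sketch, assuming the systemPipeRdata package is installed; the param/targets file names are the ones shipped with the demo workflow and may differ):

library(systemPipeRdata)
genWorkenvir(workflow = "chipseq")   # unpacks the demo ChIP-Seq workflow into ./chipseq
setwd("chipseq")

library(systemPipeR)
## filterFct as defined in your script
filterFct <- function(fq, cutoff = 20, Nexceptions = 0) {
    qcount <- rowSums(as(quality(fq), "matrix") <= cutoff)
    fq[qcount <= Nexceptions]
}
## file names below come from the demo template and may differ on your system
args <- systemArgs(sysma = "param/trim.param", mytargets = "targets_chip.txt")
preprocessReads(args = args, Fct = "filterFct(fq, cutoff=20, Nexceptions=0)",
                batchsize = 100000)
sessionInfo()   # please paste this output into your reply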
Hi @dcassol
Thanks so much for getting back to me! I did try the demo data and it worked fine. Also, when I run the alignment with Bowtie2 through systemPipeR on the untrimmed fastq.gz files, it works fine. The issue is only with preprocessReads(). I've tried everything I can think of, so I really appreciate your help!
Simon
It does not seem to be an issue with the package or your code, since it works with the demo data.
The raw fastq data are aligning fine with Bowtie2, probably because the aligner can handle or ignore some issue in the sequences that the preprocessReads() function cannot.
Could you please run the samples individually to find out whether this problem occurs in all samples or only in a specific one? In addition, you could split the problematic fastq files into smaller pieces and run them one by one to try to locate the NAs; see the sketch below.
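For example, something along these lines (just a sketch; the file name is a placeholder and the chunk size of 10,000 reads is arbitrary):

library(ShortRead)
## Stream one fastq file in small chunks and report which chunk produces NAs
## in the quality matrix
fs <- FastqStreamer("BN.H4.1_1.fastq.gz", n = 10000)
chunk <- 0
while (length(fq <- yield(fs))) {
    chunk <- chunk + 1
    qm <- as(quality(fq), "matrix")   # NA-padded when read lengths differ within a chunk
    if (anyNA(qm))
        message("NAs in chunk ", chunk, ": reads ",
                (chunk - 1) * 10000 + 1, "-", (chunk - 1) * 10000 + length(fq))
}
close(fs)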
Thanks for the feedback.
I tried my fastq.gz files separately and still ran into the issue. I took one set of reads and unzipped them, then removed the top 12 lines. When I ran them I saw the issue again. Is there anything obvious you see wrong with the following fastq files?
I really appreciate your help.
1_1.TEST.fastq
1_2.TEST.fastq
preprocessReads() internally uses the FastqStreamer function from the ShortRead package to stream through large FASTQ files in a memory-efficient manner. The batchsize argument is the number of reads processed in each iteration by FastqStreamer. How many reads do your files have? I suggest reducing batchsize. In the example you just sent, it works if you set batchsize=2. Please let me know if that works for you.
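For example (a rough sketch; for small test files, reading them in with readFastq() and taking length() gives a quick read count, and args is the systemArgs object from your script):

library(ShortRead)
## Read counts of the small test files
length(readFastq("1_1.TEST.fastq"))
length(readFastq("1_2.TEST.fastq"))

## Re-run the preprocessing with fewer reads per iteration
preprocessReads(args = args, Fct = "filterFct(fq, cutoff=20, Nexceptions=0)",
                batchsize = 2)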
Thanks again for your help! Setting batchsize=2 worked for the samples above.
Interestingly, when I run my full fastq files with batchsize set to anything but 1, it fails. Each file has 50+ million reads. Do you think this is an issue with the paired-end reads?
One interesting note: I tried subsetting my fastq files to the first 1000 lines each, into TEST_1 and TEST_2. I then gzipped these files and ran the code above with batchsize=2.
That makes me think the problem has to be between reads 4 and 5.
Since each read takes up 4 lines of a fastq file, my thinking is that an example of the problem has to be somewhere in lines 13-20. So I wrote this:
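It's roughly the following (a sketch of what I mean; TEST_1.fastq.gz is the 1000-line subset from above):

library(ShortRead)
## Pull out reads 4 and 5 (fastq lines 13-20) of the subset and inspect them
fq <- readFastq("TEST_1.fastq.gz")
fq[4:5]                             # the two suspect records
width(fq[4:5])                      # read lengths -- do they differ?
as(quality(fq[4:5]), "matrix")      # per-base Phred scores; a shorter read gets NA padding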
Hi @gocougs,
Could you please share (e.g. via Dropbox or Google Drive) an example with just one set of your paired-end fastq files, where each file contains the first 100,000 reads, so I can try to reproduce the error?
Thank you very much for all the feedback.
Daniela
Thanks so much for your help!
Here is a link I created for drop box. Paired end reads TEST_1.fastq.gz TEST_2.fastq.gz, each 100,000 reads.
Simon
https://www.dropbox.com/sh/s5dw1cyfmikret8/AACWcOV2sX9WSsVMU3JIOMCxa?dl=0
Hi Simon,
In the example code/demo data, the reads in each of the fastq files all have the same length. Since the reads in your files have variable lengths, ranging from 35 to 100 bp (I am not sure whether the reads have already gone through some tail trimming), I believe that is the issue.
Could you please try this:
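Something like the following (a sketch of the idea; na.rm=TRUE makes rowSums() ignore the NA padding that as(quality(fq), "matrix") introduces for reads shorter than the longest read in the batch):

## Make the filter tolerant of variable-length reads: drop the NA positions
## instead of letting them propagate into the logical subscript
filterFct <- function(fq, cutoff = 20, Nexceptions = 0) {
    qcount <- rowSums(as(quality(fq), "matrix") <= cutoff, na.rm = TRUE)
    fq[qcount <= Nexceptions]   # keep reads with at most Nexceptions positions at or below the cutoff
}

preprocessReads(args = args, Fct = "filterFct(fq, cutoff=20, Nexceptions=0)",
                batchsize = 100000)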
Please let me know if this works! :)
Thanks so much Daniela! That did the trick!
You're the best!
:)