Question

Error when trying to preprocessReads() : "Error: logical subscript contains NAs"

0

Entering edit mode

Deleted ▴ 70

@jsalsager-14698

Last seen 3 months ago

United States

> library(systemPipeR)

Loading required package: Rsamtools

Loading required package: GenomeInfoDb

Loading required package: BiocGenerics

Loading required package: parallel


Attaching package: ‘BiocGenerics’


The following objects are masked from ‘package:parallel’:


    clusterApply, clusterApplyLB, clusterCall, clusterEvalQ,

    clusterExport, clusterMap, parApply, parCapply, parLapply,

    parLapplyLB, parRapply, parSapply, parSapplyLB


The following objects are masked from ‘package:stats’:


    IQR, mad, sd, var, xtabs


The following objects are masked from ‘package:base’:


    anyDuplicated, append, as.data.frame, basename, cbind, colMeans,

    colnames, colSums, dirname, do.call, duplicated, eval, evalq,

    Filter, Find, get, grep, grepl, intersect, is.unsorted, lapply,

    lengths, Map, mapply, match, mget, order, paste, pmax, pmax.int,

    pmin, pmin.int, Position, rank, rbind, Reduce, rowMeans, rownames,

    rowSums, sapply, setdiff, sort, table, tapply, union, unique,

    unsplit, which, which.max, which.min


Loading required package: S4Vectors

Loading required package: stats4


Attaching package: ‘S4Vectors’


The following object is masked from ‘package:base’:


    expand.grid


Loading required package: IRanges

Loading required package: GenomicRanges

Loading required package: Biostrings

Loading required package: XVector


Attaching package: ‘Biostrings’


The following object is masked from ‘package:base’:


    strsplit


Loading required package: ShortRead

Loading required package: BiocParallel

Loading required package: GenomicAlignments

Loading required package: SummarizedExperiment

Loading required package: Biobase

Welcome to Bioconductor


    Vignettes contain introductory material; view with

    'browseVignettes()'. To cite Bioconductor, see

    'citation("Biobase")', and for packages 'citation("pkgname")'.



Attaching package: ‘Biobase’


The following object is masked from ‘package:BiocGenerics’:


    dims


Loading required package: DelayedArray

Loading required package: matrixStats


Attaching package: ‘matrixStats’


The following objects are masked from ‘package:Biobase’:


    anyMissing, rowMedians



Attaching package: ‘DelayedArray’


The following objects are masked from ‘package:matrixStats’:


    colMaxs, colMins, colRanges, rowMaxs, rowMins, rowRanges


The following object is masked from ‘package:Biostrings’:


    type


The following objects are masked from ‘package:base’:


    aperm, apply



Warning messages:

1: no function found corresponding to methods exports from ‘XVector’ for: ‘concatenateObjects’ 

2: replacing previous import ‘BiocGenerics::dims’ by ‘Biobase::dims’ when loading ‘SummarizedExperiment’ 

3: replacing previous import ‘Biobase::dims’ by ‘DelayedArray::dims’ when loading ‘SummarizedExperiment’ 

4: replacing previous import ‘BiocGenerics::dims’ by ‘Biobase::dims’ when loading ‘ShortRead’ 

5: replacing previous import ‘BiocGenerics::dims’ by ‘Biobase::dims’ when loading ‘AnnotationDbi’ 

6: replacing previous import ‘Biobase::dims’ by ‘BiocGenerics::dims’ when loading ‘AnnotationForge’ 

> 

> targets <- read.delim("targetsPE_chip.txt")

> targets[1:4,-c(5,6)]

           FileName1          FileName2 SampleName Factor

1  BN.H4.1_1.fastq.gz  BN.H4.1_2.fastq.gz     BN.H4.1    BNH4

2  BN.H4.2_1.fastq.gz  BN.H4.2_2.fastq.gz     BN.H4.2    BNH4

3 BN.INPUT_1.fastq.gz BN.INPUT_2.fastq.gz        BN.I     BNI

4 AN.H4.1_1.fastq.gz AN.H4.1_2.fastq.gz    AN.H4.1   ANH4

> args <- systemArgs(sysma="param/trimPE.param", mytargets="targetsPE_chip.txt")

> filterFct <- function(fq, cutoff=20, Nexceptions=0) {

+      qcount <- rowSums(as(quality(fq), "matrix") <= cutoff)

+      fq[qcount <= Nexceptions] # Retains reads where Phred scores are >= cutoff with N exceptions

+  }

> preprocessReads(args=args, Fct="filterFct(fq, cutoff=20, Nexceptions=0)", batchsize=100000)

Error: logical subscript contains NAs

> writeTargetsout(x=args, file="targets_chip_trim.txt", overwrite=TRUE)

Written content of 'targetsout(x)' to file: targets_chip_trim.txt 

> q()

[user@login chipseq]$ cat param/trimPE.param

## Parameter file for read preprocessing

## This is a dummy file to support within the systemPipeR workflow 

## R-based functions that read from input files and write to output files 

## Values in <...> reference column titles in targets file

## PairSet 'type' row specifies whether sofware is commandline (system=TRUE) or R (system=FALSE). If line is ommited 'commandline' will be used

PairSet Name Value

type system FALSE

modules

software NA trim

cores -p 40

other NA

outfile1 -o <FileName1>

outfile1 path ./results/

outfile1 remove .gz

outfile1 append _trim.gz

outfile1 outextension _trim.gz

outfile2 -o <FileName2>

outfile2 path ./results/

outfile2 remove .gz

outfile2 append _trim.gz

outfile2 outextension _trim.gz

reference NA NA

infile1 NA <FileName1>

infile1 path NA

infile2 NA <FileName2>

infile2 path NA

[user@login chipseq]$ cat targetsPE_chip.txt

FileName1 FileName2 SampleName Factor

BN.H4.1_1.fastq.gz BN.H4.1_2.fastq.gz BN.H4.1 BH4

BN.H4.2_1.fastq.gz BN.H4.2_2.fastq.gz BN.H4.2 BH4

BN.INPUT_1.fastq.gz BN.INPUT_2.fastq.gz BN.I BI

AN.H4.1_1.fastq.gz AN.H4.1_2.fastq.gz AN.H4.1 ANH4

AN.H4.2_1.fastq.gz AN.H4.2_2.fastq.gz AN.H4.2 ANH4

AN.INPUT_1.fastq.gz AN.INPUT_2.fastq.gz AN.I ANI

I've tried everything that I can think of but I can't figure out why I keep getting this error.

Thanks a million

systempiper • 3.6k views

ADD COMMENT • link 5.4 years ago Deleted ▴ 70

2

Entering edit mode

Hi @gocougs,

Could you please provide the output of sessionInfo() executed in an R session right after the error occurred? In addition, did you try the same command line with the demo data? If not, could you please try to see if the same error occurs.

ADD REPLY • link 5.4 years ago dcassol ▴ 100

0

Entering edit mode

Hi @dcassol

Thanks so much for getting back to me! I did try the demo data and it worked fine. Also, when I run the alignment with bowtie2 using the systemPipeR package using untrimmed fastq.gz files, it worked fine. The issue is only with preProcessReads(). I've tried everything I can think of so I really appreciate your help!

Simon

> sessionInfo()
R version 3.5.0 (2018-04-23)
Platform: x86_64-pc-linux-gnu (64-bit)
Running under: CentOS Linux 7 (Core)

Matrix products: default
BLAS: /opt/apps/r/3.5.0/lib64/R/lib/libRblas.so
LAPACK: /opt/apps/r/3.5.0/lib64/R/lib/libRlapack.so

locale:
[1] LC_CTYPE=en_US.UTF-8       LC_NUMERIC=C             
[3] LC_TIME=en_US.UTF-8        LC_COLLATE=en_US.UTF-8   
[5] LC_MONETARY=en_US.UTF-8    LC_MESSAGES=en_US.UTF-8  
[7] LC_PAPER=en_US.UTF-8       LC_NAME=C                
[9] LC_ADDRESS=C               LC_TELEPHONE=C           
[11] LC_MEASUREMENT=en_US.UTF-8 LC_IDENTIFICATION=C       

attached base packages:
[1] stats4    parallel  stats     graphics  grDevices utils     datasets
[8] methods   base     

other attached packages:
[1] systemPipeR_1.16.0          ShortRead_1.40.0          
[3] GenomicAlignments_1.18.0    SummarizedExperiment_1.10.1
[5] DelayedArray_0.8.0          matrixStats_0.54.0        
[7] Biobase_2.40.0              BiocParallel_1.14.1       
[9] Rsamtools_1.32.3            Biostrings_2.48.0         
[11] XVector_0.20.0              GenomicRanges_1.34.0      
[13] GenomeInfoDb_1.16.0         IRanges_2.16.0            
[15] S4Vectors_0.20.1            BiocGenerics_0.28.0        

loaded via a namespace (and not attached):
[1] Category_2.48.0        bitops_1.0-6           bit64_0.9-7          
[4] RColorBrewer_1.1-2     progress_1.2.0         httr_1.3.1           
[7] Rgraphviz_2.26.0       tools_3.5.0            backports_1.1.2      
[10] R6_2.2.2               DBI_1.0.0              lazyeval_0.2.1       
[13] colorspace_1.3-2       tidyselect_0.2.5       prettyunits_1.0.2    
[16] bit_1.1-14             compiler_3.5.0         sendmailR_1.2-1      
[19] graph_1.60.0           rtracklayer_1.40.6     scales_0.5.0         
[22] checkmate_1.8.5        BatchJobs_1.7          genefilter_1.62.0    
[25] RBGL_1.58.1            stringr_1.3.1          digest_0.6.15        
[28] AnnotationForge_1.24.0 base64enc_0.1-3        pkgconfig_2.0.1      
[31] limma_3.36.2           rlang_0.3.0.1          RSQLite_2.1.1        
[34] BBmisc_1.11            bindr_0.1.1            GOstats_2.48.0       
[37] hwriter_1.3.2          dplyr_0.7.8            RCurl_1.95-4.10      
[40] magrittr_1.5           GO.db_3.7.0            GenomeInfoDbData_1.1.0
[43] Matrix_1.2-14          Rcpp_1.0.0             munsell_0.4.3        
[46] stringi_1.2.2          edgeR_3.24.0           zlibbioc_1.26.0      
[49] plyr_1.8.4             grid_3.5.0             blob_1.1.1           
[52] crayon_1.3.4           lattice_0.20-35        splines_3.5.0        
[55] GenomicFeatures_1.34.1 annotate_1.58.0        hms_0.4.2            
[58] locfit_1.5-9.1         pillar_1.2.3           rjson_0.2.20         
[61] biomaRt_2.38.0         XML_3.98-1.11          glue_1.3.0           
[64] latticeExtra_0.6-28    data.table_1.11.8      gtable_0.2.0         
[67] purrr_0.2.5            assertthat_0.2.0       ggplot2_3.1.0        
[70] xtable_1.8-2           survival_2.42-3        tibble_1.4.2         
[73] pheatmap_1.0.10        AnnotationDbi_1.44.0   memoise_1.1.0        
[76] bindrcpp_0.2.2         brew_1.0-6             GSEABase_1.44.0

ADD REPLY • link 5.4 years ago Deleted ▴ 70

2

Entering edit mode

It does not seem to be something with the package and/or the code since it works with the demo data.
The raw fastq data are being aligned with Tophat2 probably because somehow the software can handle/ignore some issue in the sequences, which does not occur in the preprocessReads() function.
Could you please try to execute the samples individually to find out if this problem occurs in all samples or in a specific sample? In addition, you could split the fastq files with the issue and run the sections individually to try to locate the NAs.

ADD REPLY • link 5.4 years ago dcassol ▴ 100

0

Entering edit mode

Thanks for the feedback.

I tried my fastq.gz files separately and still ran into the issue. I took one set of reads and unzipped them, then removed the top 12 lines. When I ran them I saw the issue again. Is there anything obvious you see wrong with the following fastq files?

I really appreciate your help.

1_1.TEST.fastq

@D00472:193:CC9HHACXX:6:2210:1104:1940 1:N:0:TGACCA
AATGGNATAATGGAATGCACTCGAATGNNATCATCGAATGGACTNGAATGNNNNNNNNNNNGAGTGGAATAAANNGGAATCATCGAATGNACTTGNANGG
+
BBBFF#0BFF<FFFIIIFIFIFFBBFI##0<BFFIIIIFFIFIF#0BFFF###########00<BBBBBFFFF##077BBBFBBBBFBB#00<BB#0#00
@D00472:193:CC9HHACXX:6:2210:1118:1953 1:N:0:TGACCA
GTGCTATCTTAGGCAGATTATTTTACCNTTCAGAATTAAGTTTTCTCATTTNNNNNNNNNNGGAATTGAAGTTAGATCAGTAGTTACCAGGTTTTTTNGG
+
<BBBBBFFFFFFFFFIFIIFFFIIIII#0BFFFFIFFFFFBFBFIIFBFFF##########007<FFFIFB<BFFFFFFFFFFBFBFFBB<7<BBBB#07
@D00472:193:CC9HHACXX:6:2210:1244:1968 1:N:0:TGACCA
AAACATTACATCTAATATCACAGAGAGTGTTCTCCCTTTGATATTTCTCATACTCTCATAGGGAGATATTGCTTTTAATGTCGCAGTGGGTGTACACCAT
+
BBBFFFFFFFFFFFFFIIIIIFIFIFFFFFIIIIIIIIIIIIIIIIIIIIIIIIIIIFIFFIIFI<FFIIIIIIIFBFFIFFFFFFFFFB<BBBBBFBFF

1_2.TEST.fastq

@D00472:193:CC9HHACXX:6:2210:1104:1940 2:N:0:TGACCA
CATTCAATTCCATTTGATGATTCCATTTTATTCCATTCAATTATGATTCCATTCGAATCCATTCGATGATTCCATTCAATTCTATTTGATGAAGATTCC
+
BBBFFFFFFFFFFIIFFIIIIIFFIIIIIIIIIIIFFIFIIIIIIIIIIBFFIFFFIIIIIIIIIIIIBIIIIIIIIIIIIIIIIIIFFFFBFFIIIIB
@D00472:193:CC9HHACXX:6:2210:1118:1953 2:N:0:TGACCA
ACTTAGCCAATCCCTCCCACATTTGGCAAAATTAAGTTTCAAGCCTTCTGTTATATATCTTCTTTTGTTCTTTGAAAGTTTAGTAGTCTACAAAATCCA
+
BBBFFFFF<FFFFFFFBFFIIIIIIIFBFFFFIFBFFFFFIFBFFFIFB<BBBBFFFIFFIBFFIFFFIBB<FBBFFFFFFFFB<BFFFFFFBBB<<7<
@D00472:193:CC9HHACXX:6:2210:1244:1968 2:N:0:TGACCA
CTTATTAGGAGTAACATCCTTCTAGGATATGAAGAATATCACAAGGCGTACACACACTGATATTCGGAGAGATATCTCCCTAGGATATAAGGTGTATCAC
+
BBBFFFFFFFFFBFFFIIIIIIIIIIFIFFIFFFFIIIIIIIIIIIIIIIIIIIIIIIIIIIIIIIIIFFFFFFFFFFFFFBFFBBBBFBBFBBBBBFFF

ADD REPLY • link 5.4 years ago Deleted ▴ 70

2

Entering edit mode

preprocessReads() uses internally the FastqStreamer function from the ShortRead package to stream through large FASTQ files in a memory-efficient manner. The argument batchsize is the number of reads to process in each iteration by the FastqStreamer function. How many reads do your files have? I suggest reducing the number of batchsize. In the example you just sent, if you replace the size of batchsize = 2, it works.

preprocessReads(args=args, Fct="filterFct(fq, cutoff=20, Nexceptions=0)", batchsize=2)

Please let me know if it worked for you.

ADD REPLY • link 5.4 years ago dcassol ▴ 100

1

Entering edit mode

Thanks again for your help! So setting the batchsize=2 worked for the samples above.

Interestingly enough, when I run my full fastq files with batchsize= anything but 1, it seems to fail. Each file has 50+ million reads. Think this is an issue with the paired end reads?

One interesting note, I tried subsetting my fastq files by 1000 lines, into TEST_1 and TEST_2. I then gzipped these files and ran the code above with batchsize= 2.

> preprocessReads(args=args, Fct="filterFct(fq, cutoff=20, Nexceptions=0)", batchsize=2)
0 processed reads written to file: ~/results/TEST_1.fastq_trim.gz 
0 processed reads written to file: ~/results/TEST_2.fastq_trim.gz 
1 processed reads written to file: ~/results/TEST_1.fastq_trim.gz 
1 processed reads written to file: ~/results/TEST_2.fastq_trim.gz 
3 processed reads written to file: ~/results/TEST_1.fastq_trim.gz 
3 processed reads written to file: ~/results/TEST_2.fastq_trim.gz 
Error: logical subscript contains NAs

Which makes me think the problem has to be between reads 4 & 5.

Since reads comprise 4 lines of fastq files, my thinking is that an example of the problem has to be between lines 13-20. So I wrote this:

[user@login raw]$ for i in TEST*
> do 
> echo $i
> head -20 $i | tail -8 
> done
TEST_1.fastq
@D00472:193:CC9HHACXX:6:2210:1334:1917 1:N:0:TGACCA
NATCGTCACAAAGTGTTTCCTGGGAATGCTACTGTCTAGTTTTTATGGGCAGTTATATCCTCTGCTGCCATAGGCCTCAAAGCGGTCCAAATCTCCCCTT
+
#0<FFFFFFFFFFFFBFFIIIIIIFFIIIIIFIIFFIIFBFFFFIFBFFIFIIIFIFIFIIFIFFFFFIIIFIIFFFFFFFFFFFBBFFFFFBFFFFFBB
@D00472:193:CC9HHACXX:6:2210:1259:1935 1:N:0:TGACCA
AAGAAAATAAATCACTATGGTTTTAAGCCATCAAATTGAAGGCAATTTGTTTTGGCAGATTCACGGTGGAATCTCGGGAGTGGTCTGAAAAATAAGTTTC
+
BBBFFFFFFFFFFIIIIIIIIIIIIIIIIIIIFIIIIIIIIIIIIIIIIIIIIIIIIFFIIIIIIIBFIFFFFFFFFFBF7<BBBBFFFFBFBBFFBFFF
TEST_2.fastq
@D00472:193:CC9HHACXX:6:2210:1334:1917 2:N:0:TGACCA
CTAAACGGAAGCATGCTCAGGAGCTTCTTTGTGATGTTTGCATTCAACTCACAGAGTTGTACTTTCCTTTTGATAGAGCAGCTTTGAAACCCTCTCTTTC
+
BBBFFFFFFFFFBFFFFIFIIFIIIIIIIIIFFFFFFFFIIBIIFFIIIIFIIIFFFFFFFBFFFF<BFFFBFBBFBFBBFBBFFFFFFBBBBBBBFBBB
@D00472:193:CC9HHACXX:6:2210:1259:1935 2:N:0:TGACCA
CGCATCCTCTTCTGTGCCACTTTCACAGAGAGAAACGGGCCCAGGCGTGCCTCATTTTCCCCTTGCTATCATCTTATTCTGGCCTTAAAGATCTGAATTT
+
BBBFFFFFFFFFFIFFFIIIIIIIIIIIIIFIIFFIIFIIIIIIIIIFFIFFIIIIIIFFFFFFFFFBFFBFFFFFFFBFFBBFFFFFFBBFFFFFFBFF

ADD REPLY • link 5.4 years ago Deleted ▴ 70

2

Entering edit mode

Hi @gocougs,

Could you please share (e.g. via Dropbox or Google Drive) an example with just one set of your paired end fastq files where each contains the first 100,000 reads in order to try reproduce the error?

Thank you very much for all the feedback.

Daniela

ADD REPLY • link 5.4 years ago dcassol ▴ 100

1

Entering edit mode

Thanks so much for your help!

Here is a link I created for drop box. Paired end reads TEST_1.fastq.gz TEST_2.fastq.gz, each 100,000 reads.

Simon

https://www.dropbox.com/sh/s5dw1cyfmikret8/AACWcOV2sX9WSsVMU3JIOMCxa?dl=0

ADD REPLY • link 5.4 years ago Deleted ▴ 70

2

Entering edit mode

Hi Simon,

In the example code/demo data, the reads in each of the fastq files should have the same length. Since the reads in your files have a variable length from 35-100bp (not sure if the reads have been gone through some tail trimming already), I believe that is the issue.

Could you please try this:

filterFct <- function(fq, cutoff=20, Nexceptions=0) {
    qcount <- rowSums(as(quality(fq), "matrix") <= cutoff, na.rm=TRUE)
    fq[qcount <= Nexceptions] 
}
preprocessReads(args=args, Fct="filterFct(fq, cutoff=20, Nexceptions=0)", batchsize=100000)

Please let me know if this works! :)

ADD REPLY • link 5.4 years ago dcassol ▴ 100

2

Entering edit mode

Thanks so much Daniela! That did the trick!

You're the best!

:)

ADD REPLY • link 5.4 years ago Deleted ▴ 70