I am trying to map RNA-seq reads in a fastq file to the Octopus bimaculoides genome (Obimaculoides_280.fa from Metazome) using qalign in the package QuasR. I am running this on a Dell Optiplex 7010 with 32GB of RAM. I am attempting to multithread the command using the following:
I ran this with one single, small fastq file (containing only 1000 reads), but I get the same behavior with any size of fastq file when mapping to this genome. From this I receive the following output and error with traceback:
alignment files missing - need to: create 1 genomic alignment(s) will start in ..9s..8s..7s..6s..5s..4s..3s..2s..1s Testing the compute nodes...OK Loading QuasR on the compute nodes...OK Available cores: nodeNames onthank 8 Performing genomic alignments for 1 samples. See progress in the log file: /home/onthank/Documents/Octopus_Transcriptome/RNA-Seq_results/octo1_split/QuasR_log_75d0f207738.txt [samopen] SAM header is present: 379696 sequences. Error in checkForRemoteErrors(val) : one node produced an error: Error on onthank processing sample /home/onthank/Documents/Octopus_Transcriptome/RNA-Seq_results/octo1_split/xab1000.fastq : failed to open SAM/BAM file file: '/tmp/RtmpURHs4A/samToBam_2e6c1fe22711/Scaffold1006.sam' > traceback() 9: stop("one node produced an error: ", firstmsg, domain = NA) 8: checkForRemoteErrors(val) 7: staticClusterApply(cl, fun, length(x), argfun) 6: clusterApply(cl, x = splitList(X, length(cl)), fun = lapply, fun, ...) 5: do.call(c, clusterApply(cl, x = splitList(X, length(cl)), fun = lapply, fun, ...), quote = TRUE) 4: parLapply(clObjNR, paramsListGenomic, createGenomicAlignmentsController) 3: parLapply(clObjNR, paramsListGenomic, createGenomicAlignmentsController) 2: createGenomicAlignments(proj, clObj) 1: qAlign("targets.txt", genome = "Obimaculoides_280.fa", maxHits = 1, splicedAlignment = FALSE, clObj = makeCluster(8))
The contents of the QuasR Log are:
 "Executing bowtie on onthank using 8 cores. Parameters:"  "'/home/onthank/Documents/Octopus_Transcriptome/RNA-Seq_results/octo1_split/Obimaculoides_280.fa.Rbowtie/bowtieIndex' '/home/onthank/Documents/Octopus_Transcriptome/RNA-Seq_results/octo1_split/xab1000.fastq' -m 1 --best --strata --phred33-quals -S -p 8 '/tmp/RtmpURHs4A/xab1000.fastq2e6cb8dcf29.sam'"  "Converting sam file to sorted bam file on onthank : /tmp/RtmpURHs4A/xab1000.fastq2e6cb8dcf29.sam"  "Error on onthank processing sample /home/onthank/Documents/Octopus_Transcriptome/RNA-Seq_results/octo1_split/xab1000.fastq : failed to open SAM/BAM file\n file: '/tmp/RtmpURHs4A/samToBam_2e6c1fe22711/Scaffold1006.sam'"
When the process gets to converting the first sam file to bam files, I notice that my RAM starts filling up and I get the error when it finally fills up all 32GB. The R temp folder invariably has 1005 files in the samToBam folder and the contents are 11.0GB. I have tried this on a couple different computers, one with only 24GB of RAM, and the R temp folder still has 1005 files in the samToBam folder and its contents are still 11.0GB. I have tried trimming down my fastq files, (my original six files had 42-69M reads each) but I get this error even if I run a single fastq with only 16 reads. If I set
clObj=makeCluster(1), everything works just fine, it just takes about a month to run on my whole dataset.
Here is my sessionInfo:
R version 3.3.2 (2016-10-31) Platform: x86_64-pc-linux-gnu (64-bit) Running under: Ubuntu 16.04.1 LTS locale:  LC_CTYPE=en_US.UTF-8 LC_NUMERIC=C LC_TIME=en_US.UTF-8 LC_COLLATE=en_US.UTF-8  LC_MONETARY=en_US.UTF-8 LC_MESSAGES=en_US.UTF-8 LC_PAPER=en_US.UTF-8 LC_NAME=C  LC_ADDRESS=C LC_TELEPHONE=C LC_MEASUREMENT=en_US.UTF-8 LC_IDENTIFICATION=C attached base packages:  stats4 parallel stats graphics grDevices utils datasets methods base other attached packages:  gplots_3.0.1 DESeq_1.24.0 lattice_0.20-34 locfit_1.5-9.1  GenomicFeatures_1.24.5 AnnotationDbi_1.34.4 QuasR_1.12.0 Rbowtie_1.12.0  Rqc_1.6.2 ggplot2_2.2.0 ShortRead_1.30.0 GenomicAlignments_1.8.4  SummarizedExperiment_1.2.3 Biobase_2.32.0 Rsamtools_1.24.0 GenomicRanges_1.24.3  GenomeInfoDb_1.8.7 Biostrings_2.40.2 XVector_0.12.1 IRanges_2.6.1  S4Vectors_0.10.3 BiocGenerics_0.18.0 BiocParallel_1.6.6 BiocInstaller_1.22.3 loaded via a namespace (and not attached):  httr_1.2.1 AnnotationHub_2.4.2 splines_3.3.2 gtools_3.5.0  GenomicFiles_1.8.0 Formula_1.2-1 shiny_0.14.2 assertthat_0.1  interactiveDisplayBase_1.10.3 latticeExtra_0.6-28 BSgenome_1.40.1 RSQLite_1.1  biovizBase_1.20.0 digest_0.6.10 RColorBrewer_1.1-2 colorspace_1.3-1  htmltools_0.3.5 httpuv_1.3.3 Matrix_1.2-7.1 plyr_1.8.4  XML_3.98-1.5 biomaRt_2.28.0 genefilter_1.54.2 zlibbioc_1.18.0  xtable_1.8-2 scales_0.4.1 gdata_2.17.0 htmlTable_1.7  tibble_1.2 annotate_1.50.1 nnet_7.3-12 lazyeval_0.2.0  survival_2.40-1 magrittr_1.5 mime_0.5 memoise_1.0.0  hwriter_1.3.2 foreign_0.8-67 tools_3.3.2 data.table_1.9.8  BiocStyle_2.0.3 stringr_1.1.0 munsell_0.4.3 cluster_2.0.5  ensembldb_1.4.7 caTools_1.17.1 grid_3.3.2 RCurl_1.95-4.8  dichromat_2.0-0 VariantAnnotation_1.18.7 bitops_1.0-6 gtable_0.2.0  DBI_0.5-1 markdown_0.7.7 reshape2_1.4.2 R6_2.2.0  gridExtra_2.2.1 knitr_1.15.1 rtracklayer_1.32.2 Hmisc_4.0-0  KernSmooth_2.23-15 stringi_1.1.2 Rcpp_0.12.8 geneplotter_1.50.0  rpart_4.1-10 acepack_1.4.1
Thanks in advance for any help on this!
Kirt Onthank, Walla Walla University
I don't have an answer (yet), I am just guessing: The Octopus genome consists of many scaffolds (over 150000, most of them less than 500nt long). This might cause the problems you encounter. What happens if you just test aligning your reads against the human genome with the same parameters?
Also, the aligner used in quasr for "splicedAlignment=T" is very slow. Does it actually make sense to use it given the small size of the scaffolds? How long are your reads?
Thanks for the suggestion. I aligned using the small test file (1000 reads) that failed when aligning to the Octopus genome, and instead use the Human genome. That worked perfectly.
I am using splicedAlignment=T because I had run it before with splicedAlignment=F and came away with a pretty low % of reads mapped (2.1-3.6%, depending on the sample), so my hope is to run again allowing spliced alignment to see if that improves somewhat. It is one of several things I am trying.
So, it does appear to be the number of scaffolds, or at least something, in the Octopus genome. Is there any way around this?