Question

Multithreading problems with qalign

0

Entering edit mode

kirt.onthank • 0

@kirtonthank-11808

Last seen 9.2 years ago

I am trying to map RNA-seq reads in a fastq file to the Octopus bimaculoides genome (Obimaculoides_280.fa from Metazome) using qalign in the package QuasR. I am running this on a Dell Optiplex 7010 with 32GB of RAM. I am attempting to multithread the command using the following:

aligned=qAlign('targets.txt',genome='Obimaculoides_280.fa',maxHits=1, splicedAlignment=T,clObj=makeCluster(8))

I ran this with one single, small fastq file (containing only 1000 reads), but I get the same behavior with any size of fastq file when mapping to this genome. From this I receive the following output and error with traceback:

alignment files missing - need to:
    create 1 genomic alignment(s)
will start in ..9s..8s..7s..6s..5s..4s..3s..2s..1s
Testing the compute nodes...OK
Loading QuasR on the compute nodes...OK
Available cores:
nodeNames
onthank 
      8 
Performing genomic alignments for 1 samples. See progress in the log file:
/home/onthank/Documents/Octopus_Transcriptome/RNA-Seq_results/octo1_split/QuasR_log_75d0f207738.txt
[samopen] SAM header is present: 379696 sequences.
Error in checkForRemoteErrors(val) : 
  one node produced an error: Error on onthank processing sample /home/onthank/Documents/Octopus_Transcriptome/RNA-Seq_results/octo1_split/xab1000.fastq : failed to open SAM/BAM file
  file: '/tmp/RtmpURHs4A/samToBam_2e6c1fe22711/Scaffold1006.sam'
> traceback()
9: stop("one node produced an error: ", firstmsg, domain = NA)
8: checkForRemoteErrors(val)
7: staticClusterApply(cl, fun, length(x), argfun)
6: clusterApply(cl, x = splitList(X, length(cl)), fun = lapply, 
       fun, ...)
5: do.call(c, clusterApply(cl, x = splitList(X, length(cl)), fun = lapply, 
       fun, ...), quote = TRUE)
4: parLapply(clObjNR, paramsListGenomic, createGenomicAlignmentsController)
3: parLapply(clObjNR, paramsListGenomic, createGenomicAlignmentsController)
2: createGenomicAlignments(proj, clObj)
1: qAlign("targets.txt", genome = "Obimaculoides_280.fa", maxHits = 1, 
       splicedAlignment = FALSE, clObj = makeCluster(8))

The contents of the QuasR Log are:

[1] "Executing bowtie on onthank using 8 cores. Parameters:"
[1] "'/home/onthank/Documents/Octopus_Transcriptome/RNA-Seq_results/octo1_split/Obimaculoides_280.fa.Rbowtie/bowtieIndex' '/home/onthank/Documents/Octopus_Transcriptome/RNA-Seq_results/octo1_split/xab1000.fastq' -m 1 --best --strata --phred33-quals -S -p 8 '/tmp/RtmpURHs4A/xab1000.fastq2e6cb8dcf29.sam'"
[1] "Converting sam file to sorted bam file on onthank : /tmp/RtmpURHs4A/xab1000.fastq2e6cb8dcf29.sam"
[1] "Error on onthank processing sample /home/onthank/Documents/Octopus_Transcriptome/RNA-Seq_results/octo1_split/xab1000.fastq : failed to open SAM/BAM file\n  file: '/tmp/RtmpURHs4A/samToBam_2e6c1fe22711/Scaffold1006.sam'"

When the process gets to converting the first sam file to bam files, I notice that my RAM starts filling up and I get the error when it finally fills up all 32GB. The R temp folder invariably has 1005 files in the samToBam folder and the contents are 11.0GB. I have tried this on a couple different computers, one with only 24GB of RAM, and the R temp folder still has 1005 files in the samToBam folder and its contents are still 11.0GB. I have tried trimming down my fastq files, (my original six files had 42-69M reads each) but I get this error even if I run a single fastq with only 16 reads. If I set clObj=makeCluster(1), everything works just fine, it just takes about a month to run on my whole dataset.

Here is my sessionInfo:

R version 3.3.2 (2016-10-31)
Platform: x86_64-pc-linux-gnu (64-bit)
Running under: Ubuntu 16.04.1 LTS

locale:
 [1] LC_CTYPE=en_US.UTF-8       LC_NUMERIC=C               LC_TIME=en_US.UTF-8        LC_COLLATE=en_US.UTF-8    
 [5] LC_MONETARY=en_US.UTF-8    LC_MESSAGES=en_US.UTF-8    LC_PAPER=en_US.UTF-8       LC_NAME=C                 
 [9] LC_ADDRESS=C               LC_TELEPHONE=C             LC_MEASUREMENT=en_US.UTF-8 LC_IDENTIFICATION=C       

attached base packages:
[1] stats4    parallel  stats     graphics  grDevices utils     datasets  methods   base     

other attached packages:
 [1] gplots_3.0.1               DESeq_1.24.0               lattice_0.20-34            locfit_1.5-9.1            
 [5] GenomicFeatures_1.24.5     AnnotationDbi_1.34.4       QuasR_1.12.0               Rbowtie_1.12.0            
 [9] Rqc_1.6.2                  ggplot2_2.2.0              ShortRead_1.30.0           GenomicAlignments_1.8.4   
[13] SummarizedExperiment_1.2.3 Biobase_2.32.0             Rsamtools_1.24.0           GenomicRanges_1.24.3      
[17] GenomeInfoDb_1.8.7         Biostrings_2.40.2          XVector_0.12.1             IRanges_2.6.1             
[21] S4Vectors_0.10.3           BiocGenerics_0.18.0        BiocParallel_1.6.6         BiocInstaller_1.22.3      

loaded via a namespace (and not attached):
 [1] httr_1.2.1                    AnnotationHub_2.4.2           splines_3.3.2                 gtools_3.5.0                 
 [5] GenomicFiles_1.8.0            Formula_1.2-1                 shiny_0.14.2                  assertthat_0.1               
 [9] interactiveDisplayBase_1.10.3 latticeExtra_0.6-28           BSgenome_1.40.1               RSQLite_1.1                  
[13] biovizBase_1.20.0             digest_0.6.10                 RColorBrewer_1.1-2            colorspace_1.3-1             
[17] htmltools_0.3.5               httpuv_1.3.3                  Matrix_1.2-7.1                plyr_1.8.4                   
[21] XML_3.98-1.5                  biomaRt_2.28.0                genefilter_1.54.2             zlibbioc_1.18.0              
[25] xtable_1.8-2                  scales_0.4.1                  gdata_2.17.0                  htmlTable_1.7                
[29] tibble_1.2                    annotate_1.50.1               nnet_7.3-12                   lazyeval_0.2.0               
[33] survival_2.40-1               magrittr_1.5                  mime_0.5                      memoise_1.0.0                
[37] hwriter_1.3.2                 foreign_0.8-67                tools_3.3.2                   data.table_1.9.8             
[41] BiocStyle_2.0.3               stringr_1.1.0                 munsell_0.4.3                 cluster_2.0.5                
[45] ensembldb_1.4.7               caTools_1.17.1                grid_3.3.2                    RCurl_1.95-4.8               
[49] dichromat_2.0-0               VariantAnnotation_1.18.7      bitops_1.0-6                  gtable_0.2.0                 
[53] DBI_0.5-1                     markdown_0.7.7                reshape2_1.4.2                R6_2.2.0                     
[57] gridExtra_2.2.1               knitr_1.15.1                  rtracklayer_1.32.2            Hmisc_4.0-0                  
[61] KernSmooth_2.23-15            stringi_1.1.2                 Rcpp_0.12.8                   geneplotter_1.50.0           
[65] rpart_4.1-10                  acepack_1.4.1

Thanks in advance for any help on this!

Kirt Onthank, Walla Walla University

quasr parallel • 2.4k views

ADD COMMENT • link updated 9.2 years ago by Hotz, Hans-Rudolf ▴ 400 • written 9.2 years ago by kirt.onthank • 0

0

Entering edit mode

Hi Kirk

I don't have an answer (yet), I am just guessing: The Octopus genome consists of many scaffolds (over 150000, most of them less than 500nt long). This might cause the problems you encounter. What happens if you just test aligning your reads against the human genome with the same parameters?

Also, the aligner used in quasr for "splicedAlignment=T" is very slow. Does it actually make sense to use it given the small size of the scaffolds? How long are your reads?

Regards, Hans-Rudolf

ADD REPLY • link 9.2 years ago Hotz, Hans-Rudolf ▴ 400

0

Entering edit mode

Hans-Rudolf,

Thanks for the suggestion. I aligned using the small test file (1000 reads) that failed when aligning to the Octopus genome, and instead use the Human genome. That worked perfectly.

I am using splicedAlignment=T because I had run it before with splicedAlignment=F and came away with a pretty low % of reads mapped (2.1-3.6%, depending on the sample), so my hope is to run again allowing spliced alignment to see if that improves somewhat. It is one of several things I am trying.

So, it does appear to be the number of scaffolds, or at least something, in the Octopus genome. Is there any way around this?

Thanks,
Kirt

ADD REPLY • link 9.2 years ago kirt.onthank • 0

score 1 · Answer 1 · 2016-12-05

1

Entering edit mode

Hotz, Hans-Rudolf ▴ 400

@hotz-hans-rudolf-3951

Last seen 5.3 years ago

Switzerland

Hi Kirt

with "SpliceMap" (i.e. the alignment tool used when 'splicedAlignment=T') you don't have much flexibility. I would try to align your reads against the 200 longest scaffolds firts, then allign the un-mapped reads against the next 200 and so on. Though this is probably as slow as doing the alignmentt with just one core.

You can try merging the small scaffoolds into artificial longer scaffolds with many Ns in between.

Or you can try another aligner, e.g. STAR. Tough I don't know how STAR deals with so many scaffolds

Regards, Hans-Rudolf

ADD COMMENT • link 9.2 years ago Hotz, Hans-Rudolf ▴ 400

0

Entering edit mode

Hans-Rudolf,

Thanks so much for the help and ideas. I will chase those down.

Hypothetically, If I don't use SpliceMap, (say 'splicedAlignment=T' isn't increasing my alignment %), might there be a way around this problem? Or am I still in the same hole?

Thanks again, Kirt

ADD REPLY • link 9.2 years ago kirt.onthank • 0

0

Entering edit mode

sorry, I don't understand your question, now. You have said it worked (though, with a low mapping t rate) when using 'splicedAlignment=F', haven't you?

ADD REPLY • link 9.2 years ago Hotz, Hans-Rudolf ▴ 400

0

Entering edit mode

What makes it work is setting clObj=makeCluster(1) and only using a single core. The issue with slicedAlignment is simply that if it is set to T, the alignment takes an unmanageable amount of time on a single core.

When I am using more than one core the R temp directory fills up with 11.0 GB of temp files, it appears it is making a different temp .sam file for each scaffold. The final directory size is 1005 files, and fails to find the 1006th (which is what actually throws the error). The R temp directory gets to the same size on several machines I have tried with different amounts of RAM. As the genome has far more than 1005 scaffolds, this makes me think that R is hitting some kinda of limit on temp directory size. I have been looking around to see if there is a R temp directory size limit or something in QuasR, but come up with nothing. Do you know of anything? Any other ideas why the temp directory always fills up to 1005 files at 11.0GB?

ADD REPLY • link 9.2 years ago kirt.onthank • 0