Question

TEQC package very slow

nac ▴ 280

@nac-4545

Hi,

I am analysing coverage data with the TEQC package from Bioconductor for quality assessment of a target enrichment experiment. I am running the analysis on a compute cluster and requested a large memory allocation. My BAM files are 11 Gb in size, and the analysis runs for several hours before my session exits.

Do I need to submit this to a long queue (jobs of more than 12 hours)? Do people use TEQC with large files? How can I make this analysis more efficient?

These are my commands:

    # get reads
    myread <- get.reads("reads.bam", filetype="bam")
    # get read pairs: this is the step that fails; the documentation states
    # "To run the function can be quite time consuming, depending on the number of reads"
    myreadpair <- reads2pairs(myread)
    # drop single reads
    myread <- myread[!(myread$ID %in% myreadpair$singleReads$ID), , drop=TRUE]

I have used these functions successfully on smaller files with MiSeq data, but not yet with HiSeq data.

Many thanks for sharing your experience in getting QC for large files efficiently.

Nathalie

    > sessionInfo()
    R version 2.15.0 (2012-03-30)
    Platform: x86_64-unknown-linux-gnu (64-bit)

    locale:
     [1] LC_CTYPE=en_GB.UTF-8       LC_NUMERIC=C
     [3] LC_TIME=en_GB.UTF-8        LC_COLLATE=en_GB.UTF-8
     [5] LC_MONETARY=en_GB.UTF-8    LC_MESSAGES=C
     [7] LC_PAPER=C                 LC_NAME=C
     [9] LC_ADDRESS=C               LC_TELEPHONE=C
    [11] LC_MEASUREMENT=en_GB.UTF-8 LC_IDENTIFICATION=C

    attached base packages:
    [1] stats     graphics  grDevices utils     datasets  methods   base

    other attached packages:
    [1] TEQC_2.4.0          hwriter_1.3         Rsamtools_1.8.4
    [4] Biostrings_2.24.1   GenomicRanges_1.8.3 IRanges_1.14.2
    [7] BiocGenerics_0.2.0

    loaded via a namespace (and not attached):
    [1] Biobase_2.16.0 bitops_1.0-4.1 stats4_2.15.0  zlibbioc_1.2.0

--
The Wellcome Trust Sanger Institute is operated by Genome Research Limited, a charity registered in England with number 1021457 and a company registered in England with number 2742969, whose registered office is 215 Euston Road, London, NW1 2BE.

Coverage TEQC

score 0 · Answer 1 · 2012-06-13

Hi,

This is the error message produced at the myreadpair<-reads2pairs(myread) stage after it had been running for 7 hours:

    > readpairs4_2_PigS<-reads2pairs(reads4_2_PigS)
    [1] "there were 1453928 reads found without matching second read, or whose second read matches to a different chromosome"
    Error in endoapply(reads, mergefun) :
      'FUN' did not produce an endomorphism
    > Terminated

That may help.

Thanks,
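One way to look at the unmatched reads reported above, before calling reads2pairs, might be to count how often each read ID occurs. The following is only a minimal base-R sketch on a toy ID vector: the vector `ids` and its values are hypothetical stand-ins for something like `myread$ID`, no TEQC function is used, and the check does not catch pairs whose second read maps to a different chromosome.

```r
# Toy stand-in for the read IDs (hypothetical values); with real data this
# would be something like myread$ID from the commands above.
ids <- c("r1", "r1", "r2", "r3", "r3", "r4", "r4", "r4")

# Count how often each read ID occurs.
id_counts <- table(ids)

# IDs seen exactly twice are candidate pairs; any other count is suspicious.
paired_ids   <- names(id_counts)[id_counts == 2]
unpaired_ids <- names(id_counts)[id_counts != 2]

# Logical mask that keeps only reads whose ID occurs exactly twice.
keep <- ids %in% paired_ids

print(unpaired_ids)
print(sum(keep))
```

Whether a mask like `keep` can be applied directly to the object returned by get.reads depends on its class, so that step is left out here.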