Hi,
I am analysing coverage data with the TEQC package from Bioconductor
for quality assessment of a target enrichment experiment.
I am running the analysis on a compute cluster and asked for a large
memory allocation. My BAM files are 11 GB in size, and the analysis
runs for several hours before my session exits. Do I need to ask for
this to be put on a long queue (jobs of more than 12 hours)? Do people
use TEQC with large files? How can I make this analysis more efficient?
These are my commands:
# get reads
myread <- get.reads("reads.bam", filetype="bam")
# get read pairs: this is the step that fails. The documentation states:
# "To run the function can be quite time consuming, depending on the
# number of reads"
myreadpair <- reads2pairs(myread)
# drop single reads
myread <- myread[!(myread$ID %in% myreadpair$singleReads$ID), , drop=TRUE]
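To make the intent of the last command concrete, here is a minimal base R sketch of the same filter on a toy data frame. It is an illustration only and does not use TEQC: the data, the column layout and the helper find_single_ids() are hypothetical, and real TEQC reads objects are not plain data frames. It treats a read as "single" when its ID does not occur exactly twice, or when its two reads lie on different chromosomes, which is how the message printed by reads2pairs describes them.

```r
# Toy stand-in for a reads table (hypothetical data, not TEQC output).
toy_reads <- data.frame(
  chr   = c("chr1", "chr1", "chr1", "chr2", "chr1", "chr2"),
  start = c(100L, 300L, 500L, 120L, 900L, 950L),
  end   = c(150L, 350L, 550L, 170L, 950L, 1000L),
  ID    = c("r1", "r1", "r2", "r3", "r4", "r4"),
  stringsAsFactors = FALSE
)

# Hypothetical helper: IDs that do not form a proper pair, i.e. the ID
# does not occur exactly twice or its reads map to more than one chromosome.
find_single_ids <- function(reads) {
  n_reads <- tapply(reads$chr, reads$ID, length)
  n_chr   <- tapply(reads$chr, reads$ID, function(x) length(unique(x)))
  names(n_reads)[as.vector(n_reads != 2 | n_chr != 1)]
}

single_ids <- find_single_ids(toy_reads)

# Same indexing pattern as in the commands above: keep only paired reads.
kept <- toy_reads[!(toy_reads$ID %in% single_ids), , drop = FALSE]
print(single_ids)
print(kept)
```

On the toy data only the two "r1" rows survive, because "r2" and "r3" have no mate and the two "r4" reads sit on different chromosomes.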
I have used these functions efficiently on smaller files with MiSeq
data, but not yet with HiSeq data.
Many thanks for sharing your experience of getting QC for large files
efficiently.
Nathalie
> sessionInfo()
R version 2.15.0 (2012-03-30)
Platform: x86_64-unknown-linux-gnu (64-bit)
locale:
[1] LC_CTYPE=en_GB.UTF-8 LC_NUMERIC=C
[3] LC_TIME=en_GB.UTF-8 LC_COLLATE=en_GB.UTF-8
[5] LC_MONETARY=en_GB.UTF-8 LC_MESSAGES=C
[7] LC_PAPER=C LC_NAME=C
[9] LC_ADDRESS=C LC_TELEPHONE=C
[11] LC_MEASUREMENT=en_GB.UTF-8 LC_IDENTIFICATION=C
attached base packages:
[1] stats graphics grDevices utils datasets methods base
other attached packages:
[1] TEQC_2.4.0 hwriter_1.3 Rsamtools_1.8.4
[4] Biostrings_2.24.1 GenomicRanges_1.8.3 IRanges_1.14.2
[7] BiocGenerics_0.2.0
loaded via a namespace (and not attached):
[1] Biobase_2.16.0 bitops_1.0-4.1 stats4_2.15.0 zlibbioc_1.2.0
--
The Wellcome Trust Sanger Institute is operated by Genome Research
Limited, a charity registered in England with number 1021457 and a
company registered in England with number 2742969, whose registered
office is 215 Euston Road, London, NW1 2BE.
Hi,
This is the error message produced at the
myreadpair <- reads2pairs(myread) stage after it had been running for
7 hours:
> readpairs4_2_PigS <- reads2pairs(reads4_2_PigS)
[1] "there were 1453928 reads found without matching second read, or whose second read matches to a different chromosome"
Error in endoapply(reads, mergefun) :
  'FUN' did not produce an endomorphism
> Terminated
That may help.
Thanks,
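Since the failure above only surfaced after 7 hours, one general way to localise such errors is to run a long step chunk by chunk and catch errors per chunk. The sketch below shows the pattern in base R on toy data. It is not a TEQC recipe: slow_step() is a hypothetical stand-in that fails on purpose for "chr2", and whether reads2pairs can be applied to per-chromosome subsets with equivalent results is an assumption that would need checking.

```r
# Hypothetical stand-in for an expensive step; it fails deliberately on
# "chr2" so that the error handling is visible.
slow_step <- function(reads) {
  if (any(reads$chr == "chr2")) stop("simulated failure")
  nrow(reads)
}

# Toy reads table (hypothetical data).
toy_reads <- data.frame(
  chr = c("chr1", "chr1", "chr2", "chr3"),
  ID  = c("r1", "r1", "r2", "r3"),
  stringsAsFactors = FALSE
)

# Split by chromosome and run the step on each chunk; an error in one
# chunk is reported and recorded as NULL instead of ending the whole run.
by_chr <- split(toy_reads, toy_reads$chr)
results <- lapply(by_chr, function(chunk) {
  tryCatch(
    slow_step(chunk),
    error = function(e) {
      message("failed on ", chunk$chr[1], ": ", conditionMessage(e))
      NULL
    }
  )
})

# Names of the chunks that failed.
failed <- names(results)[vapply(results, is.null, logical(1))]
print(failed)
```

With this layout a failing chunk is named in the log, and the chunks that succeeded do not have to be recomputed.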
On 13/06/12 12:07, nathalie wrote: