Question

easyRNASeq Error in counting from C.elegans

0

Entering edit mode

Yonggan Wu ▴ 40

@yonggan-wu-5303

Last seen 9.6 years ago

Dear All and Nico, Have anyone be able to do the counting from C.elegans? Here is what I tried but result in no luck: 1) bam was generate from tophat1.4.1, with ucsc ce6 as reference. For easyRNASeq counting, the *.rda was used as reference (see the code below) obj <- fetchAnnotation(new('RNAseq', organismName="Celegans" ), method="biomaRt") gAnnot <- genomicAnnotation(obj) save(gAnnot,file="ensemble_gAnnot_hsa.rda") Here is the error message: > rnaSeq=easyRNASeq(filesDirectory="/cluster/easyrnaseq", + #organism="Celegans", + organism="Celegans", + chr.sizes=chr.sizes, + readLength=50L, + annotationMethod="rda", + annotationFile="/cluster/database/gtf/cel/ensemble_gAnnot_cel.rda", + #annotationMethod="biomaRt", + format="bam", + outputFormat="RNAseq", + gapped=TRUE,#for tophat bam + normalize=TRUE, + filenames="ucsc.bam", + count="genes", + summarization="geneModels", + validity.check=FALSE, + nbCore=5 + ) Checking arguments... Fetching annotations... Computing gene models... Summarizing counts... Processing ucsc.bam Error in aggregate.data.frame(as.data.frame(x), ...) : no rows to aggregate In addition: Warning messages: 1: In easyRNASeq(filesDirectory = "/cluster/easyrnaseq", : There are 8526 synthetic exons as determined from your annotation that overlap! This implies that some reads will be counted more than once! Is that really what you want? 2: In min(match(uniqueClasses, classOrder)) : no non-missing arguments to min; returning Inf 3: In SimpleAtomicList(result) : NAs introduced by coercion 2) I tried the same code with human bam file (except the species was changed to human), it works. 3) I also re-did the alignment with ensemble genome references, but still failed while counting. 4) The annotation files was changed to "gtf", and the gtf was download from ensemble. again, it is not working. 5) The organism was changed to "custom", still not working. All hints on this issue were very much appreciated. -- All The Best, Yonggan ---------------------------------------------------------------------- -------------------- Yonggan Wu, M.S. Bioinformatic Scientist Ocean Ridge Biosciences LLC 10475 Riverside Drive Suite 1 Palm Beach Gardens, FL 33410 Phone : 561-223-3152 Fax : 561-740-8710 yongganw@oceanridgebio.com <mailto:yongganw@oceanridgebio.com> http://www.oceanridgebio.com <http: www.oceanridgebio.com=""/> [[alternative HTML version deleted]]

Alignment Annotation Organism Alignment Annotation Organism • 1.2k views

ADD COMMENT • link updated 11.9 years ago by delhomme@embl.de ★ 1.2k • written 11.9 years ago by Yonggan Wu ▴ 40

score 0 · Answer 1 · 2012-05-25

0

Entering edit mode

delhomme@embl.de ★ 1.2k

@delhommeemblde-3232

Last seen 9.6 years ago

Dear Yonggan, Can you please rerun the command with validity.check=TRUE? That would tell us if there are inconsistencies between the chromosome names retrieved from biomaRt and the ones present in your bam file, which is what I suspect. Can you as well indicate you R and easyRNASeq version, i.e. pasting the output of the sessionInfo() command, once you have loaded easyRNASeq? There has been similar question on the mailing list previously, please see the posts: http://thread.gmane.org/gmane.science.biology.informatics.conductor/38 983 http://thread.gmane.org/gmane.science.biology.informatics.conductor/39 629/focus=39684 as they might be of some help to you. In the development version, (http://bioconductor.org/packages/2.11/bioc/html/easyRNASeq.html, version 1.3.3), the vignette has been updated to contain these use cases. The vignette has been clarified and important points to process RNA-Seq data now stand out. HTH, Nico --------------------------------------------------------------- Nicolas Delhomme Genome Biology Computational Support European Molecular Biology Laboratory Tel: +49 6221 387 8310 Email: nicolas.delhomme at embl.de Meyerhofstrasse 1 - Postfach 10.2209 69102 Heidelberg, Germany --------------------------------------------------------------- On May 24, 2012, at 7:15 PM, Yonggan Wu wrote: > > Dear All and Nico, > > Have anyone be able to do the counting from C.elegans? > > Here is what I tried but result in no luck: > 1) bam was generate from tophat1.4.1, with ucsc ce6 as reference. For easyRNASeq counting, the *.rda was used as reference (see the code below) > obj <- fetchAnnotation(new('RNAseq', > organismName="Celegans" > ), > method="biomaRt") > gAnnot <- genomicAnnotation(obj) > save(gAnnot,file="ensemble_gAnnot_hsa.rda") > Here is the error message: > > rnaSeq=easyRNASeq(filesDirectory="/cluster/easyrnaseq", > + #organism="Celegans", > + organism="Celegans", > + chr.sizes=chr.sizes, > + readLength=50L, > + annotationMethod="rda", > + annotationFile="/cluster/database/gtf/cel/ensemble_gAnnot_cel.rda", > + #annotationMethod="biomaRt", > + format="bam", > + outputFormat="RNAseq", > + gapped=TRUE,#for tophat bam > + normalize=TRUE, > + filenames="ucsc.bam", > + count="genes", > + summarization="geneModels", > + validity.check=FALSE, > + nbCore=5 > + ) > Checking arguments... > Fetching annotations... > Computing gene models... > Summarizing counts... > Processing ucsc.bam > Error in aggregate.data.frame(as.data.frame(x), ...) : > no rows to aggregate > In addition: Warning messages: > 1: In easyRNASeq(filesDirectory = "/cluster/easyrnaseq", : > There are 8526 synthetic exons as determined from your annotation that overlap! This implies that some reads will be counted more than once! Is that really what you want? > 2: In min(match(uniqueClasses, classOrder)) : > no non-missing arguments to min; returning Inf > 3: In SimpleAtomicList(result) : NAs introduced by coercion > 2) I tried the same code with human bam file (except the species was changed to human), it works. > 3) I also re-did the alignment with ensemble genome references, but still failed while counting. > 4) The annotation files was changed to "gtf", and the gtf was download from ensemble. again, it is not working. > 5) The organism was changed to "custom", still not working. > > All hints on this issue were very much appreciated. > > -- > All The Best, > Yonggan > -------------------------------------------------------------------- ---------------------- > Yonggan Wu, M.S. > Bioinformatic Scientist > Ocean Ridge Biosciences LLC > 10475 Riverside Drive Suite 1 > Palm Beach Gardens, FL 33410 > Phone : 561-223-3152 > Fax : 561-740-8710 > yongganw at oceanridgebio.com > http://www.oceanridgebio.com >

ADD COMMENT • link 11.9 years ago delhomme@embl.de ★ 1.2k

0

Entering edit mode

Thank you Nico, I checked the two posts, it do give me some hints but did not solve my problem. Here is another run without validity.check, would you please let me know if you spot anything wrong? Thank you for your kindly help. > library(easyRNASeq) > library(BSgenome.Celegans.UCSC.ce10) > chr.sizes=as.list(seqlengths(Celegans)) > setwd("/cluster/easyrnaseq") > rnaSeq=easyRNASeq(filesDirectory="/cluster/easyrnaseq", + organism="Celegans", + #organism="custom", + chr.sizes=chr.sizes, + readLength=50L, + annotationMethod="rda", + annotationFile="/cluster/database/gtf/cel/ensemble_gAnnot_cel.rda", + #annotationMethod="biomaRt", + format="bam", + outputFormat="RNAseq", + gapped=TRUE,#for tophat bam + normalize=TRUE, + filenames="ucsc.bam", + count="genes", + summarization="geneModels", + #validity.check=FALSE, + nbCore=5 + ) Checking arguments... Fetching annotations... Computing gene models... Summarizing counts... Processing ucsc.bam Error in fetchCoverage(obj, format = format, filename = file, filter = filter, : _ The file /cluster/easyrnaseq/ucsc.bam contains reads of different sizes: 540, 50 . We cannot deal with such data at the moment. Please contact the authors to add this functionality._ In addition: Warning messages: 1: In easyRNASeq(filesDirectory = "/cluster/easyrnaseq", : You enforce UCSC chromosome conventions, however the provided chromosome size list is not compliant. Correcting it. 2: In easyRNASeq(filesDirectory = "/cluster/easyrnaseq", : You enforce UCSC chromosome conventions, however the provided annotation is not compliant. Correcting it. 3: In .convertToUCSC(names(genomicAnnotation(obj)), organismName(obj), : _ No function implemented to convert the names for the organism: Celegans. Available ones so far exists for: Dmelanogaster, Hsapiens, Mmusculus, Rnorvegicus._ If you need to provide your own mapping, use the 'custom' argument. 4: In easyRNASeq(filesDirectory = "/cluster/easyrnaseq", : There are 8526 synthetic exons as determined from your annotation that overlap! This implies that some reads will be counted more than once! Is that really what you want? 5: In fetchCoverage(obj, format = format, filename = file, filter = filter, : You enforce UCSC chromosome conventions, however the provided alignments are not compliant. Correcting it. > sessionInfo() R version 2.15.0 (2012-03-30) Platform: x86_64-unknown-linux-gnu (64-bit) locale: [1] LC_CTYPE=en_US.UTF-8 LC_NUMERIC=C [3] LC_TIME=en_US.UTF-8 LC_COLLATE=en_US.UTF-8 [5] LC_MONETARY=en_US.UTF-8 LC_MESSAGES=en_US.UTF-8 [7] LC_PAPER=C LC_NAME=C [9] LC_ADDRESS=C LC_TELEPHONE=C [11] LC_MEASUREMENT=en_US.UTF-8 LC_IDENTIFICATION=C attached base packages: [1] parallel stats graphics grDevices utils datasets methods [8] base other attached packages: [1] BSgenome.Celegans.UCSC.ce10_1.3.17 easyRNASeq_1.3.2 [3] ShortRead_1.15.5 latticeExtra_0.6-19 [5] RColorBrewer_1.0-5 lattice_0.20-6 [7] Rsamtools_1.9.12 DESeq_1.9.6 [9] locfit_1.5-8 BSgenome_1.25.1 [11] GenomicRanges_1.9.20 Biostrings_2.25.4 [13] IRanges_1.15.10 edgeR_2.7.11 [15] limma_3.13.4 biomaRt_2.13.1 [17] Biobase_2.17.5 genomeIntervals_1.13.0 [19] BiocGenerics_0.3.0 intervals_0.13.3 loaded via a namespace (and not attached): [1] annotate_1.35.2 AnnotationDbi_1.19.8 bitops_1.0-4.1 [4] DBI_0.2-5 genefilter_1.39.0 geneplotter_1.35.0 [7] grid_2.15.0 hwriter_1.3 RCurl_1.91-1 [10] RSQLite_0.11.1 splines_2.15.0 stats4_2.15.0 [13] survival_2.36-14 XML_3.9-4 xtable_1.7-0 [16] zlibbioc_1.3.0 Regarding the reads size error, I checked the length of each small RNA in the bam, they are exactly 50-nt. Best, Yonggan On 05/25/2012 02:55 AM, Nicolas Delhomme wrote: > Dear Yonggan, > > Can you please rerun the command with validity.check=TRUE? That would tell us if there are inconsistencies between the chromosome names retrieved from biomaRt and the ones present in your bam file, which is what I suspect. Can you as well indicate you R and easyRNASeq version, i.e. pasting the output of the sessionInfo() command, once you have loaded easyRNASeq? > > There has been similar question on the mailing list previously, please see the posts: > http://thread.gmane.org/gmane.science.biology.informatics.conductor/ 38983 > http://thread.gmane.org/gmane.science.biology.informatics.conductor/ 39629/focus=39684 > as they might be of some help to you. > > In the development version, (http://bioconductor.org/packages/2.11/bioc/html/easyRNASeq.html, version 1.3.3), the vignette has been updated to contain these use cases. The vignette has been clarified and important points to process RNA-Seq data now stand out. > > HTH, > > Nico > > --------------------------------------------------------------- > Nicolas Delhomme > > Genome Biology Computational Support > > European Molecular Biology Laboratory > > Tel: +49 6221 387 8310 > Email: nicolas.delhomme@embl.de > Meyerhofstrasse 1 - Postfach 10.2209 > 69102 Heidelberg, Germany > --------------------------------------------------------------- > > > > > > On May 24, 2012, at 7:15 PM, Yonggan Wu wrote: > >> Dear All and Nico, >> >> Have anyone be able to do the counting from C.elegans? >> >> Here is what I tried but result in no luck: >> 1) bam was generate from tophat1.4.1, with ucsc ce6 as reference. For easyRNASeq counting, the *.rda was used as reference (see the code below) >> obj<- fetchAnnotation(new('RNAseq', >> organismName="Celegans" >> ), >> method="biomaRt") >> gAnnot<- genomicAnnotation(obj) >> save(gAnnot,file="ensemble_gAnnot_hsa.rda") >> Here is the error message: >>> rnaSeq=easyRNASeq(filesDirectory="/cluster/easyrnaseq", >> + #organism="Celegans", >> + organism="Celegans", >> + chr.sizes=chr.sizes, >> + readLength=50L, >> + annotationMethod="rda", >> + annotationFile="/cluster/database/gtf/cel/ensemble_gAnnot_cel.rda", >> + #annotationMethod="biomaRt", >> + format="bam", >> + outputFormat="RNAseq", >> + gapped=TRUE,#for tophat bam >> + normalize=TRUE, >> + filenames="ucsc.bam", >> + count="genes", >> + summarization="geneModels", >> + validity.check=FALSE, >> + nbCore=5 >> + ) >> Checking arguments... >> Fetching annotations... >> Computing gene models... >> Summarizing counts... >> Processing ucsc.bam >> Error in aggregate.data.frame(as.data.frame(x), ...) : >> no rows to aggregate >> In addition: Warning messages: >> 1: In easyRNASeq(filesDirectory = "/cluster/easyrnaseq", : >> There are 8526 synthetic exons as determined from your annotation that overlap! This implies that some reads will be counted more than once! Is that really what you want? >> 2: In min(match(uniqueClasses, classOrder)) : >> no non-missing arguments to min; returning Inf >> 3: In SimpleAtomicList(result) : NAs introduced by coercion >> 2) I tried the same code with human bam file (except the species was changed to human), it works. >> 3) I also re-did the alignment with ensemble genome references, but still failed while counting. >> 4) The annotation files was changed to "gtf", and the gtf was download from ensemble. again, it is not working. >> 5) The organism was changed to "custom", still not working. >> >> All hints on this issue were very much appreciated. >> >> -- >> All The Best, >> Yonggan >> ------------------------------------------------------------------- ----------------------- >> Yonggan Wu, M.S. >> Bioinformatic Scientist >> Ocean Ridge Biosciences LLC >> 10475 Riverside Drive Suite 1 >> Palm Beach Gardens, FL 33410 >> Phone : 561-223-3152 >> Fax : 561-740-8710 >> yongganw@oceanridgebio.com >> http://www.oceanridgebio.com >> [[alternative HTML version deleted]]

ADD REPLY • link 11.9 years ago Yonggan Wu ▴ 40