easyRNASeq error
4
0
Entering edit mode
@delhommeemblde-3232
Last seen 7.1 years ago
Dear Wade, Thanks for the nice words, that's good to hear! Although I have to say it got me a little worried, because the more people praise, the more difficult the questions to come ;-) I added the bioconductor mailing list in Cc, so that it benefits others. The first error, I was surprised to see. From the parallel package documentation, I did not expect it, but it got well extended since I started using it and is now obvious that the MakeForkCluster would not work on windows. As I don't have access to any Windows machine, I would like to give you the code that I'll write to replace that per email (off list) so that you can try it and tell me if it fixes the issue. Would that be OK? For the second error, you are correct. Note that it is NOT necessary to use the BSgenome for easyRNASeq, that's just that as they are available in Bioconductor it's easy to use them for exemplifying, but they are not mandatory. I need to state that more explicitly in the vignette. Back to your issue, I'm expecting in the alignment file and in the chr.sizes object chromosome names that follow UCSC conventions, e.g. chr1 to chr19 and chrX, chrY and chrM for the M.musculus for example. If it's not what I get, I try to "guess" them but obviously, since every sequencing facility has its own convention, it is not a straightforward business. For that reason, I've implemented a pass-by. To use it, we need to do the following 1) create a data.frame that contains the mapping from your different input. As you're using biomaRt as an annotationMethod, we will have three sets of chromosome names: a) biomaRT: 1:19,"X","Y" b) your alignment file: chr1.fa to chr19.fa, chrX.fa, chrY.fa, etc... c) the chromosome file: chr1 to chr19, chrX, chrY, etc... that we need to combine together. Since c) follows the UCSC convention, we'll use that as our standard. I'm not sure what you want to do about the NM, QC, RM and splice_sites-auto.fa, so I ignore them. From the name, you probably don't want to ignore that last one, but I'm not sure how you can use it. Here we go chr.map <- data.frame(from=c(1:19,"X","Y",paste("chr",c(1:19,"X","Y"), ".fa",sep="")),to=rep(paste("chr",c(1:19,"X","Y"),sep=""),2)) 2) rerun you command with the chr.map argument and a "custom" organism argument rnaSeq2 <- easyRNASeq(filesDirectory=extdataDir, organism="custom", chr.sizes=chr.sizes, filter=myfilt, readLength=50L, annotationMethod="biomaRt", format="aln", count="exons", pattern="^1-Feb_ATCACG_L003_R1_001_export.txt.gz", outputFormat="RNAseq", chr.map=chr.map ) Note that I haven't tried that myself, I always had a least two set of chromosomes in agreement, so it might fail. If so, it would be great if I could get access to an excerpt of your data to try it out. Finally, note that there's another way to by-pass this issue, if you are not using biomaRt but a local annotation source (gff, gtf, rda, etc...). If this file would contain the same chromosome names as your alignment file, you could do the following: 1) change the chromosome name in the chr.sizes to add a ".fa" at the end chr.sizes <- as.list(seqlengths(Mmusculus)) names(chr.sizes) <- paste(names(chr.sizes),"fa",sep=".") 2) rerun your command with that new chr.sizes and precise validity.check=FALSE in the easyRNASeq command rnaSeq2 <- easyRNASeq(filesDirectory=extdataDir, organism="Mmusculus", chr.sizes=chr.sizes, filter=myfilt, readLength=50L, annotationMethod="biomaRt", format="aln", count="exons", pattern="^1-Feb_ATCACG_L003_R1_001_export.txt.gz", outputFormat="RNAseq", validity.check=FALSE ) You need to be careful about what you're doing there, since you shut off all the consistency check on the chromosome naming. Cheers, Nico --------------------------------------------------------------- Nicolas Delhomme Genome Biology Computational Support European Molecular Biology Laboratory Tel: +49 6221 387 8310 Email: nicolas.delhomme at embl.de Meyerhofstrasse 1 - Postfach 10.2209 69102 Heidelberg, Germany --------------------------------------------------------------- On 29 Feb 2012, at 22:07, Davis, Wade wrote: > Dear Nicolas, > First , I want to thank you for the hard work you have put in making the easyRNASeq package. I am very familiar with R and BioConductor, and NGS statistical models. But I have been learning over the past month how to deal with raw NGS for the first time, and the number of different ways and packages is a bit overwhelming. Fortunately, I discovered your package, and it has simplified things greatly and helped me understand the workflow better to getting to a count table. So once again, thank you! I am sure you will have many users in the coming months. > > I have 2 questions. The first relates to the error below which seems to come from the summarization method I am using, which must involve trying to do some work in parallel. However, since the default nbCore is 1, I am surprised it still tries to use multiple processers (I am on a 8 core machine). Is there anything I can do about this? I really want to use the geneModels.. > My second question follows the output below. > > Thanks for your help and time! > > J. Wade Davis, PhD > Associate Professor, Department of Health Management and Informatics > Co-Director, Biostatistics and Research Design Unit > 187 Galena DC 018.0 > University of Missouri > Columbia, MO 65212 > Phone: (573) 882-0770 > Fax: (573) 884-4196 > www.biostat.missouri.edu > > > > > # run this code only the first time this way to get the annotations; > > # run the subsequent code for future use.... > > rnaSeq2 <- easyRNASeq(filesDirectory=extdataDir, > + organism="Mmusculus", > + chr.sizes=chr.sizes, > + filter=myfilt, > + readLength=50L, > + annotationMethod="biomaRt", > + # format="SolexaExport", > + format="aln", > + #withId=TRUE, > + count="genes", > + summarization= "geneModels", > + #count="exons", #generates error message > + #filenames=fastqfilenames[1], > + pattern="^1-Feb_ATCACG_L003_R1_001_export.txt.gz$", > + outputFormat="RNAseq" > + ) > Checking arguments... > Fetching annotations... > Computing gene models... > Error in makeForkCluster(nnodes = nbCore) : > fork clusters are not supported on Windows > > sessionInfo() > R Under development (unstable) (2012-02-28 r58513) > Platform: x86_64-pc-mingw32/x64 (64-bit) > > locale: > [1] LC_COLLATE=English_United States.1252 LC_CTYPE=English_United States.1252 LC_MONETARY=English_United States.1252 LC_NUMERIC=C > [5] LC_TIME=English_United States.1252 > > attached base packages: > [1] parallel stats graphics grDevices utils datasets methods base > > other attached packages: > [1] RnaSeqTutorial_0.0.7 BSgenome.Dmelanogaster.UCSC.dm3_1.3.17 BSgenome.Mmusculus.UCSC.mm9_1.3.17 easyRNASeq_1.1.7 > [5] ShortRead_1.13.13 latticeExtra_0.6-20 RColorBrewer_1.0-5 Rsamtools_1.7.37 > [9] DESeq_1.7.6 locfit_1.5-6 lattice_0.20-0 akima_0.5-7 > [13] BSgenome_1.23.2 GenomicRanges_1.7.28 Biostrings_2.23.6 IRanges_1.13.26 > [17] edgeR_2.5.9 limma_3.11.15 biomaRt_2.11.1 Biobase_2.15.3 > [21] BiocGenerics_0.1.7 genomeIntervals_1.11.0 intervals_0.13.3 > > loaded via a namespace (and not attached): > [1] annotate_1.33.2 AnnotationDbi_1.17.22 bitops_1.0-4.1 DBI_0.2-5 genefilter_1.37.0 geneplotter_1.33.1 grid_2.15.0 > [8] hwriter_1.3 RCurl_1.91-1.1 RSQLite_0.11.1 splines_2.15.0 survival_2.36-12 tools_2.15.0 XML_3.9-4.1 > [15] xtable_1.7-0 zlibbioc_1.1.1 > > library(parallel) > > ?makeForkClsuter > No documentation for ?makeForkClsuter? in specified packages and libraries: > you could try ???makeForkClsuter? > > > My second question is about a different error message, which seems very close to one you dealt with on the mailing list a few weeks ago, but the person never said if it was resolved: > > http://thread.gmane.org/gmane.science.biology.informatics.conductor/ 38841/focus=38858 > > Here is my code and the error: > > > rnaSeq2 <- easyRNASeq(filesDirectory=extdataDir, > + organism="Mmusculus", > + chr.sizes=chr.sizes, > + filter=myfilt, > + readLength=50L, > + annotationMethod="biomaRt", > + format="aln", > + count="exons", > + pattern="^1-Feb_ATCACG_L003_R1_001_export.txt.gz$", > + outputFormat="RNAseq" > + ) > Checking arguments... > Fetching annotations... > Summarizing counts... > Processing 1-Feb_ATCACG_L003_R1_001_export.txt.gz > Error in FUN(c("chr1.fa", "chr10.fa", "chr11.fa", "chr12.fa", "chr13.fa", : > (list) object cannot be coerced to type 'integer' > In addition: Warning message: > In easyRNASeq(filesDirectory = extdataDir, organism = "Mmusculus", : > There are 429316 features/exons defined in your annotation that overlap! This implies that some reads will be counted more than once! Is that really what you want? > > What easyRNASeq is using as my chromosome names is different from my Illumina file (see below). I?m sure this is the source of my error, I just don?t know how to change the names and fix this. The file I am dealing with from the sequencing core is aligned I am told as such ?The alignment was run in a splice-aware fashion using as input the mm9 build of the mouse genome from UCSC as provided by Illumina.? > > > From my Illumina export/fastQ file read in using ReadAlign: > > > table(chromosome(aln)) > > chr1.fa chr10.fa chr11.fa chr12.fa chr13.fa chr14.fa > 285307 221943 395265 115402 108108 127673 > chr15.fa chr16.fa chr17.fa chr18.fa chr19.fa chr2.fa > 159310 143801 201871 83037 156612 304649 > chr3.fa chr4.fa chr5.fa chr6.fa chr7.fa chr8.fa > 168498 222071 234236 229869 297021 194308 > chr9.fa chrX.fa chrY.fa NM QC RM > 264232 124942 3 1009813 5877 1074323 > splice_sites-auto.fa > 804227 > > > > > as.list(seqlengths(Mmusculus)) > > > $chr1 > [1] 197195432 > >$chr2 > [1] 181748087 > > $chr3 > [1] 159599783 > >$chr4 > [1] 155630120 > > $chr5 > [1] 152537259 > >$chr6 > [1] 149517037 > > $chr7 > [1] 152524553 > >$chr8 > [1] 131738871 > > $chr9 > [1] 124076172 > >$chr10 > [1] 129993255 > > $chr11 > [1] 121843856 > >$chr12 > [1] 121257530 > > $chr13 > [1] 120284312 > >$chr14 > [1] 125194864 > > $chr15 > [1] 103494974 > >$chr16 > [1] 98319150 > > $chr17 > [1] 95272651 > >$chr18 > [1] 90772031 > > $chr19 > [1] 61342430 > >$chrX > [1] 166650296 > > $chrY > [1] 15902555 > >$chrM > [1] 16299 > > $chr1_random > [1] 1231697 > >$chr3_random > [1] 41899 > > $chr4_random > [1] 160594 > >$chr5_random > [1] 357350 > > $chr7_random > [1] 362490 > >$chr8_random > [1] 849593 > > $chr9_random > [1] 449403 > >$chr13_random > [1] 400311 > > $chr16_random > [1] 3994 > >$chr17_random > [1] 628739 > > $chrX_random > [1] 1785075 > >$chrY_random > [1] 58682461 > > $chrUn_random > [1] 5900358 > > > > ADD COMMENT 0 Entering edit mode @delhommeemblde-3232 Last seen 7.1 years ago Dear Wade, I've committed two changes to easyRNASeq that should solve your first question. The new package 1.1.8 should be available after the next build in bioconductor, in a day or two. You can get the changes from svn directly and install from source if you prefer, see http://bioconductor.org/developers/source-control/. If you're not familiar with that, I could send you a windows zip archive of the package, just let me know. Summarizing by geneModels should work on windows now. I'd be glad if you could run these two tests for me and let me know if they succeed. library("easyRNASeq") library("RnaSeqTutorial") library(BSgenome.Dmelanogaster.UCSC.dm3) rnaSeq <- easyRNASeq(system.file( "extdata", package="RnaSeqTutorial"), organism="Dmelanogaster", chr.sizes=as.list(seqlengths(Dmelanogaster)), readLength=36L, annotationMethod="rda", annotationFile=system.file( "data", "gAnnot.rda", package="RnaSeqTutorial"), format="bam", count="genes", summarization="geneModels", pattern="[A,C,T,G]{6}\\.bam$", outputFormat="RNAseq") rnaSeqPar <- easyRNASeq(system.file( "extdata", package="RnaSeqTutorial"), organism="Dmelanogaster", chr.sizes=as.list(seqlengths(Dmelanogaster)), readLength=36L, annotationMethod="rda", annotationFile=system.file( "data", "gAnnot.rda", package="RnaSeqTutorial"), format="bam", count="genes", summarization="geneModels", pattern="[A,C,T,G]{6}\\.bam", outputFormat="RNAseq", nbCore=2) Cheers, Nico --------------------------------------------------------------- Nicolas Delhomme Genome Biology Computational Support European Molecular Biology Laboratory Tel: +49 6221 387 8310 Email: nicolas.delhomme at embl.de Meyerhofstrasse 1 - Postfach 10.2209 69102 Heidelberg, Germany --------------------------------------------------------------- On 1 Mar 2012, at 17:28, Nicolas Delhomme wrote: > Dear Wade, > > Thanks for the nice words, that's good to hear! Although I have to say it got me a little worried, because the more people praise, the more difficult the questions to come ;-) > > I added the bioconductor mailing list in Cc, so that it benefits others. > > The first error, I was surprised to see. From the parallel package documentation, I did not expect it, but it got well extended since I started using it and is now obvious that the MakeForkCluster would not work on windows. As I don't have access to any Windows machine, I would like to give you the code that I'll write to replace that per email (off list) so that you can try it and tell me if it fixes the issue. Would that be OK? > > > For the second error, you are correct. Note that it is NOT necessary to use the BSgenome for easyRNASeq, that's just that as they are available in Bioconductor it's easy to use them for exemplifying, but they are not mandatory. I need to state that more explicitly in the vignette. > Back to your issue, I'm expecting in the alignment file and in the chr.sizes object chromosome names that follow UCSC conventions, e.g. chr1 to chr19 and chrX, chrY and chrM for the M.musculus for example. If it's not what I get, I try to "guess" them but obviously, since every sequencing facility has its own convention, it is not a straightforward business. For that reason, I've implemented a pass-by. To use it, we need to do the following > > 1) create a data.frame that contains the mapping from your different input. As you're using biomaRt as an annotationMethod, we will have three sets of chromosome names: > > a) biomaRT: 1:19,"X","Y" > b) your alignment file: chr1.fa to chr19.fa, chrX.fa, chrY.fa, etc... > c) the chromosome file: chr1 to chr19, chrX, chrY, etc... > > that we need to combine together. Since c) follows the UCSC convention, we'll use that as our standard. I'm not sure what you want to do about the NM, QC, RM and splice_sites-auto.fa, so I ignore them. From the name, you probably don't want to ignore that last one, but I'm not sure how you can use it. Here we go > > chr.map <- data.frame(from=c(1:19,"X","Y",paste("chr",c(1:19,"X","Y" ),".fa",sep="")),to=rep(paste("chr",c(1:19,"X","Y"),sep=""),2)) > > 2) rerun you command with the chr.map argument and a "custom" organism argument > rnaSeq2 <- easyRNASeq(filesDirectory=extdataDir, > organism="custom", > chr.sizes=chr.sizes, > filter=myfilt, > readLength=50L, > annotationMethod="biomaRt", > format="aln", > count="exons", > pattern="^1-Feb_ATCACG_L003_R1_001_export.txt.gz", > outputFormat="RNAseq", > chr.map=chr.map > ) > > Note that I haven't tried that myself, I always had a least two set of chromosomes in agreement, so it might fail. If so, it would be great if I could get access to an excerpt of your data to try it out. > > Finally, note that there's another way to by-pass this issue, if you are not using biomaRt but a local annotation source (gff, gtf, rda, etc...). If this file would contain the same chromosome names as your alignment file, you could do the following: > > 1) change the chromosome name in the chr.sizes to add a ".fa" at the end > chr.sizes <- as.list(seqlengths(Mmusculus)) > names(chr.sizes) <- paste(names(chr.sizes),"fa",sep=".") > > 2) rerun your command with that new chr.sizes > and precise validity.check=FALSE in the easyRNASeq command > > rnaSeq2 <- easyRNASeq(filesDirectory=extdataDir, > organism="Mmusculus", > chr.sizes=chr.sizes, > filter=myfilt, > readLength=50L, > annotationMethod="biomaRt", > format="aln", > count="exons", > pattern="^1-Feb_ATCACG_L003_R1_001_export.txt.gz$", > outputFormat="RNAseq", > validity.check=FALSE > ) > > You need to be careful about what you're doing there, since you shut off all the consistency check on the chromosome naming. > > > Cheers, > > Nico > > > --------------------------------------------------------------- > Nicolas Delhomme > > Genome Biology Computational Support > > European Molecular Biology Laboratory > > Tel: +49 6221 387 8310 > Email: nicolas.delhomme at embl.de > Meyerhofstrasse 1 - Postfach 10.2209 > 69102 Heidelberg, Germany > --------------------------------------------------------------- > > > > > > On 29 Feb 2012, at 22:07, Davis, Wade wrote: > >> Dear Nicolas, >> First , I want to thank you for the hard work you have put in making the easyRNASeq package. I am very familiar with R and BioConductor, and NGS statistical models. But I have been learning over the past month how to deal with raw NGS for the first time, and the number of different ways and packages is a bit overwhelming. Fortunately, I discovered your package, and it has simplified things greatly and helped me understand the workflow better to getting to a count table. So once again, thank you! I am sure you will have many users in the coming months. >> >> I have 2 questions. The first relates to the error below which seems to come from the summarization method I am using, which must involve trying to do some work in parallel. However, since the default nbCore is 1, I am surprised it still tries to use multiple processers (I am on a 8 core machine). Is there anything I can do about this? I really want to use the geneModels.. >> My second question follows the output below. >> >> Thanks for your help and time! >> >> J. Wade Davis, PhD >> Associate Professor, Department of Health Management and Informatics >> Co-Director, Biostatistics and Research Design Unit >> 187 Galena DC 018.0 >> University of Missouri >> Columbia, MO 65212 >> Phone: (573) 882-0770 >> Fax: (573) 884-4196 >> www.biostat.missouri.edu >> >> >> >>> # run this code only the first time this way to get the annotations; >>> # run the subsequent code for future use.... >>> rnaSeq2 <- easyRNASeq(filesDirectory=extdataDir, >> + organism="Mmusculus", >> + chr.sizes=chr.sizes, >> + filter=myfilt, >> + readLength=50L, >> + annotationMethod="biomaRt", >> + # format="SolexaExport", >> + format="aln", >> + #withId=TRUE, >> + count="genes", >> + summarization= "geneModels", >> + #count="exons", #generates error message >> + #filenames=fastqfilenames[1], >> + pattern="^1-Feb_ATCACG_L003_R1_001_export.txt.gz$", >> + outputFormat="RNAseq" >> + ) >> Checking arguments... >> Fetching annotations... >> Computing gene models... >> Error in makeForkCluster(nnodes = nbCore) : >> fork clusters are not supported on Windows >>> sessionInfo() >> R Under development (unstable) (2012-02-28 r58513) >> Platform: x86_64-pc-mingw32/x64 (64-bit) >> >> locale: >> [1] LC_COLLATE=English_United States.1252 LC_CTYPE=English_United States.1252 LC_MONETARY=English_United States.1252 LC_NUMERIC=C >> [5] LC_TIME=English_United States.1252 >> >> attached base packages: >> [1] parallel stats graphics grDevices utils datasets methods base >> >> other attached packages: >> [1] RnaSeqTutorial_0.0.7 BSgenome.Dmelanogaster.UCSC.dm3_1.3.17 BSgenome.Mmusculus.UCSC.mm9_1.3.17 easyRNASeq_1.1.7 >> [5] ShortRead_1.13.13 latticeExtra_0.6-20 RColorBrewer_1.0-5 Rsamtools_1.7.37 >> [9] DESeq_1.7.6 locfit_1.5-6 lattice_0.20-0 akima_0.5-7 >> [13] BSgenome_1.23.2 GenomicRanges_1.7.28 Biostrings_2.23.6 IRanges_1.13.26 >> [17] edgeR_2.5.9 limma_3.11.15 biomaRt_2.11.1 Biobase_2.15.3 >> [21] BiocGenerics_0.1.7 genomeIntervals_1.11.0 intervals_0.13.3 >> >> loaded via a namespace (and not attached): >> [1] annotate_1.33.2 AnnotationDbi_1.17.22 bitops_1.0-4.1 DBI_0.2-5 genefilter_1.37.0 geneplotter_1.33.1 grid_2.15.0 >> [8] hwriter_1.3 RCurl_1.91-1.1 RSQLite_0.11.1 splines_2.15.0 survival_2.36-12 tools_2.15.0 XML_3.9-4.1 >> [15] xtable_1.7-0 zlibbioc_1.1.1 >>> library(parallel) >>> ?makeForkClsuter >> No documentation for ?makeForkClsuter? in specified packages and libraries: >> you could try ???makeForkClsuter? >> >> >> My second question is about a different error message, which seems very close to one you dealt with on the mailing list a few weeks ago, but the person never said if it was resolved: >> >> http://thread.gmane.org/gmane.science.biology.informatics.conductor /38841/focus=38858 >> >> Here is my code and the error: >> >> >> rnaSeq2 <- easyRNASeq(filesDirectory=extdataDir, >> + organism="Mmusculus", >> + chr.sizes=chr.sizes, >> + filter=myfilt, >> + readLength=50L, >> + annotationMethod="biomaRt", >> + format="aln", >> + count="exons", >> + pattern="^1-Feb_ATCACG_L003_R1_001_export.txt.gz", >> + outputFormat="RNAseq" >> + ) >> Checking arguments... >> Fetching annotations... >> Summarizing counts... >> Processing 1-Feb_ATCACG_L003_R1_001_export.txt.gz >> Error in FUN(c("chr1.fa", "chr10.fa", "chr11.fa", "chr12.fa", "chr13.fa", : >> (list) object cannot be coerced to type 'integer' >> In addition: Warning message: >> In easyRNASeq(filesDirectory = extdataDir, organism = "Mmusculus", : >> There are 429316 features/exons defined in your annotation that overlap! This implies that some reads will be counted more than once! Is that really what you want? >> >> What easyRNASeq is using as my chromosome names is different from my Illumina file (see below). I?m sure this is the source of my error, I just don?t know how to change the names and fix this. The file I am dealing with from the sequencing core is aligned I am told as such ?The alignment was run in a splice-aware fashion using as input the mm9 build of the mouse genome from UCSC as provided by Illumina.? >> >> >> From my Illumina export/fastQ file read in using ReadAlign: >> >> >> table(chromosome(aln)) >> >> chr1.fa chr10.fa chr11.fa chr12.fa chr13.fa chr14.fa >> 285307 221943 395265 115402 108108 127673 >> chr15.fa chr16.fa chr17.fa chr18.fa chr19.fa chr2.fa >> 159310 143801 201871 83037 156612 304649 >> chr3.fa chr4.fa chr5.fa chr6.fa chr7.fa chr8.fa >> 168498 222071 234236 229869 297021 194308 >> chr9.fa chrX.fa chrY.fa NM QC RM >> 264232 124942 3 1009813 5877 1074323 >> splice_sites-auto.fa >> 804227 >> >> >> >> >> as.list(seqlengths(Mmusculus)) >> >> >>chr1 >> [1] 197195432 >> >> $chr2 >> [1] 181748087 >> >>$chr3 >> [1] 159599783 >> >> $chr4 >> [1] 155630120 >> >>$chr5 >> [1] 152537259 >> >> $chr6 >> [1] 149517037 >> >>$chr7 >> [1] 152524553 >> >> $chr8 >> [1] 131738871 >> >>$chr9 >> [1] 124076172 >> >> $chr10 >> [1] 129993255 >> >>$chr11 >> [1] 121843856 >> >> $chr12 >> [1] 121257530 >> >>$chr13 >> [1] 120284312 >> >> $chr14 >> [1] 125194864 >> >>$chr15 >> [1] 103494974 >> >> $chr16 >> [1] 98319150 >> >>$chr17 >> [1] 95272651 >> >> $chr18 >> [1] 90772031 >> >>$chr19 >> [1] 61342430 >> >> $chrX >> [1] 166650296 >> >>$chrY >> [1] 15902555 >> >> $chrM >> [1] 16299 >> >>$chr1_random >> [1] 1231697 >> >> $chr3_random >> [1] 41899 >> >>$chr4_random >> [1] 160594 >> >> $chr5_random >> [1] 357350 >> >>$chr7_random >> [1] 362490 >> >> $chr8_random >> [1] 849593 >> >>$chr9_random >> [1] 449403 >> >> $chr13_random >> [1] 400311 >> >>$chr16_random >> [1] 3994 >> >> $chr17_random >> [1] 628739 >> >>$chrX_random >> [1] 1785075 >> >> $chrY_random >> [1] 58682461 >> >>$chrUn_random >> [1] 5900358 >> >> >> >> > > _______________________________________________ > Bioconductor mailing list > Bioconductor at r-project.org > https://stat.ethz.ch/mailman/listinfo/bioconductor > Search the archives: http://news.gmane.org/gmane.science.biology.informatics.conductor
0
Entering edit mode
@delhommeemblde-3232
Last seen 7.1 years ago
Hi Wade, I'll send you the package off-list. And discuss how we can do for the data as it's still interesting for me to enhance my chromosome name validity checks. Cheers, Nico --------------------------------------------------------------- Nicolas Delhomme Genome Biology Computational Support European Molecular Biology Laboratory Tel: +49 6221 387 8310 Email: nicolas.delhomme at embl.de Meyerhofstrasse 1 - Postfach 10.2209 69102 Heidelberg, Germany --------------------------------------------------------------- On 1 Mar 2012, at 23:14, Davis, Wade wrote: > Hi Nico, > If you could send me a windows zip that I could try out the changes now instead of waiting for them to get pushed through in a few days. But I don't mind waiting, whatever is easiest for you. > > I've been in a workshop all days, so I haven't gotten a chance to try out the suggestions for the second problem, but I will do so and get back to do. > > Also, you asked for a sample file to help with the second problem. I'm not sure if you still want or need that, but below is the first 60 lines or so. I can send you an entire file (compressed which is about 455MB) if you think that would be helpful. > > Thanks! > > Wade > > HWI-ST538 160 7 1101 1635 1954 ACAGTG 1 GATATCTNNNNNNGAAGAAAACGAGGGAGATTCCAAGGCCGAGCAGCTACACCNGGTACAGAAGCAGTCT GCCACTTTGCCCTCGCCCATCACCCACTCC > BBBBBBBBBBBBBBBBBBBBBBBBBBBBBBBBBBBBBBBBBBBBBBBBBBBBBBBBBBBBBBBBBBBB BBBBBBBBBBBBBBBBBBBBBBBBBBBBBBBB QC N > HWI-ST538 160 7 1101 1661 1972 ACAGTG 1 GGATAGCATNNNAAATGTAAATGAGGAAAATACCTAATAAAAAATATTAAAAAAAAAAAAGACACCGCCA GGAACTTGTGAGAAATCCAAATTCTACATT > [[[______BBBQQX_^^^^_^___^]^_^^^^[^^^^^^]^^^^^^^^^^^^^^^^BBBBBBBBBBB BBBBBBBBBBBBBBBBBBBBBBBBBBBBBBBB QC N > HWI-ST538 160 7 1101 1949 1944 ACAGTG 1 AGGGGTTNNNNNNTTCAAGACCTCTCCCCTGNNGAGAAAGAAACCCAAGCAGNNGCCTGTGGGAGGGAAC ACGGACCTGACACCTCTGCCTCAGACTCCT > [[XZY_^BBBBBBQQ[\Z\]_[[Z^^]^^[^BBOL[\X^^PY]ZX\^BBBBBBBBBBBBBBBBBBBBB BBBBBBBBBBBBBBBBBBBBBBBBBBBBBBBB QC N > HWI-ST538 160 7 1101 1820 1948 ACAGTG 1 GTCAGCTNNNNNNAGGGTCCGATCGTATTGGANGAGGATATCTTTGATCTCCNNCTTGGAGGCTTCATCC ACATATTTTTGGTAATAAGCTACCAGGGCC > [[[^___BBBBBBRS]_]^__^__^^^^^^^^BO[\^^^^^^^^^\^^^^^]BBLW\\^]_^X^^]\] ]]]^^]]]]]^$VZZ\\\\Y[\\\[\[[Y[[ QC N > HWI-ST538 160 7 1101 1773 1957 ACAGTG 1 CTTTGGTTNNNNNGACCTCGCTTGTAGCCAGCAAAAATGGCCTTGCACCACAGCCTTCCAGACATACTTG CTGTTAGAAGTCCTGAGATCGGAAGAGCAC > WZZ_[$_BBBBBRQY_^^\Z^__^^^^^^]^]^]^^^^^^]^^^]]X]]]^^^^^^^___[^^]]^^ ^]^^W\\]]]\]V\\\][[[\$[Z[Y[[WZW QC N > HWI-ST538 160 7 1101 1971 1962 ACAGTG 1 CTTTGTTTNNNNTCTGGTCACAATAGTTGGGGGCAGAAAAGACAGTGACACAGCGGCCACCATGGGCCAC CTCGTAGCCCTCGGCTTTGACTTCATGGCT > [[[_____BBBBRR_____^__^________^^^^^^^^^^^^^^^^^^^^^^[^^\\[[^[[[\[[[ [[[[[[W[[[\\[[WY[[U[YT[[\[YZSYW[ QC N > HWI-ST538 160 7 1101 1878 1976 ACAGTG 1 CCGCTGTCTTNCTCAGGAAGCACGCATTTCCACAGGCCATAGGGCTTTCCTACCACAGTTTGCCACAATC TCAATGTCCCATCTTCAGAACCGCTGGCAT > ^W^\cc\^BQQJY[YRb^_dUccaSccaIYHOWOXRacQaaQabd\bbcbbcc\^bccc]Z^c _]R_UZZ]Y\ZT]]BBBBBBBBBBBB RM N > HWI-ST538 160 7 1101 1934 1999 ACAGTG 1 CTCCACTGAAGTACTTCAGGTAGTAGTCCATGACGCCCTTGTCGTCGGTTCGCGACTCCTCATCACTGTC CCCGGCCCGGCCCACCGATGACATCATCTG > bbbeeeeegggeghiiiiiigghgfiiiiihihiihiihiihhfghdgZeefhadeecdbdbbccbb bbaQX[acX[[acO[aOEHOGYRY]YR] chr10.fa 80607676 R 100 500 Y > HWI-ST538 160 7 1101 2013 1948 ACAGTG 1 CACATCTNNNNNNCCATGGAGCAGATTGAAGANGAAATCAAAGGTTGTTTGGNNTTTCTTCGCACAGTAT ATAGTGTCTTTGGATTTTCATTTAAATTGA > [[[___^BBBBBBRSX_____^_^_^_^__^^BP[^^^^^^^^^\^^^^^^[BBLW\^^__]^^^^^Z$]]]^\Z\]]^^][[\\\\\\\\\\\$\Y[ QC N > HWI-ST538 160 7 1101 2064 1957 ACAGTG 1 AGAAAGTTNNNNNATAGAAGATGTATAAGTTGATCGTAACGGAAGCGTGGATAAGATGCTCGGATCCATA GGAATGTTGATGATAGTAGTAGAGCTTCTA > [[[[^[\_BBBBBQQ\[\^]__^[_]_^^^_^^__^[^^[[W^^[^^\^^\^^^]^^^___[^[XW]Z ]^^BBBBBBBBBBBBBBBBBBBBBBBBBBBBB QC N > HWI-ST538 160 7 1101 2129 1991 ACAGTG 1 GTAATATTCATTTAGCCTTCTGAGCTTTCTGGGCAGACTTGGTGACTTTGCCAGCTCCAGCAGCCTTCTT GTCCACAGCTTTGATGACACCCACAGCAAC > ___cecceggfdhfhfgbfbefdghggfdff]eZegg]eIYYYceg^cedfce^ebgfce_ccfh dfdZbdbVdb]Za^ZaZ]__bbaa_PTGT^ RM Y > HWI-ST538 160 7 1101 2267 1959 ACAGTG 1 CAGCACAGNNNNGATGGCTACGTACATGGCTGGGGTGTTGAAGGTCTCAAACATGATCTGGGTCATCTTT TCACGGTTGGCCTTAGGGTTCAGGGGGGCC > [[[__^^^BBBBQQ_\^[^]^^^^^[^_^[^^^^^\Z\^RU$]UY]WXZ^]Z^]Z]^_^^]ZZV^BB BBBBBBBBBBBBBBBBBBBBBBBBBBBBBBBB QC N > HWI-ST538 160 7 1101 2466 1999 ACAGTG 1 CGTACTTCTCCCACTCGCTTGTGCTCTGTCACTAGAAGGTTATTCTTGAAAGGTTCTGTGCCTTTTTCTT CAGGCTTCATCTGTCATACCTCTACTTGAT > bbaeeeeegggggiiifiiiihiiiiihhhiiihighiiafgfhiiifaehghXcghhhhbghiiiif hhfdghhgfedgeddcbeeccbdddc_bcc chr7.fa 54159980 R 100 504 Y > HWI-ST538 160 7 1101 2566 1981 ACAGTG 1 ATTGAACTCANCTCCTTTTAGGTTGCCCTACTAACTGGACATTGTGGACACTAAATCATATTGGGGTGAT GAATAATCAAATTTTCAATATCAAACAATG > abaceeeeggBQeghhiifhhiiigihifhihfhhihhiifhffdgffhifhihfhhhiihdgfgZd ^bd^ade^abddc]bc__bcdc_c_bbabb chr10.fa 66474748 R 10G89 497 Y > HWI-ST538 160 7 1101 2943 1957 ACAGTG 1 TCCGAATANNNNNCGGTCAAAATAAAAGCTCAAGATGACATCAGTCCCATTTGTCCTAAGTCCTGGTGTT GTATGGATGGTAAGCAGCAGCCAATTATGG > [[[^^^^_BBBBBQQ_\_^___^_^_^^^^^^^^^^^^^^^^^^\^^^^^^^^^^^^^__\]^]^_]^ ^^^^_^_^^^^^W\]Z]]][[^[[[[W[\\$QC N > HWI-ST538 160 7 1101 2883 1967 ACAGTG 1 ATCAGACTCNNNCTGCACGTTCACACCATAAGTCTCATCAATGTTGTCATCCATATTCTGTATCTCCTTG TCGCCACCGTAGTCTGTGATCTTCTTGCCC > [[[__^^^^BBBR[T]]^^][]]^^[X][]]W^]^^^^X[Y\[YRY^[^^^^[]^^^^]^\Y]\Z^^[ [ZU\\^BBBBBBBBBBBBBBBBBBBBBBBBBB QC N > HWI-ST538 160 7 1101 3133 1960 ACAGTG 1 CTCTCAAANNNNCTACAGAGCCCAGGAAGATTTCAGGATGAAGAGCTTCCTCCTCTTCCTCACTATCATT CTTCTGGTTGTGATTCAGATACAAACAGGA > [[[^Z]_]BBBBSQ\_Z^_^_^^_^^^]_^^^^\^^^[Z]^^^^^^]^^^^^^]^]^^_]^ZZY^^_\ ]^$^[]Z\^^Z\\]\\\\\\\]]]]$[VQV QC N > HWI-ST538 160 7 1101 3023 1964 ACAGTG 1 GGCCACTCTNNNTTGTGCCTTCCACACCTTTTGGTGCAGATTGTTCAGTGGGGAGCTGCCTCATATTCTC TGACTGGCCAGAAAGGCTCTCGCTGGGGTT > [[[__^^^^BBBRQ_\_^^___^^^^_^^__^]^^^^^^^^^^^]\^^]^^^^^^^^^T]^^ZZ^\^] ^^^]\$]^BBBBBBBBBBBBBBBBBBBBBBB QC N > HWI-ST538 160 7 1101 3219 1966 ACAGTG 1 CAGATGTGTNNNAATGCTAGGTGTGGTTGGTTTATTCCTAGCGTCACTATTATCAGGCCTAGTTGGCTTG ATGTAGAGAAGGCAATGATTTTTTTGATGT > [[[^_____BBBS[]___^__[_]_^\__^Z]^^^^^^^^^^^^^^^^^^^^^^]^^^^_[^^^^___ ^[^^]]]\\]\]]^^[[\\\\\\\\YXTW[\\ QC N > HWI-ST538 160 7 1101 3047 1993 ACAGTG 1 CTCTTGTGTCCGTTTTCTTGCTGTAGCGTTTTCTCCGTACCACTCACTGATACCTGTCTATCGACTACGT GGGATGGCAGAGGCACTGCCAGGCTCCACA > bbbeeedegggfghiiiihhiihifhiieggihffh[affgcgfebb]eefdfh[bgbcdfe_fW^\] aaZaabZ^a_GWWW^^bS^W^^BBB chr1.fa 154218045 F 100 36 Y > HWI-ST538 160 7 1101 3462 1958 ACAGTG 1 CCACTCGTNNNNCTTGCGATCATTGTGCTTAGGGTCTTCTGATACATCTGGGGGAACATCTGTTTTATCT GAACAAGGTCAATCTCACTTCGAGTGACCA > [[[^^^_\BBBBQQ[ZZ^_[Y_[_]_\W^^^^[^M[]]]]\]]^^^^]\^^^FZF\\]Z^^^X\^^^] ]^]\Z\]ZOZXZZ[\\\\TZU\[ZTTW[\[\\ QC N > HWI-ST538 160 7 1101 3261 1972 ACAGTG 1 CCTGGATGTNNNCGTGGATCGGGTCTTGAACGTGGACTTGCCGCACCTTGTATCTCAAGCCCAGCGCCAC ACCCTTGCTGACTTTCTGAACACTACTGTC > BBBBBBBBBBBBBBBBBBBBBBBBBBBBBBBBBBBBBBBBBBBBBBBBBBBBBBBBBBBBBBBBBBBB BBBBBBBBBBBBBBBBBBBBBBBBBBBBBBBB QC N > HWI-ST538 160 7 1101 3390 1979 ACAGTG 1 TGCAGTTCTTNCCCGTGGCTCCTGAGCCAGCAATGTCGTCTGGGAAGCCATCATCAGCTCATTCAGCTGA GACGTGCTAGAGGTCTCCTCAGCTTGGAGC > bbbeeeeeggBRbfgfhihiiihhhhiiiiiehhichhgggfghbgedgdffhffdefffbd]bdfe dceecdZZ_bb_aL[bccb]b]b_[ chr1.fa 187111962 F 10T89 497 Y > HWI-ST538 160 7 1101 3666 1959 ACAGTG 1 CCGAACACNNNNCCCGGATCTCATCTCACAGAATCGTGGGTACGTCTGAACCAGCTGGTTGCCCCCAAAA ATTACCTGTTTCACGCCCCTGGTGCAGCCC > BBBBBBBBBBBBBBBBBBBBBBBBBBBBBBBBBBBBBBBBBBBBBBBBBBBBBBBBBBBBBBBBBBBB BBBBBBBBBBBBBBBBBBBBBBBBBBBBBBBB QC N > HWI-ST538 160 7 1101 3667 1975 ACAGTG 1 AGAACACCGANGTCCACCTTCAAGTTAGGGGAAATGGCATATTCTTCGATCGGCTTGCTGATTTGGGAAA TTTTATCCTTGACTGCTTCTGGAGCAGGAC > PZZTTP_BBBBBBBBBBBBBBBBBBBBBBBBBBBBBBBBBBBBBBBBBBBBBBBBBBBBBBBBBBBBB BBBBBBBBBBBBBBBBBBBBBBBBBBBBBBBB chr4.fa 99634649 R 10G3T10C4C19T1T13A10AG3T17 77 N > HWI-ST538 160 7 1101 3986 1969 ACAGTG 1 ATTAAATCTNNNCATGCTCTGAAAAAGGCTTTAGGTCACTCCAAGCTTGGCAGTTAACATTTGGCATGGA CACTGGTAAAACCACAATAGAGAAAGAAGT > [[[^_____BBBRY__^^^__^^^^^_^^__^^^^^^^^^]^^^^^^^]^^^^\^^^^____]^____ [^^[^^^^Z^Z\\\\^[^^^]]\[\\\\[[[R QC N > HWI-ST538 160 7 1101 4230 1977 ACAGTG 1 CGCCTCCGGGNCGTGGCACTCTGGGGCTCTGCCGTCGACATGGGCGCCGCCGCGTGGGCACCGCCACACC TGCTGCTGCGGGCGGCTTTTCTTCTTCTGC > bbbeeeeegfBQbfha_fdeedfhfdffdcfc^_aZ__fg][a[[aQQTHTXaQX[^TOWW]a aaXX^BBBBBBBBBBBBBBBBBBBBBBBBBBB chr2.fa 153067284 F 10C73T4C2G7 372 Y > HWI-ST538 160 7 1101 4363 1955 ACAGTG 1 CGAATGGANNNNNGTGTATACATCTGTGACTGAGACTGAGGCTGCACATGCAAAAAATCTTCTGAAAACA GACAGGCGGGTGCAGCTTGATGGGTCAACT > BBBBBBBBBBBBBBBBBBBBBBBBBBBBBBBBBBBBBBBBBBBBBBBBBBBBBBBBBBBBBBBBBBBB BBBBBBBBBBBBBBBBBBBBBBBBBBBBBBBB QC N > > > > > > -----Original Message----- > From: Nicolas Delhomme [mailto:delhomme at embl.de] > Sent: Thursday, March 01, 2012 2:24 PM > To: Nicolas Delhomme > Cc: Davis, Wade; Bioconductor mailing list > Subject: Re: [BioC] easyRNASeq error > > Dear Wade, > > I've committed two changes to easyRNASeq that should solve your first question. The new package 1.1.8 should be available after the next build in bioconductor, in a day or two. You can get the changes from svn directly and install from source if you prefer, see http://bioconductor.org/developers/source-control/. If you're not familiar with that, I could send you a windows zip archive of the package, just let me know. > > Summarizing by geneModels should work on windows now. I'd be glad if you could run these two tests for me and let me know if they succeed. > > library("easyRNASeq") > library("RnaSeqTutorial") > library(BSgenome.Dmelanogaster.UCSC.dm3) > > rnaSeq <- easyRNASeq(system.file( > "extdata", > package="RnaSeqTutorial"), > organism="Dmelanogaster", > chr.sizes=as.list(seqlengths(Dmelanogaster)), > readLength=36L, > annotationMethod="rda", > annotationFile=system.file( > "data", > "gAnnot.rda", > package="RnaSeqTutorial"), > format="bam", > count="genes", > summarization="geneModels", > pattern="[A,C,T,G]{6}\\.bam$", > outputFormat="RNAseq") > > rnaSeqPar <- easyRNASeq(system.file( > "extdata", > package="RnaSeqTutorial"), > organism="Dmelanogaster", > chr.sizes=as.list(seqlengths(Dmelanogaster)), > readLength=36L, > annotationMethod="rda", > annotationFile=system.file( > "data", > "gAnnot.rda", > package="RnaSeqTutorial"), > format="bam", > count="genes", > summarization="geneModels", > pattern="[A,C,T,G]{6}\\.bam$", > outputFormat="RNAseq", > nbCore=2) > > Cheers, > > Nico > > --------------------------------------------------------------- > Nicolas Delhomme > > Genome Biology Computational Support > > European Molecular Biology Laboratory > > Tel: +49 6221 387 8310 > Email: nicolas.delhomme at embl.de > Meyerhofstrasse 1 - Postfach 10.2209 > 69102 Heidelberg, Germany > --------------------------------------------------------------- > > > > > > On 1 Mar 2012, at 17:28, Nicolas Delhomme wrote: > >> Dear Wade, >> >> Thanks for the nice words, that's good to hear! Although I have to say >> it got me a little worried, because the more people praise, the more >> difficult the questions to come ;-) >> >> I added the bioconductor mailing list in Cc, so that it benefits others. >> >> The first error, I was surprised to see. From the parallel package documentation, I did not expect it, but it got well extended since I started using it and is now obvious that the MakeForkCluster would not work on windows. As I don't have access to any Windows machine, I would like to give you the code that I'll write to replace that per email (off list) so that you can try it and tell me if it fixes the issue. Would that be OK? >> >> >> For the second error, you are correct. Note that it is NOT necessary to use the BSgenome for easyRNASeq, that's just that as they are available in Bioconductor it's easy to use them for exemplifying, but they are not mandatory. I need to state that more explicitly in the vignette. >> Back to your issue, I'm expecting in the alignment file and in the >> chr.sizes object chromosome names that follow UCSC conventions, e.g. >> chr1 to chr19 and chrX, chrY and chrM for the M.musculus for example. >> If it's not what I get, I try to "guess" them but obviously, since >> every sequencing facility has its own convention, it is not a >> straightforward business. For that reason, I've implemented a pass- by. >> To use it, we need to do the following >> >> 1) create a data.frame that contains the mapping from your different input. As you're using biomaRt as an annotationMethod, we will have three sets of chromosome names: >> >> a) biomaRT: 1:19,"X","Y" >> b) your alignment file: chr1.fa to chr19.fa, chrX.fa, chrY.fa, etc... >> c) the chromosome file: chr1 to chr19, chrX, chrY, etc... >> >> that we need to combine together. Since c) follows the UCSC >> convention, we'll use that as our standard. I'm not sure what you want >> to do about the NM, QC, RM and splice_sites-auto.fa, so I ignore them. >> From the name, you probably don't want to ignore that last one, but >> I'm not sure how you can use it. Here we go >> >> chr.map <- >> data.frame(from=c(1:19,"X","Y",paste("chr",c(1:19,"X","Y"),".fa",sep=" >> ")),to=rep(paste("chr",c(1:19,"X","Y"),sep=""),2)) >> >> 2) rerun you command with the chr.map argument and a "custom" organism >> argument >> rnaSeq2 <- easyRNASeq(filesDirectory=extdataDir, >> organism="custom", >> chr.sizes=chr.sizes, >> filter=myfilt, >> readLength=50L, >> annotationMethod="biomaRt", >> format="aln", >> count="exons", >> pattern="^1-Feb_ATCACG_L003_R1_001_export.txt.gz", >> outputFormat="RNAseq", >> chr.map=chr.map >> ) >> >> Note that I haven't tried that myself, I always had a least two set of chromosomes in agreement, so it might fail. If so, it would be great if I could get access to an excerpt of your data to try it out. >> >> Finally, note that there's another way to by-pass this issue, if you are not using biomaRt but a local annotation source (gff, gtf, rda, etc...). If this file would contain the same chromosome names as your alignment file, you could do the following: >> >> 1) change the chromosome name in the chr.sizes to add a ".fa" at the >> end chr.sizes <- as.list(seqlengths(Mmusculus)) >> names(chr.sizes) <- paste(names(chr.sizes),"fa",sep=".") >> >> 2) rerun your command with that new chr.sizes and precise >> validity.check=FALSE in the easyRNASeq command >> >> rnaSeq2 <- easyRNASeq(filesDirectory=extdataDir, >> organism="Mmusculus", >> chr.sizes=chr.sizes, >> filter=myfilt, >> readLength=50L, >> annotationMethod="biomaRt", >> format="aln", >> count="exons", >> pattern="^1-Feb_ATCACG_L003_R1_001_export.txt.gz", >> outputFormat="RNAseq", >> validity.check=FALSE >> ) >> >> You need to be careful about what you're doing there, since you shut off all the consistency check on the chromosome naming. >> >> >> Cheers, >> >> Nico >> >> >> --------------------------------------------------------------- >> Nicolas Delhomme >> >> Genome Biology Computational Support >> >> European Molecular Biology Laboratory >> >> Tel: +49 6221 387 8310 >> Email: nicolas.delhomme at embl.de >> Meyerhofstrasse 1 - Postfach 10.2209 >> 69102 Heidelberg, Germany >> --------------------------------------------------------------- >> >> >> >> >> >> On 29 Feb 2012, at 22:07, Davis, Wade wrote: >> >>> Dear Nicolas, >>> First , I want to thank you for the hard work you have put in making the easyRNASeq package. I am very familiar with R and BioConductor, and NGS statistical models. But I have been learning over the past month how to deal with raw NGS for the first time, and the number of different ways and packages is a bit overwhelming. Fortunately, I discovered your package, and it has simplified things greatly and helped me understand the workflow better to getting to a count table. So once again, thank you! I am sure you will have many users in the coming months. >>> >>> I have 2 questions. The first relates to the error below which seems to come from the summarization method I am using, which must involve trying to do some work in parallel. However, since the default nbCore is 1, I am surprised it still tries to use multiple processers (I am on a 8 core machine). Is there anything I can do about this? I really want to use the geneModels.. >>> My second question follows the output below. >>> >>> Thanks for your help and time! >>> >>> J. Wade Davis, PhD >>> Associate Professor, Department of Health Management and Informatics >>> Co-Director, Biostatistics and Research Design Unit >>> 187 Galena DC 018.0 >>> University of Missouri >>> Columbia, MO 65212 >>> Phone: (573) 882-0770 >>> Fax: (573) 884-4196 >>> www.biostat.missouri.edu >>> >>> >>> >>>> # run this code only the first time this way to get the annotations; >>>> # run the subsequent code for future use.... >>>> rnaSeq2 <- easyRNASeq(filesDirectory=extdataDir, >>> + organism="Mmusculus", >>> + chr.sizes=chr.sizes, >>> + filter=myfilt, >>> + readLength=50L, >>> + annotationMethod="biomaRt", >>> + # format="SolexaExport", >>> + format="aln", >>> + #withId=TRUE, >>> + count="genes", >>> + summarization= "geneModels", >>> + #count="exons", #generates error message >>> + #filenames=fastqfilenames[1], >>> + pattern="^1-Feb_ATCACG_L003_R1_001_export.txt.gz$", >>> + outputFormat="RNAseq" >>> + ) >>> Checking arguments... >>> Fetching annotations... >>> Computing gene models... >>> Error in makeForkCluster(nnodes = nbCore) : >>> fork clusters are not supported on Windows >>>> sessionInfo() >>> R Under development (unstable) (2012-02-28 r58513) >>> Platform: x86_64-pc-mingw32/x64 (64-bit) >>> >>> locale: >>> [1] LC_COLLATE=English_United States.1252 LC_CTYPE=English_United States.1252 LC_MONETARY=English_United States.1252 LC_NUMERIC=C >>> [5] LC_TIME=English_United States.1252 >>> >>> attached base packages: >>> [1] parallel stats graphics grDevices utils datasets methods base >>> >>> other attached packages: >>> [1] RnaSeqTutorial_0.0.7 BSgenome.Dmelanogaster.UCSC.dm3_1.3.17 BSgenome.Mmusculus.UCSC.mm9_1.3.17 easyRNASeq_1.1.7 >>> [5] ShortRead_1.13.13 latticeExtra_0.6-20 RColorBrewer_1.0-5 Rsamtools_1.7.37 >>> [9] DESeq_1.7.6 locfit_1.5-6 lattice_0.20-0 akima_0.5-7 >>> [13] BSgenome_1.23.2 GenomicRanges_1.7.28 Biostrings_2.23.6 IRanges_1.13.26 >>> [17] edgeR_2.5.9 limma_3.11.15 biomaRt_2.11.1 Biobase_2.15.3 >>> [21] BiocGenerics_0.1.7 genomeIntervals_1.11.0 intervals_0.13.3 >>> >>> loaded via a namespace (and not attached): >>> [1] annotate_1.33.2 AnnotationDbi_1.17.22 bitops_1.0-4.1 DBI_0.2-5 genefilter_1.37.0 geneplotter_1.33.1 grid_2.15.0 >>> [8] hwriter_1.3 RCurl_1.91-1.1 RSQLite_0.11.1 splines_2.15.0 survival_2.36-12 tools_2.15.0 XML_3.9-4.1 >>> [15] xtable_1.7-0 zlibbioc_1.1.1 >>>> library(parallel) >>>> ?makeForkClsuter >>> No documentation for 'makeForkClsuter' in specified packages and libraries: >>> you could try '??makeForkClsuter' >>> >>> >>> My second question is about a different error message, which seems very close to one you dealt with on the mailing list a few weeks ago, but the person never said if it was resolved: >>> >>> http://thread.gmane.org/gmane.science.biology.informatics.conductor/3 >>> 8841/focus=38858 >>> >>> Here is my code and the error: >>> >>> >>> rnaSeq2 <- easyRNASeq(filesDirectory=extdataDir, >>> + organism="Mmusculus", >>> + chr.sizes=chr.sizes, >>> + filter=myfilt, >>> + readLength=50L, >>> + annotationMethod="biomaRt", >>> + format="aln", >>> + count="exons", >>> + pattern="^1-Feb_ATCACG_L003_R1_001_export.txt.gz$", >>> + outputFormat="RNAseq" >>> + ) >>> Checking arguments... >>> Fetching annotations... >>> Summarizing counts... >>> Processing 1-Feb_ATCACG_L003_R1_001_export.txt.gz >>> Error in FUN(c("chr1.fa", "chr10.fa", "chr11.fa", "chr12.fa", "chr13.fa", : >>> (list) object cannot be coerced to type 'integer' >>> In addition: Warning message: >>> In easyRNASeq(filesDirectory = extdataDir, organism = "Mmusculus", : >>> There are 429316 features/exons defined in your annotation that overlap! This implies that some reads will be counted more than once! Is that really what you want? >>> >>> What easyRNASeq is using as my chromosome names is different from my Illumina file (see below). I'm sure this is the source of my error, I just don't know how to change the names and fix this. The file I am dealing with from the sequencing core is aligned I am told as such "The alignment was run in a splice-aware fashion using as input the mm9 build of the mouse genome from UCSC as provided by Illumina." >>> >>> >>> From my Illumina export/fastQ file read in using ReadAlign: >>> >>> >>> table(chromosome(aln)) >>> >>> chr1.fa chr10.fa chr11.fa chr12.fa chr13.fa chr14.fa >>> 285307 221943 395265 115402 108108 127673 >>> chr15.fa chr16.fa chr17.fa chr18.fa chr19.fa chr2.fa >>> 159310 143801 201871 83037 156612 304649 >>> chr3.fa chr4.fa chr5.fa chr6.fa chr7.fa chr8.fa >>> 168498 222071 234236 229869 297021 194308 >>> chr9.fa chrX.fa chrY.fa NM QC RM >>> 264232 124942 3 1009813 5877 1074323 >>> splice_sites-auto.fa >>> 804227 >>> >>> >>> >>> >>> as.list(seqlengths(Mmusculus)) >>> >>> >>> $chr1 >>> [1] 197195432 >>> >>>$chr2 >>> [1] 181748087 >>> >>> $chr3 >>> [1] 159599783 >>> >>>$chr4 >>> [1] 155630120 >>> >>> $chr5 >>> [1] 152537259 >>> >>>$chr6 >>> [1] 149517037 >>> >>> $chr7 >>> [1] 152524553 >>> >>>$chr8 >>> [1] 131738871 >>> >>> $chr9 >>> [1] 124076172 >>> >>>$chr10 >>> [1] 129993255 >>> >>> $chr11 >>> [1] 121843856 >>> >>>$chr12 >>> [1] 121257530 >>> >>> $chr13 >>> [1] 120284312 >>> >>>$chr14 >>> [1] 125194864 >>> >>> $chr15 >>> [1] 103494974 >>> >>>$chr16 >>> [1] 98319150 >>> >>> $chr17 >>> [1] 95272651 >>> >>>$chr18 >>> [1] 90772031 >>> >>> $chr19 >>> [1] 61342430 >>> >>>$chrX >>> [1] 166650296 >>> >>> $chrY >>> [1] 15902555 >>> >>>$chrM >>> [1] 16299 >>> >>> $chr1_random >>> [1] 1231697 >>> >>>$chr3_random >>> [1] 41899 >>> >>> $chr4_random >>> [1] 160594 >>> >>>$chr5_random >>> [1] 357350 >>> >>> $chr7_random >>> [1] 362490 >>> >>>$chr8_random >>> [1] 849593 >>> >>> $chr9_random >>> [1] 449403 >>> >>>$chr13_random >>> [1] 400311 >>> >>> $chr16_random >>> [1] 3994 >>> >>>$chr17_random >>> [1] 628739 >>> >>> $chrX_random >>> [1] 1785075 >>> >>>$chrY_random >>> [1] 58682461 >>> >>> $chrUn_random >>> [1] 5900358 >>> >>> >>> >>> >> >> _______________________________________________ >> Bioconductor mailing list >> Bioconductor at r-project.org >> https://stat.ethz.ch/mailman/listinfo/bioconductor >> Search the archives: >> http://news.gmane.org/gmane.science.biology.informatics.conductor > ADD COMMENT 0 Entering edit mode @delhommeemblde-3232 Last seen 7.1 years ago Hi Wade, Great! Thanks for the feedback! Let me know for the other issue once you've got the time to try. Cheers, Nico --------------------------------------------------------------- Nicolas Delhomme Genome Biology Computational Support European Molecular Biology Laboratory Tel: +49 6221 387 8310 Email: nicolas.delhomme at embl.de Meyerhofstrasse 1 - Postfach 10.2209 69102 Heidelberg, Germany --------------------------------------------------------------- On 2 Mar 2012, at 15:33, Davis, Wade wrote: > Hi Nico, > Everything worked fine on your examples. Please see below for output. > > Thanks, > Wade > > > R Under development (unstable) (2012-02-28 r58513) > Copyright (C) 2012 The R Foundation for Statistical Computing > ISBN 3-900051-07-0 > Platform: x86_64-pc-mingw32/x64 (64-bit) > > R is free software and comes with ABSOLUTELY NO WARRANTY. > You are welcome to redistribute it under certain conditions. > Type 'license()' or 'licence()' for distribution details. > > Natural language support but running in an English locale > > R is a collaborative project with many contributors. > Type 'contributors()' for more information and > 'citation()' on how to cite R or R packages in publications. > > Type 'demo()' for some demos, 'help()' for on-line help, or > 'help.start()' for an HTML browser interface to help. > Type 'q()' to quit R. > >> utils:::menuInstallLocal() >> utils:::menuInstallLocal() >> libary(easyRNASeq) > Error: could not find function "libary" >> library(easyRNASeq) > Loading required package: parallel > Loading required package: genomeIntervals > Loading required package: intervals > Loading required package: Biobase > Loading required package: BiocGenerics > > Attaching package: 'BiocGenerics' > > The following object(s) are masked from 'package:stats': > > xtabs > > The following object(s) are masked from 'package:base': > > anyDuplicated, cbind, colnames, duplicated, eval, Filter, Find, > intersect, lapply, Map, mapply, order, paste, pmax, pmax.int, pmin, > pmin.int, Position, rbind, Reduce, rep.int, rownames, sapply, > setdiff, table, tapply, union, unique > > Welcome to Bioconductor > > Vignettes contain introductory material; view with > 'browseVignettes()'. To cite Bioconductor, see > 'citation("Biobase")', and for packages 'citation("pkgname")'. > > Loading required package: biomaRt > Loading required package: edgeR > Loading required package: limma > Loading required package: Biostrings > Loading required package: IRanges > > Attaching package: 'IRanges' > > The following object(s) are masked from 'package:intervals': > > reduce > > > Attaching package: 'Biostrings' > > The following object(s) are masked from 'package:intervals': > > type > > Loading required package: BSgenome > Loading required package: GenomicRanges > > Attaching package: 'GenomicRanges' > > The following object(s) are masked from 'package:genomeIntervals': > > strand, strand<- > > Loading required package: DESeq > Loading required package: locfit > Loading required package: akima > Loading required package: lattice > locfit 1.5-6 2010-01-20 > Loading required package: Rsamtools > Loading required package: ShortRead > Loading required package: latticeExtra > Loading required package: RColorBrewer > > Attaching package: 'ShortRead' > > The following object(s) are masked from 'package:locfit': > > left, right > > Warning messages: > 1: replacing previous import 'coerce' when loading 'intervals' > 2: replacing previous import 'initialize' when loading 'intervals' >> sessionInfo() > R Under development (unstable) (2012-02-28 r58513) > Platform: x86_64-pc-mingw32/x64 (64-bit) > > locale: > [1] LC_COLLATE=English_United States.1252 LC_CTYPE=English_United States.1252 > [3] LC_MONETARY=English_United States.1252 LC_NUMERIC=C > [5] LC_TIME=English_United States.1252 > > attached base packages: > [1] parallel stats graphics grDevices utils datasets methods base > > other attached packages: > [1] easyRNASeq_1.1.8 ShortRead_1.13.13 latticeExtra_0.6-20 RColorBrewer_1.0-5 > [5] Rsamtools_1.7.37 DESeq_1.7.6 locfit_1.5-6 lattice_0.20-0 > [9] akima_0.5-7 BSgenome_1.23.2 GenomicRanges_1.7.28 Biostrings_2.23.6 > [13] IRanges_1.13.26 edgeR_2.5.9 limma_3.11.15 biomaRt_2.11.1 > [17] Biobase_2.15.3 BiocGenerics_0.1.7 genomeIntervals_1.11.0 intervals_0.13.3 > > loaded via a namespace (and not attached): > [1] annotate_1.33.2 AnnotationDbi_1.17.22 bitops_1.0-4.1 DBI_0.2-5 genefilter_1.37.0 > [6] geneplotter_1.33.1 grid_2.15.0 hwriter_1.3 RCurl_1.91-1.1 RSQLite_0.11.1 > [11] splines_2.15.0 survival_2.36-12 tools_2.15.0 XML_3.9-4.1 xtable_1.7-0 > [16] zlibbioc_1.1.1 >> library("easyRNASeq") >> library("RnaSeqTutorial") >> library(BSgenome.Dmelanogaster.UCSC.dm3) >> >> rnaSeq <- easyRNASeq(system.file( > + "extdata", > + package="RnaSeqTutorial"), > + organism="Dmelanogaster", > + chr.sizes=as.list(seqlengths(Dmelanogaster)), > + readLength=36L, > + annotationMethod="rda", > + annotationFile=system.file( > + "data", > + "gAnnot.rda", > + package="RnaSeqTutorial"), > + format="bam", > + count="genes", > + summarization="geneModels", > + pattern="[A,C,T,G]{6}\\.bam$", > + outputFormat="RNAseq") > Checking arguments... > Fetching annotations... > Computing gene models... > Summarizing counts... > Processing ACACTG.bam > Processing ACTAGC.bam > Processing ATGGCT.bam > Processing TTGCGA.bam > Preparing output > Warning message: > In easyRNASeq(system.file("extdata", package = "RnaSeqTutorial"), : > There are 2238 synthetic exons as determined from your annotation that overlap! This implies that some reads will be counted more than once! Is that really what you want? >> rnaSeqPar <- easyRNASeq(system.file( > + "extdata", > + package="RnaSeqTutorial"), > + organism="Dmelanogaster", > + chr.sizes=as.list(seqlengths(Dmelanogaster)), > + readLength=36L, > + annotationMethod="rda", > + annotationFile=system.file( > + "data", > + "gAnnot.rda", > + package="RnaSeqTutorial"), > + format="bam", > + count="genes", > + summarization="geneModels", > + pattern="[A,C,T,G]{6}\\.bam$", > + outputFormat="RNAseq", > + nbCore=2) > Checking arguments... > Fetching annotations... > Computing gene models... > Summarizing counts... > Processing ACACTG.bam > Processing ACTAGC.bam > Processing ATGGCT.bam > Processing TTGCGA.bam > Preparing output > Warning message: > In easyRNASeq(system.file("extdata", package = "RnaSeqTutorial"), : > There are 2238 synthetic exons as determined from your annotation that overlap! This implies that some reads will be counted more than once! Is that really what you want? >> ADD COMMENT 0 Entering edit mode @delhommeemblde-3232 Last seen 7.1 years ago Hi Wade, On 2 Mar 2012, at 22:38, Davis, Wade wrote: > Hi Nico, > I tried your first solution to the second problem with these results: > >> #rerun you command with the chr.map argument and a "custom" organism argument >> rnaSeq2 <- easyRNASeq(filesDirectory=extdataDir, > + organism="custom", > + chr.sizes=chr.sizes, > + chr.map=chr.map, > + filter=myfilt, > + readLength=50L, > + annotationMethod="biomaRt", > + format="aln", > + count="exons", > + pattern="^1-Feb_ATCACG_L003_R1_001_export.txt.gz$", > + outputFormat="RNAseq" > + ) > Checking arguments... > Fetching annotations... > Error in .getBmRange(organismName(obj), ignoreWarnings = ignoreWarnings, : > The organism custom is not supported by the ensembl biomaRt. > Right, I did not think of that. You can easily fetch the annotation with easyRNASeq. Now that I have the data, I'll give it a try and post the code afterwards. > > I could try your second solution, but I don't have any local annotation file. I could try to get one. From what I could find on the internet, I think the sequencing core used a refflat file (?), which I think is the standard format of ELAND annotation file. I can contact the sequencing core and try to get what they used. But in general, I like the flexibility of BiomaRt. As I said it can be fetched by easyRNASeq. > > A third option is to use samtools /perl and edit the header / chromosome names and export as BAM. In the future, I think I am just going to ask the core for BAM files with "more common" chromosome naming. > Yes, having less complexity is always best. I had the same "issue" with our sequencing facility and it turned out to be easy for them to edit the reference file they were using (pure fasta) for ELAND. Cheers, Nico > > Thanks, > Wade > > > > > -----Original Message----- > From: Nicolas Delhomme [mailto:delhomme at embl.de] > Sent: Friday, March 02, 2012 8:37 AM > To: Davis, Wade > Cc: Bioconductor mailing list > Subject: Re: [BioC] easyRNASeq error > > Hi Wade, > > Great! Thanks for the feedback! > > Let me know for the other issue once you've got the time to try. > > Cheers, > > Nico > > --------------------------------------------------------------- > Nicolas Delhomme > > Genome Biology Computational Support > > European Molecular Biology Laboratory > > Tel: +49 6221 387 8310 > Email: nicolas.delhomme at embl.de > Meyerhofstrasse 1 - Postfach 10.2209 > 69102 Heidelberg, Germany > --------------------------------------------------------------- > > > > > > On 2 Mar 2012, at 15:33, Davis, Wade wrote: > >> Hi Nico, >> Everything worked fine on your examples. Please see below for output. >> >> Thanks, >> Wade >> >> >> R Under development (unstable) (2012-02-28 r58513) Copyright (C) 2012 >> The R Foundation for Statistical Computing ISBN 3-900051-07-0 >> Platform: x86_64-pc-mingw32/x64 (64-bit) >> >> R is free software and comes with ABSOLUTELY NO WARRANTY. >> You are welcome to redistribute it under certain conditions. >> Type 'license()' or 'licence()' for distribution details. >> >> Natural language support but running in an English locale >> >> R is a collaborative project with many contributors. >> Type 'contributors()' for more information and 'citation()' on how to >> cite R or R packages in publications. >> >> Type 'demo()' for some demos, 'help()' for on-line help, or >> 'help.start()' for an HTML browser interface to help. >> Type 'q()' to quit R. >> >>> utils:::menuInstallLocal() >>> utils:::menuInstallLocal() >>> libary(easyRNASeq) >> Error: could not find function "libary" >>> library(easyRNASeq) >> Loading required package: parallel >> Loading required package: genomeIntervals Loading required package: >> intervals Loading required package: Biobase Loading required package: >> BiocGenerics >> >> Attaching package: 'BiocGenerics' >> >> The following object(s) are masked from 'package:stats': >> >> xtabs >> >> The following object(s) are masked from 'package:base': >> >> anyDuplicated, cbind, colnames, duplicated, eval, Filter, Find, >> intersect, lapply, Map, mapply, order, paste, pmax, pmax.int, pmin, >> pmin.int, Position, rbind, Reduce, rep.int, rownames, sapply, >> setdiff, table, tapply, union, unique >> >> Welcome to Bioconductor >> >> Vignettes contain introductory material; view with >> 'browseVignettes()'. To cite Bioconductor, see >> 'citation("Biobase")', and for packages 'citation("pkgname")'. >> >> Loading required package: biomaRt >> Loading required package: edgeR >> Loading required package: limma >> Loading required package: Biostrings >> Loading required package: IRanges >> >> Attaching package: 'IRanges' >> >> The following object(s) are masked from 'package:intervals': >> >> reduce >> >> >> Attaching package: 'Biostrings' >> >> The following object(s) are masked from 'package:intervals': >> >> type >> >> Loading required package: BSgenome >> Loading required package: GenomicRanges >> >> Attaching package: 'GenomicRanges' >> >> The following object(s) are masked from 'package:genomeIntervals': >> >> strand, strand<- >> >> Loading required package: DESeq >> Loading required package: locfit >> Loading required package: akima >> Loading required package: lattice >> locfit 1.5-6 2010-01-20 >> Loading required package: Rsamtools >> Loading required package: ShortRead >> Loading required package: latticeExtra Loading required package: >> RColorBrewer >> >> Attaching package: 'ShortRead' >> >> The following object(s) are masked from 'package:locfit': >> >> left, right >> >> Warning messages: >> 1: replacing previous import 'coerce' when loading 'intervals' >> 2: replacing previous import 'initialize' when loading 'intervals' >>> sessionInfo() >> R Under development (unstable) (2012-02-28 r58513) >> Platform: x86_64-pc-mingw32/x64 (64-bit) >> >> locale: >> [1] LC_COLLATE=English_United States.1252 LC_CTYPE=English_United States.1252 >> [3] LC_MONETARY=English_United States.1252 LC_NUMERIC=C >> [5] LC_TIME=English_United States.1252 >> >> attached base packages: >> [1] parallel stats graphics grDevices utils datasets methods base >> >> other attached packages: >> [1] easyRNASeq_1.1.8 ShortRead_1.13.13 latticeExtra_0.6-20 RColorBrewer_1.0-5 >> [5] Rsamtools_1.7.37 DESeq_1.7.6 locfit_1.5-6 lattice_0.20-0 >> [9] akima_0.5-7 BSgenome_1.23.2 GenomicRanges_1.7.28 Biostrings_2.23.6 >> [13] IRanges_1.13.26 edgeR_2.5.9 limma_3.11.15 biomaRt_2.11.1 >> [17] Biobase_2.15.3 BiocGenerics_0.1.7 genomeIntervals_1.11.0 intervals_0.13.3 >> >> loaded via a namespace (and not attached): >> [1] annotate_1.33.2 AnnotationDbi_1.17.22 bitops_1.0-4.1 DBI_0.2-5 genefilter_1.37.0 >> [6] geneplotter_1.33.1 grid_2.15.0 hwriter_1.3 RCurl_1.91-1.1 RSQLite_0.11.1 >> [11] splines_2.15.0 survival_2.36-12 tools_2.15.0 XML_3.9-4.1 xtable_1.7-0 >> [16] zlibbioc_1.1.1 >>> library("easyRNASeq") >>> library("RnaSeqTutorial") >>> library(BSgenome.Dmelanogaster.UCSC.dm3) >>> >>> rnaSeq <- easyRNASeq(system.file( >> + "extdata", >> + package="RnaSeqTutorial"), >> + organism="Dmelanogaster", >> + chr.sizes=as.list(seqlengths(Dmelanogaster)), >> + readLength=36L, >> + annotationMethod="rda", >> + annotationFile=system.file( >> + "data", >> + "gAnnot.rda", >> + package="RnaSeqTutorial"), >> + format="bam", >> + count="genes", >> + summarization="geneModels", >> + pattern="[A,C,T,G]{6}\\.bam$", >> + outputFormat="RNAseq") >> Checking arguments... >> Fetching annotations... >> Computing gene models... >> Summarizing counts... >> Processing ACACTG.bam >> Processing ACTAGC.bam >> Processing ATGGCT.bam >> Processing TTGCGA.bam >> Preparing output >> Warning message: >> In easyRNASeq(system.file("extdata", package = "RnaSeqTutorial"), : >> There are 2238 synthetic exons as determined from your annotation that overlap! This implies that some reads will be counted more than once! Is that really what you want? >>> rnaSeqPar <- easyRNASeq(system.file( >> + "extdata", >> + package="RnaSeqTutorial"), >> + organism="Dmelanogaster", >> + chr.sizes=as.list(seqlengths(Dmelanogaster)), >> + readLength=36L, >> + annotationMethod="rda", >> + annotationFile=system.file( >> + "data", >> + "gAnnot.rda", >> + package="RnaSeqTutorial"), >> + format="bam", >> + count="genes", >> + summarization="geneModels", >> + pattern="[A,C,T,G]{6}\\.bam$", >> + outputFormat="RNAseq", >> + nbCore=2) >> Checking arguments... >> Fetching annotations... >> Computing gene models... >> Summarizing counts... >> Processing ACACTG.bam >> Processing ACTAGC.bam >> Processing ATGGCT.bam >> Processing TTGCGA.bam >> Preparing output >> Warning message: >> In easyRNASeq(system.file("extdata", package = "RnaSeqTutorial"), : >> There are 2238 synthetic exons as determined from your annotation that overlap! This implies that some reads will be counted more than once! Is that really what you want? >>> >