Where to get BAM files for easyRNASeq human use case


Richard Friedman ★ 2.0k


Dear List,

I am working through the use case in the easyRNASeq vignette with the human data (section 6 of the vignette). I am not sure where the bam files are for the use case.

Here is the record of my session:

> library(easyRNASeq)
Loading required package: parallel
Loading required package: genomeIntervals
Loading required package: intervals
Loading required package: BiocGenerics

Attaching package: 'BiocGenerics'

The following object(s) are masked from 'package:stats':

    xtabs

The following object(s) are masked from 'package:base':

    anyDuplicated, cbind, colnames, duplicated, eval, Filter, Find, get, intersect, lapply, Map,
    mapply, mget, order, paste, pmax, pmax.int, pmin, pmin.int, Position, rbind, Reduce, rep.int,
    rownames, sapply, setdiff, table, tapply, union, unique

Loading required package: Biobase
Welcome to Bioconductor

    Vignettes contain introductory material; view with 'browseVignettes()'. To cite Bioconductor,
    see 'citation("Biobase")', and for packages 'citation("pkgname")'.

Loading required package: biomaRt
Loading required package: edgeR
Loading required package: limma
Loading required package: Biostrings
Loading required package: IRanges

Attaching package: 'IRanges'

The following object(s) are masked from 'package:intervals':

    reduce

Attaching package: 'Biostrings'

The following object(s) are masked from 'package:intervals':

    type

Loading required package: BSgenome
Loading required package: GenomicRanges
Loading required package: DESeq
Loading required package: locfit
locfit 1.5-8 2012-04-25

Attaching package: 'locfit'

The following object(s) are masked from 'package:GenomicRanges':

    left, right

Loading required package: Rsamtools
Loading required package: ShortRead
Loading required package: lattice
Loading required package: latticeExtra
Loading required package: RColorBrewer
Warning messages:
1: replacing previous import 'coerce' when loading 'intervals'
2: replacing previous import 'initialize' when loading 'intervals'
> library(BSgenome.Hsapiens.UCSC.hg19)
> chr.sizes=as.list(seqlengths(Hsapiens))
> class(chr.sizes)
[1] "list"
> bamfiles=dir(getwd(),pattern="*\\.bam$")
> bamfiles
character(0)
> sessionInfo()
R version 2.15.1 (2012-06-22)
Platform: x86_64-apple-darwin9.8.0/x86_64 (64-bit)

locale:
[1] en_US.UTF-8/en_US.UTF-8/en_US.UTF-8/C/en_US.UTF-8/en_US.UTF-8

attached base packages:
[1] parallel  stats     graphics  grDevices utils     datasets  methods   base

other attached packages:
 [1] BSgenome.Hsapiens.UCSC.hg19_1.3.17 easyRNASeq_1.2.3                   ShortRead_1.14.4
 [4] latticeExtra_0.6-19                RColorBrewer_1.0-5                 lattice_0.20-6
 [7] Rsamtools_1.8.5                    DESeq_1.8.3                        locfit_1.5-8
[10] BSgenome_1.24.0                    GenomicRanges_1.8.7                Biostrings_2.24.1
[13] IRanges_1.14.4                     edgeR_2.6.10                       limma_3.12.1
[16] biomaRt_2.12.0                     Biobase_2.16.0                     genomeIntervals_1.12.0
[19] BiocGenerics_0.2.0                 intervals_0.13.3

loaded via a namespace (and not attached):
 [1] annotate_1.34.1      AnnotationDbi_1.18.1 bitops_1.0-4.1       DBI_0.2-5            genefilter_1.38.0
 [6] geneplotter_1.34.0   grid_2.15.1          hwriter_1.3          RCurl_1.91-1         RSQLite_0.11.1
[11] splines_2.15.1       stats4_2.15.1        survival_2.36-14     XML_3.9-4            xtable_1.7-0
[16] zlibbioc_1.2.0
>

THANKS!
Rich

Richard A. Friedman, PhD
Associate Research Scientist,
Biomedical Informatics Shared Resource
Herbert Irving Comprehensive Cancer Center (HICCC)
Lecturer,
Department of Biomedical Informatics (DBMI)
Educational Coordinator,
Center for Computational Biology and Bioinformatics (C2B2)/
National Center for Multiscale Analysis of Genomic Networks (MAGNet)
Room 824
Irving Cancer Research Center
Columbia University
1130 St. Nicholas Ave
New York, NY 10032
(212)851-4765 (voice)
friedman@cancercenter.columbia.edu
http://cancercenter.columbia.edu/~friedman/

"School is an evil plot to suppress my individuality"
Rose Friedman, age15

Cancer BSgenome • 1.8k views

updated 11.7 years ago by delhomme@embl.de ★ 1.2k • written 11.7 years ago by Richard Friedman ★ 2.0k

delhomme@embl.de ★ 1.2k


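Dear Richard,

Sorry that this information is missing. I added this use case after discussing it with Francesco Lescai; see http://permalink.gmane.org/gmane.science.biology.informatics.conductor/38858. The point of that use case is to explain the importance of having consistent annotations, and I was not expecting it to be used as a tutorial.

From the email exchange with Francesco, I recall that the data is public and had been retrieved from the ENA (SRA). One accession number I found is: SRR349689.

I'll try to look up more information about it, but I'm afraid that there are no readily available bam files for it.

In any case, thanks for pointing that out. I'll try to find a dataset that could be used for that use case and I'll update the vignette as well.

Thanks,
Nico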
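The session in the question ends with `bamfiles` equal to `character(0)`, which later steps would consume silently. A minimal sketch of a guard, using base R only, is shown here. The helper name `find_bam_files` and the scratch directory are hypothetical and are not part of easyRNASeq. The leading `*` in the original pattern looks like a shell glob; `list.files()` expects a regular expression, so the sketch drops it.

```r
# Sketch: collect BAM files from a directory and stop early if none exist.
# `find_bam_files` is a hypothetical helper, not an easyRNASeq function.
find_bam_files <- function(path = getwd()) {
  # "\\.bam$" is a regular expression: a literal ".bam" at the end of the name
  files <- list.files(path, pattern = "\\.bam$", full.names = TRUE)
  if (length(files) == 0L) {
    stop("no .bam files found in '", path, "'; ",
         "download or align the reads first", call. = FALSE)
  }
  files
}

# Demonstration in a scratch directory with two empty placeholder files
demo_dir <- file.path(tempdir(), "bam_demo")
dir.create(demo_dir, showWarnings = FALSE)
file.create(file.path(demo_dir, c("sample1.bam", "sample1.bam.bai")))
bam_files <- find_bam_files(demo_dir)
basename(bam_files)
```

Anchoring the pattern with `$` keeps index files such as `sample1.bam.bai` out of the result.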


Dear Nico,

Thanks for offering to revise the vignette. I always find it best to do a worked example on its original dataset. I am sure that it will be useful to many other workers in this field.

I would then like to ask a broader question, one that I was going to ask after I completed the vignette: is it possible to obtain annotation for RNA-Seq data analogous to the kind obtained for microarrays?

To be specific, when I analyze Affymetrix microarrays I get, for each probeset, the Entrez gene symbol and a description of the gene, which could be several words long, as well as Gene Ontology categories and pathways. I can output this information as an Excel spreadsheet. When I work through the Drosophila vignette with transcriptCounts or geneCounts, I get accession numbers (e.g., "FBtr0005009") but no gene symbols etc.

Do you have any suggestions as to how to get Entrez gene symbols, descriptions, etc., for RNA-Seq output with easyRNASeq?

Thanks and best wishes,
Rich

On Aug 16, 2012, at 12:17 PM, Nicolas Delhomme wrote:

> [snip]
>
> From the email exchange with Francesco, I recall that the data is public and had been retrieved from the ENA (SRA). One accession number I found is: SRR349689.
>
> I'll try to look up more information about it, but I'm afraid that there are no readily available bam files for it.
>
> In any case, thanks for pointing that out. I'll try to find out a dataset that could be used for that use case and I'll update the vignette as well.
>
> Thanks,
> Nico

Richard Friedman ★ 2.0k • 11.7 years ago

Hi,

On Thu, Aug 16, 2012 at 1:17 PM, Richard Friedman <friedman at cancercenter.columbia.edu> wrote:

[snip]
> Do you have any suggestions as to how to get Entrez Gene Symbols,
> descriptions, etc, for RNASeq output with easy RNASeq?
[/snip]

Perhaps I'm missing something, but given accession numbers (or other gene identifiers), it should be pretty straightforward to jimmy up something using the org.*.eg.db packages, no?

I suspect you won't get gene descriptions there -- but if I were a gambling man, I would bet you can probably get that last piece of the puzzle from biomaRt.

HTH,
-steve

--
Steve Lianoglou
Graduate Student: Computational Systems Biology
| Memorial Sloan-Kettering Cancer Center
| Weill Medical College of Cornell University
Contact Info: http://cbio.mskcc.org/~lianos/contact

Steve Lianoglou ★ 13k • 11.7 years ago

Steve,

Thanks. I use annaffy for microarrays and was hoping for an already-worked-out protocol. I will, however, look into the package you recommend if no more explicit protocol is available.

Best wishes,
Rich

Richard Friedman ★ 2.0k • 11.7 years ago

Not so much an already worked out protocol but an elaboration of Steve's bet.

An AnnotateSeq package would be a useful addition; the info in annaffy is in the org packages, discoverable with 'cols', 'keytypes' (often synonymous with 'cols'), and accessible via 'select'. The plans for the next release are OrganismDb objects that make the merge that one would do across, say, org*, TxDb*, and GO.db packages transparent.

> library(org.Dm.eg.db)
> cols(org.Dm.eg.db)
 [1] "ENTREZID"     "ACCNUM"       "ALIAS"        "CHR"          "CHRLOC"
 [6] "CHRLOCEND"    "ENZYME"       "MAP"          "PATH"         "PMID"
[11] "REFSEQ"       "SYMBOL"       "UNIGENE"      "ENSEMBL"      "ENSEMBLPROT"
[16] "ENSEMBLTRANS" "GENENAME"     "UNIPROT"      "GO"           "EVIDENCE"
[21] "ONTOLOGY"     "FLYBASE"      "FLYBASECG"    "FLYBASEPROT"
> select(org.Dm.eg.db, "FBtr0005009", c("GENENAME", "SYMBOL"), "ENSEMBLTRANS")
  ENSEMBLTRANS          GENENAME SYMBOL
1  FBtr0005009 Muscle protein 20   Mp20

Martin

--
Computational Biology / Fred Hutchinson Cancer Research Center
1100 Fairview Ave. N.
PO Box 19024 Seattle, WA 98109
Location: Arnold Building M1 B861
Phone: (206) 667-2793
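To get from a `select()` result to the annotated spreadsheet Rich asked about, the annotation table still has to be joined to the count table. The following is a rough sketch in base R, not an easyRNASeq or AnnotationDbi feature: `annotate_counts`, the toy count matrix, the ID `FBtr_unknown`, and the output file name are all made up for illustration, and the single annotation row is the one shown in the thread. Note also that in more recent AnnotationDbi releases the `cols` accessor is called `columns`.

```r
# Sketch: join a count matrix (feature IDs as row names) to an annotation
# table shaped like the data.frame returned by AnnotationDbi::select().
# `annotate_counts` is a hypothetical helper written for this example.
annotate_counts <- function(counts, annot, id_col = "ENSEMBLTRANS") {
  stopifnot(id_col %in% colnames(annot))
  # select() can return several rows per key; keep only the first match
  annot <- annot[!duplicated(annot[[id_col]]), , drop = FALSE]
  idx <- match(rownames(counts), annot[[id_col]])
  out <- data.frame(id = rownames(counts),
                    annot[idx, setdiff(colnames(annot), id_col), drop = FALSE],
                    counts,
                    check.names = FALSE,
                    stringsAsFactors = FALSE)
  rownames(out) <- NULL
  out
}

# Toy counts (assumed values). With the org package installed, `annot` could
# instead come from something like:
#   select(org.Dm.eg.db, keys = rownames(counts),
#          columns = c("GENENAME", "SYMBOL"), keytype = "ENSEMBLTRANS")
counts <- matrix(c(10L, 0L, 12L, 3L), nrow = 2,
                 dimnames = list(c("FBtr0005009", "FBtr_unknown"),
                                 c("s1", "s2")))
annot <- data.frame(ENSEMBLTRANS = "FBtr0005009",
                    GENENAME = "Muscle protein 20",
                    SYMBOL = "Mp20",
                    stringsAsFactors = FALSE)

annotated <- annotate_counts(counts, annot)

# A CSV file opens directly in Excel
out_file <- file.path(tempdir(), "annotated_counts.csv")
write.csv(annotated, out_file, row.names = FALSE)
```

Features without an annotation entry are kept with `NA` in the annotation columns, so the output always has one row per row of the count table.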

Martin Morgan 25k • 11.7 years ago

Martin,

Thanks! That will get me started!

Best wishes,
Rich

Richard Friedman ★ 2.0k • 11.7 years ago

Dear Richard,

I've implemented SummarizedExperiment support in easyRNASeq version 1.3.14, to be available in a couple of days from Bioc. If you set the outputFormat to "SummarizedExperiment", you'll get an object that contains the annotation used by easyRNASeq in its rowData slot. I've added this to the vignette; see section 6.

To make things easier, I've created a 'count' function that does the same as the 'easyRNASeq' one, but where SummarizedExperiment is the default output. I plan to have this function supersede 'easyRNASeq', but, as you'll be warned, it will be subject to many changes in the future, so don't rely on it yet in your production code.

One foreseen extension eased by the use of SummarizedExperiment is to fetch additional annotation using biomaRt and/or any of the "org" packages. You've had some discussion about it in this email thread, and if you come up with a solution, let me know, as I could easily integrate it in the package. It's always easier to do such things when one has an example at hand.

In addition, I've extended the human use case to retrieve and align reads using a variety of Bioc packages. Sadly, some of them are only available for the Unix platform. I'd be really interested in your feedback; that's in section 7 of the updated vignette.

Best,
Nico

---------------------------------------------------------------
Nicolas Delhomme
Genome Biology Computational Support
European Molecular Biology Laboratory
Tel: +49 6221 387 8310
Email: nicolas.delhomme at embl.de
Meyerhofstrasse 1 - Postfach 10.2209
69102 Heidelberg, Germany
---------------------------------------------------------------

delhomme@embl.de ★ 1.2k • 11.6 years ago

Hi Rich,

There is some annotation available already in easyRNASeq if you use the "RNAseq" outputFormat. The genomicAnnotation slot of that object gives you access to the information read by the easyRNASeq method from either your gtf or gff file, or retrieved from biomaRt. The annotation available depends on the content of your gtf/gff file (returned as a RangedData object). When using biomaRt to retrieve the annotation, you would only get additional loci information (start, end, strand,...).

Your suggestions (Rich, Steve, Martin) are very interesting; I'll jot them down in my TODO list. I haven't considered that earlier, as easyRNASeq is at the beginning of the processing pipeline. In most cases, additional analyses are performed and these all have their own formats. Annotating the results of those is at the moment probably the most efficient approach. I haven't checked, but for some of the downstream analyses I support, I should be able to have the annotation kept. If no further analysis is required, I could return an object containing the annotation in addition to the count table.

Martin - how standardized has the SummarizedExperiment class become? I suppose it is what I should be using for that purpose, right? One constraint I would have is that I need to generate an output that can easily be re-used by downstream analysis tools such as edgeR, DESeq, DEXSeq,... Do you know of any effort on migrating these "proprietary" object structures towards a common one?

Cheers,
Nico

---------------------------------------------------------------
Nicolas Delhomme
Genome Biology Computational Support
European Molecular Biology Laboratory
Tel: +49 6221 387 8310
Email: nicolas.delhomme at embl.de
Meyerhofstrasse 1 - Postfach 10.2209
69102 Heidelberg, Germany
---------------------------------------------------------------

delhomme@embl.de ★ 1.2k • 11.7 years ago

On 8/17/12 9:48 AM, Nicolas Delhomme wrote:
> Martin - how standardized has the SummarizedExperiment class become?
> I suppose it is what I should be using for that purpose, right? One
> constraint I would have is that I need to generate an output that can
> easily be re-used by downstream analyses tool such as edgeR, DESeq,
> DEXSeq,... Do you know of any effort on migrating these "proprietary"
> object structures towards a common one?

We'd be happy to add methods or converters from SummarizedExperiment to DESeq's CountDataSet and DEXSeq's ExonCountSet classes, presumably into these packages.

The problem is the reverse direction: SummarizedExperiment insists on having (non-NA) ranges information (start, end, width), while this is not a restriction that would make sense to impose on count tables for DESeq or DEXSeq.

Best wishes
Wolfgang

Wolfgang Huber
EMBL
http://www.embl.de/research/units/genome_biology/huber

ADD REPLY • link 11.7 years ago Wolfgang Huber ★ 13k


Hi,

On Friday, August 17, 2012, Wolfgang Huber wrote:
> On 8/17/12 9:48 AM, Nicolas Delhomme wrote:

[snip]

> We'd be happy to add methods or converters from SummarizedExperiment to
> DESeq's CountDataSet and DEXSeq's ExonCountSet classes, presumably into
> these packages.
>
> The problem is the reverse direction: SummarizedExperiment insists on
> having (non-NA) ranges information (start, end, width), while this is not a
> restriction that would make sense to impose on count tables for DESeq or
> DEXSeq.

Interesting.

I'm trying to think of why this restriction doesn't make sense for DESeq and co's count tables, but I'm drawing a blank.

The counts in each row of the count table are surely coming from some genomic locus, no?

Are you thinking about things like gene fusion events or something?

Thanks,
-steve

--
Steve Lianoglou
Graduate Student: Computational Systems Biology
| Memorial Sloan-Kettering Cancer Center
| Weill Medical College of Cornell University
Contact Info: http://cbio.mskcc.org/~lianos/contact

ADD REPLY • link 11.7 years ago Steve Lianoglou ★ 13k


On 08/17/2012 04:36 AM, Steve Lianoglou wrote:
> On Friday, August 17, 2012, Wolfgang Huber wrote:
>> On 8/17/12 9:48 AM, Nicolas Delhomme wrote:
>
> [snip]
>
>> We'd be happy to add methods or converters from SummarizedExperiment to
>> DESeq's CountDataSet and DEXSeq's ExonCountSet classes, presumably into
>> these packages.
>>
>> The problem is the reverse direction: SummarizedExperiment insists on
>> having (non-NA) ranges information (start, end, width), while this is not a
>> restriction that would make sense to impose on count tables for DESeq or
>> DEXSeq.

I recently implemented support for no coordinates,

  > m = matrix(0, 10, 5, dimnames=list(LETTERS[1:10], letters[1:5]))
  > sx = SummarizedExperiment(m) ## etc;
  > class(rowData(sx))
  [1] "GRangesList"
  attr(,"package")
  [1] "GenomicRanges"

the rowData is a GRangesList of length nrow(m), with every element of length 0.

Martin

--
Computational Biology / Fred Hutchinson Cancer Research Center
1100 Fairview Ave. N.
PO Box 19024 Seattle, WA 98109

Location: Arnold Building M1 B861
Phone: (206) 667-2793

ADD REPLY • link 11.7 years ago Martin Morgan 25k


Hi Martin,

thanks! I wasn't aware this worked now (and it wasn't immediately obvious when I tried this morning). So we can move ahead. I'll discuss with Simon and Alejandro about how to proceed (without major disruption or distraction).

Steve: you asked why. For DESeq, we have had applications e.g. with count data from mass spec (where there is sometimes no associated genomic interval), or from HiC (where there are typically two genomic intervals for each count). I'd like to keep the flexibility for dealing with these sorts of situations. And people might always have count tables where the coordinates were dropped - while not ideal, there is no reason why DE(X)Seq should refuse to work with these.

Best wishes
Wolfgang

Wolfgang Huber
EMBL
http://www.embl.de/research/units/genome_biology/huber
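To make the mass spec and HiC cases concrete, here is a small base-R sketch of a count table whose rows carry zero, one, or two genomic intervals. It only mimics the idea of a list of ranges per row; the names `row_ranges`, `empty_ranges`, and the toy values are assumptions for illustration, not the GenomicRanges API and not real data.

```r
# An empty set of intervals, used for rows with no genomic coordinates.
empty_ranges <- data.frame(chrom = character(0), start = integer(0),
                           end = integer(0), stringsAsFactors = FALSE)

# One element per row of the count table; each element holds 0..n intervals.
row_ranges <- list(
  # mass spec feature: no associated genomic interval
  peptide_1 = empty_ranges,
  # ordinary gene-level count: one interval
  gene_1 = data.frame(chrom = "chr1", start = 100L, end = 200L,
                      stringsAsFactors = FALSE),
  # HiC contact: two intervals for a single count
  contact_1 = data.frame(chrom = c("chr1", "chr2"),
                         start = c(1000L, 5000L),
                         end   = c(2000L, 6000L),
                         stringsAsFactors = FALSE)
)

# Toy counts: rows line up with the names of row_ranges.
counts <- matrix(c(12L, 30L, 4L, 15L, 28L, 9L), nrow = 3,
                 dimnames = list(names(row_ranges), c("s1", "s2")))

# Number of intervals attached to each row.
n_intervals <- vapply(row_ranges, nrow, integer(1))
```

A per-row list like this keeps the count matrix usable regardless of how many intervals a row has, which is the flexibility argued for above.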

ADD REPLY • link 11.7 years ago Wolfgang Huber ★ 13k


Hi Wolfgang,

Can you please let me know once you've got it scheduled? The first deadline for the 2.11 release is in a month, and I will add support for SummarizedExperiment in easyRNASeq by then. Changing its interface to DESeq/DEXSeq would not take long either, but it would be good to know whether this would be for 2.11 or 2.12.

Cheers,
Nico

---------------------------------------------------------------
Nicolas Delhomme
Genome Biology Computational Support
European Molecular Biology Laboratory
Tel: +49 6221 387 8310
Email: nicolas.delhomme at embl.de
Meyerhofstrasse 1 - Postfach 10.2209
69102 Heidelberg, Germany
---------------------------------------------------------------

ADD REPLY • link 11.7 years ago delhomme@embl.de ★ 1.2k
