question about BSgenome and C.pombe genome, absent in USCS

0

Entering edit mode

Hervé Pagès 16k

@herve-pages-1542

Last seen 2 hours ago

Seattle, WA, United States

Hi Boris, On 04/10/2012 01:52 PM, Zybaylov, Boris L wrote: > Dear list, > > I need to find some sequence patterns in C.pombe genome. > > I thought to use for it BSGenome and matchPattern function, however S.pombe is not available in BSgenome. > > Is it possible to install C.pombe genome from another source (not USCS) and still use it with BSgenome? If you are looking for S. pombe (not C. pombe), the sequences of the 3 chromosomes are available here (as fasta files): ftp://ftp.sanger.ac.uk/pub/yeast/pombe/Chromosome_contigs/ Although it would be convenient to wrap the 3 sequences in a BSgenome data package (and this is pretty easy to do, see the BSgenomeForge vignette in the BSgenome software package), you don't really need one. You can just load each sequence individually in your session with read.DNAStringSet(). Hope this helps. Cheers, H. > > If not, what are other options? > > Thank you! > > Best, > > Boris > > > > Dr. Boris Zybaylov > Instructor > Department of Biochemistry and Molecular Biology > University of Arkansas Medical Sciences > Little Rock, AR > 1-501-686-7254 > Confidentiality Notice: This e-mail message, including a...{{dropped:10}} > > _______________________________________________ > Bioconductor mailing list > Bioconductor at r-project.org > https://stat.ethz.ch/mailman/listinfo/bioconductor > Search the archives: http://news.gmane.org/gmane.science.biology.informatics.conductor -- Hervé Pagès Program in Computational Biology Division of Public Health Sciences Fred Hutchinson Cancer Research Center 1100 Fairview Ave. N, M1-B514 P.O. Box 19024 Seattle, WA 98109-1024 E-mail: hpages at fhcrc.org Phone: (206) 667-5791 Fax: (206) 667-1319

Cancer BSgenome BSgenome Cancer BSgenome BSgenome • 947 views

ADD COMMENT • link updated 12.0 years ago by Henry Paik ▴ 30 • written 12.0 years ago by Hervé Pagès 16k

0

Entering edit mode

Henry Paik ▴ 30

@henry-paik-4630

Last seen 9.6 years ago

Hi List and Herv?, First of all, thank you very much for all your hard work on Bioconductor. I have some questions about OverlapEncodings 1. OverlapEncodings object looks like below. > ovenc1 OverlapEncodings of length 2 Loffset Roffset encoding [1] 4 0 2:jm:af: [2] 4 0 2:jm:af: Under encoding, 2 is the number of exons in this junction. What do jm and af (or i, am, gm,...) mean? 2. How could I know which exons are involved to a junction? or how could I know the exon ranks from OverlapEncodings? extractSpannedExonRanks function is not implemented? When I did ?extractSpannedExonRanks I got No documentation for ?extractSpannedExonRanks? 3. How could I know the number of unique reads on unique junctions? I can filter to get the unique match for each read before bioC analysis. But to quantify the reads on the junctions, I think I need the count of the unique read_id on unique junction? I am learning all these from "Overlap encodings" vignette. Are there any other documentation to learn more about this? Thanks much! - Henry > sessionInfo() R version 2.15.0 (2012-03-30) Platform: x86_64-pc-linux-gnu (64-bit) locale: [1] LC_CTYPE=en_US.UTF-8 LC_NUMERIC=C [3] LC_TIME=en_US.UTF-8 LC_COLLATE=en_US.UTF-8 [5] LC_MONETARY=en_US.UTF-8 LC_MESSAGES=en_US.UTF-8 [7] LC_PAPER=C LC_NAME=C [9] LC_ADDRESS=C LC_TELEPHONE=C [11] LC_MEASUREMENT=en_US.UTF-8 LC_IDENTIFICATION=C attached base packages: [1] stats graphics grDevices utils datasets methods base other attached packages: [1] Rsamtools_1.8.4 Biostrings_2.24.1 GenomicFeatures_1.8.1 [4] AnnotationDbi_1.18.0 Biobase_2.16.0 GenomicRanges_1.8.5 [7] IRanges_1.14.2 BiocGenerics_0.2.0 loaded via a namespace (and not attached): [1] biomaRt_2.12.0 bitops_1.0-4.1 BSgenome_1.24.0 compiler_2.15.0 [5] DBI_0.2-5 RCurl_1.91-1 RSQLite_0.11.1 rtracklayer_1.16.1 [9] stats4_2.15.0 tools_2.15.0 XML_3.9-4 zlibbioc_1.2.0

ADD COMMENT • link 12.0 years ago Henry Paik ▴ 30

0

Entering edit mode

Hi Henry, On 05/08/2012 09:18 AM, Henry Paik wrote: > Hi List and Herv?, > > First of all, thank you very much for all your hard work on Bioconductor. > > I have some questions about OverlapEncodings You are more than welcome to have a look at the "overlap encodings" tools that I've started to introduce in BioC 2.10 (in IRanges 1.14 and GenomicRanges 1.8). However please note that what is in BioC 2.10 is incomplete and very rough around the edges. I'm still actively working on the "overlap encodings" stuff but the improvements are only making it to the devel versions of IRanges/GenomicRanges. If you are still interested in trying those tools, please consider using BioC 2.11 (see ?useDevel in the BiocInstaller package for how to do this). In particular please make sure that you always use the latest version of the GenomicRanges package (currently 1.9.14), and expect frequent updates. > > 1. OverlapEncodings object looks like below. > > > ovenc1 > OverlapEncodings of length 2 > Loffset Roffset encoding > [1] 4 0 2:jm:af: > [2] 4 0 2:jm:af: > > Under encoding, 2 is the number of exons in this junction. What do jm > and af (or i, am, gm,...) mean? Encodings are a little bit tricky to interpret. I've tried to do this in the OverlapEncodings man page (in IRanges) but this is the typical situation where a drawing will do better than a long explanation. Encoding 2:jm:af: is pretty common and is explained (with a drawing) in the "Overlap encodings" vignette (in the GenomicRanges package, I highly recommend you have a look at this vignette). It corresponds to this situation: - read (1 gap): ooooo---ooo - transcript: ... >>>>>>>>> >>>>>>>>> ... As you can see, the gap in the read is "compatible" with the splicing of the transcript. The isCompatibleWithSplicing() utility can be used to detect compatible overlaps (see the vignette for the details). So even though here the 2 at the beginning of the encoding can be seen as the number of exons in the junction, more generally speaking this number is just the number of "chunks" of the aligned read i.e. the number of chunks that the aligner broke the read into in order to align it. In other words, this number is always the number of gaps + 1. Note that it's also the number of 1-letter codes between 2 consecutive colons (:) in the encoding. > > 2. How could I know which exons are involved to a junction? or how could > I know the exon ranks from OverlapEncodings? > extractSpannedExonRanks function is not implemented? extractSpannedExonRanks() is implemented in GenomicRanges 1.9 (and its use is covered in the vignette found in this package). Note that for overlaps that are "compatible" with the splicing of the transcript, yes it does return the ranks of the exons involved in the overlap. Also note that a "compatible" overlap can span more than 1 junction i.e. more than 2 exons, e.g. the 3:jmm:agm:aaf: encoding: - read (2 gaps): oo---ooooo---o - transcript: ... >>>>>>>> >>>>> >>>>>>> ... In that case "compatible" means that all the gaps in the alignment are compatible with consecutive junctions in the transcript. See the vignette for more details. > > When I did > ?extractSpannedExonRanks > I got > No documentation for ?extractSpannedExonRanks? That should work with GenomicRanges 1.9. But the man page doesn't contain much yet, sorry. The vignette has a lot more material and is my first priority at the moment. I want to cover and describe some typical use cases in the vignette first and add/adjust whatever low-level tools are needed for that in the GenomicRanges/IRanges infrastructure. The exact set of low-level tools is still subject to frequent changes (at least potentially) so I'm waiting things to stabilize before I write detailed man pages for them. > > 3. How could I know the number of unique reads on unique junctions? It all depends what having "a read on a junction" means exactly. If having "a read on a junction" means that the aligned read has a gap (i.e. N in the cigar) that matches exactly a splice junction, then note you are not looking at the entire overlap between the read and the transcript and you will count a hit even if the full read is not "compatible" with the exon structure of the transcript. If that's what you want to do, then you don't really need overlap encodings. One way to do this: (1) Extract the unique junctions from 'tx' (the GRangesList object containing the transcripts i.e. the 'subject' you passed to findOverlaps): tx_ranges <- range(tx) table(elementLengths(tx_ranges)) # sanity check (a single entry) tx_junctions <- psetdiff(unlist(tx_ranges, use.names=FALSE), tx) stopifnot(identical(elementLengths(tx_junctions) + 1L, elementLengths(tx))) At this point, the junctions are still grouped by transcript so junctions shared by several transcripts are repeated. If you really want unique junctions: unq_tx_junctions <- unique(unlist(tx_junctions, use.names=FALSE)) (2) Extract the unique gaps from the 'reads' (the GRangesList object containing the reads i.e. the 'query' you passed to findOverlaps). Make sure the names on 'reads' are set to the "Query template names" (this should be the case if you made it with 'grglist(gal, order.as.in.query=TRUE)' and if 'gal' itself was made with 'readGappedAlignments( ... , use.names=TRUE)'): reads_ranges <- range(reads) table(elementLengths(reads_ranges)) # sanity check (a single entry) reads_gaps <- psetdiff(unlist(reads_ranges, use.names=FALSE), reads) stopifnot(identical(elementLengths(reads_gaps) + 1L, elementLengths(reads))) all_reads_gaps <- unlist(reads_gaps, use.names=FALSE) names(all_reads_gaps) <- rep.int(names(reads_gaps), elementLengths(reads_gaps)) ## Gaps with the same name belong to the same read. If 2 ## elements in 'all_reads_gaps' have the same name (i.e. ## belong to the same read) and represent the same genomic ## range, then they are duplicates. Let's get rid of those ## duplicates: is_dup <- duplicated( data.frame( a=names(all_reads_gaps), b=as.factor(seqnames(all_reads_gaps)), c=as.factor(strand(all_reads_gaps)), d=start(all_reads_gaps), e=width(all_reads_gaps))) all_reads_gaps <- all_reads_gaps[!is_dup] (3) Count the number of unique reads per unique junction: countOverlaps(unq_tx_junctions, all_reads_gaps, type="equal") But if you also want to take into account "compatibility" between the read and the transcript for doing this kind of counting, then counting "the number of unique reads on *unique* junctions" becomes a more complicated problem. This is because you need to keep each junction in the context of its transcript in order to tell whether there is compatibility or not and at the same time you loose that context when you tabulate on "unique junctions". Something I could work on in the vignette though... > I can filter to get the unique match for each read before bioC analysis. > But to quantify the reads on the junctions, I think I need the count of > the unique read_id on unique junction? > > I am learning all these from "Overlap encodings" vignette. Are there any > other documentation to learn more about this? That's the only place to look at for the moment. Again, make sure you look at the vignette that is in the *devel* version of the GenomicRanges package. Please don't hesitate to send me your feedback on this. Thanks, H. > > Thanks much! > > - Henry > > > sessionInfo() > R version 2.15.0 (2012-03-30) > Platform: x86_64-pc-linux-gnu (64-bit) > > locale: > [1] LC_CTYPE=en_US.UTF-8 LC_NUMERIC=C > [3] LC_TIME=en_US.UTF-8 LC_COLLATE=en_US.UTF-8 > [5] LC_MONETARY=en_US.UTF-8 LC_MESSAGES=en_US.UTF-8 > [7] LC_PAPER=C LC_NAME=C > [9] LC_ADDRESS=C LC_TELEPHONE=C > [11] LC_MEASUREMENT=en_US.UTF-8 LC_IDENTIFICATION=C > > attached base packages: > [1] stats graphics grDevices utils datasets methods base > > other attached packages: > [1] Rsamtools_1.8.4 Biostrings_2.24.1 GenomicFeatures_1.8.1 > [4] AnnotationDbi_1.18.0 Biobase_2.16.0 GenomicRanges_1.8.5 > [7] IRanges_1.14.2 BiocGenerics_0.2.0 > > loaded via a namespace (and not attached): > [1] biomaRt_2.12.0 bitops_1.0-4.1 BSgenome_1.24.0 compiler_2.15.0 > [5] DBI_0.2-5 RCurl_1.91-1 RSQLite_0.11.1 rtracklayer_1.16.1 > [9] stats4_2.15.0 tools_2.15.0 XML_3.9-4 zlibbioc_1.2.0 > > _______________________________________________ > Bioconductor mailing list > Bioconductor at r-project.org > https://stat.ethz.ch/mailman/listinfo/bioconductor > Search the archives: > http://news.gmane.org/gmane.science.biology.informatics.conductor -- Hervé Pagès Program in Computational Biology Division of Public Health Sciences Fred Hutchinson Cancer Research Center 1100 Fairview Ave. N, M1-B514 P.O. Box 19024 Seattle, WA 98109-1024 E-mail: hpages at fhcrc.org Phone: (206) 667-5791 Fax: (206) 667-1319

ADD REPLY • link 12.0 years ago Hervé Pagès 16k

Login before adding your answer.