Question

IRanges package: findOverlaps on blobs

0

Entering edit mode

Hervé Pagès 16k

@herve-pages-1542

Last seen 1 day ago

Seattle, WA, United States

Hi Fahim, On 11-05-31 10:23 AM, Fahim Mohammad wrote: > Hi > Thanks for reply > > RefSeqID targetName strand blockSizes > queryStart targetStart > XM_001065892.1 chr4 + 127,986, > 0,127, 124513961,124514706, > XM_578205.2 chr2 - 535,137,148, > 0,535,672, 155875533,155879894,155895543, > NM_012543.2 chr1 + 506,411,212,494, > 0,506,917,1129, 96173572,96174920,96176574,96177991, > > Is each "block" for each RefSeqID treated individually, or do you want one > overlap to count for all of them? > *Ans: I am looking for the latter option and want one overlap to count for > all of the blocks. For example, If I am given the following sequence with > its alignment > UnknownSeq targetName strand blockSizes > queryStart targetStart > XM_ABCD chr4 + 100, 200, 500 > 0,200, 1200 124513961,124514706, 124515900 > > I am interested in finding which of the three RefSeqIDs above overlaps with > the given unknown sequence. The obvious answer is the first refseqID > (XM_001065892.1). > > > Lastly -- how are these ranges defined? > Is the first row supposed to be turned into: > * The intervals on the target genome for each unknown sequence can be found > using blockSizes and targetStart. For me, the "queryStart" field above is > redundant and wont be used. The first row may be > The intervals in the first row may be found using (targetStart + blockSizes > -1) > chr4 (124513961) -- (124513961 + 127 -1) > chr4 (124514706) -- (124514706 + 986 -1) So it looks like the first thing you might want to do is to import your file into a GRangesList object. Which can be done with something like: library(GenomicRanges) refseqs <- read.table("RefSeqs.txt", header=TRUE, stringsAsFactors=FALSE) starts <- strsplitAsListOfIntegerVectors(refseqs$targetStart) widths <- strsplitAsListOfIntegerVectors(refseqs$blockSizes) ranges <- IRanges(start=unlist(starts), width=unlist(widths)) seqnames <- Rle(factor(refseqs$targetName), elementLengths(starts)) strand <- Rle(strand(refseqs$strand), elementLengths(starts)) gr <- GRanges(seqnames, ranges, strand) grl <- split(gr, rep.int(seq_len(length(starts)), elementLengths(starts))) names(grl) <- refseqs$RefSeqID You should end up with something that looks like this: > grl GRangesList of length 3 $XM_001065892.1 GRanges with 2 ranges and 0 elementMetadata values seqnames ranges strand | <rle> <iranges> <rle> | [1] chr4 [124513961, 124514087] + | [2] chr4 [124514706, 124515691] + | $XM_578205.2 GRanges with 3 ranges and 0 elementMetadata values seqnames ranges strand | <rle> <iranges> <rle> | [1] chr2 [155875533, 155876067] - | [2] chr2 [155879894, 155880030] - | [3] chr2 [155895543, 155895690] - | $NM_012543.2 GRanges with 4 ranges and 0 elementMetadata values seqnames ranges strand | <rle> <iranges> <rle> | [1] chr1 [96173572, 96174077] + | [2] chr1 [96174920, 96175330] + | [3] chr1 [96176574, 96176785] + | [4] chr1 [96177991, 96178484] + | seqlengths chr1 chr2 chr4 NA NA NA which you can then use for finding overlaps with any other range-based object that you have. Hope this helps. Cheers, H. > > Thanks again > Fahim > > > On Tue, May 31, 2011 at 12:05 PM, Steve Lianoglou< > mailinglist.honeypot at gmail.com> wrote: > >> Hi, >> >> On Tue, May 31, 2011 at 11:08 AM, Fahim Mohammad<fahim.md at="" gmail.com=""> >> wrote: >>> Hello >>> I have a file containing a list of aligned Refseq identifier in the >>> following format. I want to convert this file into RangedData format in >>> such a way that "findOverlaps" function be as efficient as possible. I >> know >>> how to do this when each of the identifiers has just a start and an end >>> coordinate. IRanges package is very efficient in finding the overlapping >>> intervals in such case. But when the structure is in the form of blobs >> (as >>> shown below in the last three fields), I am not sure how to convert this >>> structure into RangedData format and how to subsequently call the >>> "findOverlaps" function. >>> >>> RefSeqID targetName strand blockSizes >>> queryStart targetStart >>> XM_001065892.1 chr4 + 127,986, >>> 0,127, 124513961,124514706, >>> XM_578205.2 chr2 - 535,137,148, >>> 0,535,672, 155875533,155879894,155895543, >>> NM_012543.2 chr1 + 506,411,212,494, >>> 0,506,917,1129, 96173572,96174920,96176574,96177991, >> >> I guess the answer lies in how you want overlaps to be calculated. >> >> Is each "block" for each RefSeqID treated individually, or do you want >> one overlap to count for all of them? >> >> If the answer is the latter, you might consider putting the intervals >> into a GRangesList object, where each element in the list is a GRanges >> object that has all the ranges for the particular refseq id ... if you >> want them all individually, then you just have to parse it into a >> GRanges object. >> >> If you're asking *how* to parse the file, there are many ways. I'd >> maybe use a read.table to get this thing into its respective columns, >> then iterate over each row converting it into a GRanges object that >> has as many ranges as "blocks" -- look at ?strsplit if you don't know >> how to do that. >> >> Lastly -- how are these ranges defined? >> Is the first row supposed to be turned into: >> >> chr4 (124513961) -- (124513961 + 0 + 127) >> chr4 (124514706 + 127) -- (124514706 + 127 + 986) >> >> or? >> >> -steve >> >> -- >> Steve Lianoglou >> Graduate Student: Computational Systems Biology >> | Memorial Sloan-Kettering Cancer Center >> | Weill Medical College of Cornell University >> Contact Info: http://cbio.mskcc.org/~lianos/contact >> > > > -- Hervé Pagès Program in Computational Biology Division of Public Health Sciences Fred Hutchinson Cancer Research Center 1100 Fairview Ave. N, M1-B514 P.O. Box 19024 Seattle, WA 98109-1024 E-mail: hpages at fhcrc.org Phone: (206) 667-5791 Fax: (206) 667-1319

Cancer convert IRanges Cancer convert IRanges • 1.3k views

ADD COMMENT • link 14.7 years ago Hervé Pagès 16k

score 0 · Answer 1 · 2011-06-16

0

Entering edit mode

Hervé Pagès 16k

@herve-pages-1542

Last seen 1 day ago

Seattle, WA, United States

On 11-06-08 01:56 PM, Hervé Pagès wrote: > Hi Fahim, [...] > So it looks like the first thing you might want to do is to import > your file into a GRangesList object. Which can be done with something > like: > > library(GenomicRanges) > refseqs <- read.table("RefSeqs.txt", header=TRUE, > stringsAsFactors=FALSE) > starts <- strsplitAsListOfIntegerVectors(refseqs$targetStart) > widths <- strsplitAsListOfIntegerVectors(refseqs$blockSizes) > ranges <- IRanges(start=unlist(starts), width=unlist(widths)) > seqnames <- Rle(factor(refseqs$targetName), elementLengths(starts)) > strand <- Rle(strand(refseqs$strand), elementLengths(starts)) > gr <- GRanges(seqnames, ranges, strand) > grl <- split(gr, rep.int(seq_len(length(starts)), > elementLengths(starts))) > names(grl) <- refseqs$RefSeqID FWIW, I've added an utility function to the devel version of the GenomicRanges package that takes care of making a GRangesList object from this type of input: library(GenomicRanges) refseqs <- read.table("RefSeqs.txt", header=TRUE, stringsAsFactors=FALSE) grl <- with(refseqs, makeGRangesListFromFeatureFragments( seqnames=targetName, fragmentStarts=targetStart, fragmentWidths=blockSizes, strand=strand)) Sorry for the long and ugly name... (suggestions welcome). Cheers, H. -- Hervé Pagès Program in Computational Biology Division of Public Health Sciences Fred Hutchinson Cancer Research Center 1100 Fairview Ave. N, M1-B514 P.O. Box 19024 Seattle, WA 98109-1024 E-mail: hpages at fhcrc.org Phone: (206) 667-5791 Fax: (206) 667-1319

ADD COMMENT • link 14.7 years ago Hervé Pagès 16k

0

Entering edit mode

2011/6/16 Hervé Pagès <hpages@fhcrc.org> > On 11-06-08 01:56 PM, Hervé Pagès wrote: > >> Hi Fahim, >> > [...] > > So it looks like the first thing you might want to do is to import >> your file into a GRangesList object. Which can be done with something >> like: >> >> library(GenomicRanges) >> refseqs <- read.table("RefSeqs.txt", header=TRUE, >> stringsAsFactors=FALSE) >> starts <- strsplitAsListOfIntegerVectors**(refseqs$targetStart) >> widths <- strsplitAsListOfIntegerVectors**(refseqs$blockSizes) >> ranges <- IRanges(start=unlist(starts), width=unlist(widths)) >> seqnames <- Rle(factor(refseqs$targetName)**, elementLengths(starts)) >> strand <- Rle(strand(refseqs$strand), elementLengths(starts)) >> gr <- GRanges(seqnames, ranges, strand) >> grl <- split(gr, rep.int(seq_len(length(starts)**), >> elementLengths(starts))) >> names(grl) <- refseqs$RefSeqID >> > > FWIW, I've added an utility function to the devel version of the > GenomicRanges package that takes care of making a GRangesList object > from this type of input: > > > library(GenomicRanges) > refseqs <- read.table("RefSeqs.txt", header=TRUE, > stringsAsFactors=FALSE) > grl <- with(refseqs, > makeGRangesListFromFeatureFrag**ments( > seqnames=targetName, > fragmentStarts=targetStart, > fragmentWidths=blockSizes, > strand=strand)) > > Sorry for the long and ugly name... (suggestions welcome). > > The "blocks" function in rtracklayer does this for BED-style input (very similar to the above). It takes a RangedData as imported from the BED and outputs a GRanges with a 'tx' column that can be used for splitting into a GRangesList. One major difference is that the fragment starts are specified as offsets from the start of the main feature (the transcript). If your function supported that mode, then the two could be unified. Perhaps import.bed should gain an expandBlocks argument that does this upon import. Michael > Cheers, > H. > > > -- > Hervé Pagès > > Program in Computational Biology > Division of Public Health Sciences > Fred Hutchinson Cancer Research Center > 1100 Fairview Ave. N, M1-B514 > P.O. Box 19024 > Seattle, WA 98109-1024 > > E-mail: hpages@fhcrc.org > Phone: (206) 667-5791 > Fax: (206) 667-1319 > > ______________________________**_________________ > Bioconductor mailing list > Bioconductor@r-project.org > https://stat.ethz.ch/mailman/**listinfo/bioconductor<https: stat.et="" hz.ch="" mailman="" listinfo="" bioconductor=""> > Search the archives: http://news.gmane.org/gmane.** > science.biology.informatics.**conductor<http: news.gmane.org="" gmane.="" science.biology.informatics.conductor=""> > [[alternative HTML version deleted]]

ADD REPLY • link 14.7 years ago Michael Lawrence ★ 11k