Counting number of intervals with or without GRange

0

Entering edit mode

Hermann Norpois ▴ 110

@hermann-norpois-5483

Last seen 11.4 years ago

I have data as GRange object test.gr GRanges with 20 ranges and 0 metadata columns: seqnames ranges strand <rle> <iranges> <rle> [1] chr1 [ 10246, 10283] * [2] chr1 [237718, 237887] * [3] chr1 [521553, 521613] * [4] chr1 [564413, 564686] * or as classical dataframe head (test) seqnames start end width strand 1 chr1 10246 10283 38 * 2 chr1 237718 237887 170 * 3 chr1 521553 521613 61 * 4 chr1 564413 564686 274 * and I wish to count how often do intervals of a certain range exist. I would be happy to get something like this: width occurence of the width sum of the occurence 0-50 2 2 50-100 4 6 100-150 5 11 I would prefer to do this using GRange objects directly but I would be happy to know a way how to do it via test (classical dataframe). Could you please give me a hint. Thanks Hermann P.S.: My data: > dput (test) structure(list(seqnames = structure(c(1L, 1L, 1L, 1L, 1L, 1L, 1L, 1L, 1L, 1L, 1L, 1L, 1L, 1L, 1L, 1L, 1L, 1L, 1L, 1L), .Label = c("chr1", "chr10", "chr11", "chr12", "chr13", "chr14", "chr15", "chr16", "chr17", "chr18", "chr19", "chr2", "chr20", "chr21", "chr22", "chr3", "chr4", "chr5", "chr6", "chr7", "chr8", "chr9", "chrX", "chrY"), class = "factor"), start = c(10246L, 237718L, 521553L, 564413L, 565252L, 566538L, 567450L, 568777L, 569457L, 669190L, 713884L, 740270L, 752164L, 753266L, 754421L, 762051L, 785811L, 786878L, 791087L, 793485L), end = c(10283L, 237887L, 521613L, 564686L, 566081L, 567275L, 568583L, 569302L, 570125L, 669259L, 714603L, 740423L, 752782L, 753697L, 754711L, 763151L, 786025L, 787053L, 791138L, 793582L), width = c(38L, 170L, 61L, 274L, 830L, 738L, 1134L, 526L, 669L, 70L, 720L, 154L, 619L, 432L, 291L, 1101L, 215L, 176L, 52L, 98L), strand = structure(c(3L, 3L, 3L, 3L, 3L, 3L, 3L, 3L, 3L, 3L, 3L, 3L, 3L, 3L, 3L, 3L, 3L, 3L, 3L, 3L), .Label = c("+", "-", "*"), class = "factor")), .Names = c("seqnames", "start", "end", "width", "strand"), row.names = c(NA, 20L), class = "data.frame") > dput test.gr) new("GRanges" , seqnames = new("Rle" , values = structure(1L, .Label = c("chr1", "chr10", "chr11", "chr12", "chr13", "chr14", "chr15", "chr16", "chr17", "chr18", "chr19", "chr2", "chr20", "chr21", "chr22", "chr3", "chr4", "chr5", "chr6", "chr7", "chr8", "chr9", "chrX", "chrY"), class = "factor") , lengths = 20L , elementMetadata = NULL , metadata = list() ) , ranges = new("IRanges" , start = c(10246L, 237718L, 521553L, 564413L, 565252L, 566538L, 567450L, 568777L, 569457L, 669190L, 713884L, 740270L, 752164L, 753266L, 754421L, 762051L, 785811L, 786878L, 791087L, 793485L) , width = c(38L, 170L, 61L, 274L, 830L, 738L, 1134L, 526L, 669L, 70L, 720L, 154L, 619L, 432L, 291L, 1101L, 215L, 176L, 52L, 98L) , NAMES = NULL , elementType = "integer" , elementMetadata = NULL , metadata = list() ) , strand = new("Rle" , values = structure(3L, .Label = c("+", "-", "*"), class = "factor") , lengths = 20L , elementMetadata = NULL , metadata = list() ) , elementMetadata = new("DataFrame" , rownames = NULL , nrows = 20L , listData = structure(list(), .Names = character(0)) , elementType = "ANY" , elementMetadata = NULL , metadata = list() ) , seqinfo = new("Seqinfo" , seqnames = c("chr1", "chr10", "chr11", "chr12", "chr13", "chr14", "chr15", "chr16", "chr17", "chr18", "chr19", "chr2", "chr20", "chr21", "chr22", "chr3", "chr4", "chr5", "chr6", "chr7", "chr8", "chr9", "chrX", "chrY") , seqlengths = c(NA_integer_, NA_integer_, NA_integer_, NA_integer_, NA_integer_, NA_integer_, NA_integer_, NA_integer_, NA_integer_, NA_integer_, NA_integer_, NA_integer_, NA_integer_, NA_integer_, NA_integer_, NA_integer_, NA_integer_, NA_integer_, NA_integer_, NA_integer_, NA_integer_, NA_integer_, NA_integer_, NA_integer_) , is_circular = c(NA, NA, NA, NA, NA, NA, NA, NA, NA, NA, NA, NA, NA, NA, NA, NA, NA, NA, NA, NA, NA, NA, NA, NA) , genome = c(NA_character_, NA_character_, NA_character_, NA_character_, NA_character_, NA_character_, NA_character_, NA_character_, NA_character_, NA_character_, NA_character_, NA_character_, NA_character_, NA_character_, NA_character_, NA_character_, NA_character_, NA_character_, NA_character_, NA_character_, NA_character_, NA_character_, NA_character_, NA_character_ ) ) , metadata = list() ) > [[alternative HTML version deleted]]

• 1.3k views

ADD COMMENT • link updated 13.1 years ago by Martin Morgan 25k • written 13.1 years ago by Hermann Norpois ▴ 110

0

Entering edit mode

Martin Morgan 25k

@martin-morgan-1513

Last seen 1 day ago

United States

On 12/23/2012 08:42 AM, Hermann Norpois wrote: > I have data as GRange object > > test.gr > GRanges with 20 ranges and 0 metadata columns: > seqnames ranges strand > <rle> <iranges> <rle> > [1] chr1 [ 10246, 10283] * > [2] chr1 [237718, 237887] * > [3] chr1 [521553, 521613] * > [4] chr1 [564413, 564686] * > > or as classical dataframe > > head (test) > seqnames start end width strand > 1 chr1 10246 10283 38 * > 2 chr1 237718 237887 170 * > 3 chr1 521553 521613 61 * > 4 chr1 564413 564686 274 * > > and I wish to count how often do intervals of a certain range exist. I > would be happy to get something like this: > > width occurence of the width sum of the occurence > 0-50 2 2 > 50-100 4 6 > 100-150 5 11 > > I would prefer to do this using GRange objects directly but I would be > happy to know a way how to do it via test (classical dataframe). > Could you please give me a hint. Hi Hermann -- Maybe > w = widthtest.gr) > as.data.frame(table(cut(w, seq(0, max(w), by=50)))) Var1 Freq 1 (0,50] 1 2 (50,100] 4 3 (100,150] 0 4 (150,200] 3 ... or a little more 'raw' > tabulate(cut(w, seq(0, max(w), by=50))) [1] 1 4 0 3 1 2 0 0 1 0 1 0 1 1 2 0 1 > cumsum(tabulate(cut(w, seq(0, max(w), by=50)))) [1] 1 5 5 8 9 11 11 11 12 12 13 13 14 15 17 17 18 ? Martin > > > Thanks Hermann > > > P.S.: My data: > >> dput (test) > structure(list(seqnames = structure(c(1L, 1L, 1L, 1L, 1L, 1L, > 1L, 1L, 1L, 1L, 1L, 1L, 1L, 1L, 1L, 1L, 1L, 1L, 1L, 1L), .Label = c("chr1", > "chr10", "chr11", "chr12", "chr13", "chr14", "chr15", "chr16", > "chr17", "chr18", "chr19", "chr2", "chr20", "chr21", "chr22", > "chr3", "chr4", "chr5", "chr6", "chr7", "chr8", "chr9", "chrX", > "chrY"), class = "factor"), start = c(10246L, 237718L, 521553L, > 564413L, 565252L, 566538L, 567450L, 568777L, 569457L, 669190L, > 713884L, 740270L, 752164L, 753266L, 754421L, 762051L, 785811L, > 786878L, 791087L, 793485L), end = c(10283L, 237887L, 521613L, > 564686L, 566081L, 567275L, 568583L, 569302L, 570125L, 669259L, > 714603L, 740423L, 752782L, 753697L, 754711L, 763151L, 786025L, > 787053L, 791138L, 793582L), width = c(38L, 170L, 61L, 274L, 830L, > 738L, 1134L, 526L, 669L, 70L, 720L, 154L, 619L, 432L, 291L, 1101L, > 215L, 176L, 52L, 98L), strand = structure(c(3L, 3L, 3L, 3L, 3L, > 3L, 3L, 3L, 3L, 3L, 3L, 3L, 3L, 3L, 3L, 3L, 3L, 3L, 3L, 3L), .Label = > c("+", > "-", "*"), class = "factor")), .Names = c("seqnames", "start", > "end", "width", "strand"), row.names = c(NA, 20L), class = "data.frame") >> dput test.gr) > new("GRanges" > , seqnames = new("Rle" > , values = structure(1L, .Label = c("chr1", "chr10", "chr11", "chr12", > "chr13", > "chr14", "chr15", "chr16", "chr17", "chr18", "chr19", "chr2", > "chr20", "chr21", "chr22", "chr3", "chr4", "chr5", "chr6", "chr7", > "chr8", "chr9", "chrX", "chrY"), class = "factor") > , lengths = 20L > , elementMetadata = NULL > , metadata = list() > ) > , ranges = new("IRanges" > , start = c(10246L, 237718L, 521553L, 564413L, 565252L, 566538L, > 567450L, > 568777L, 569457L, 669190L, 713884L, 740270L, 752164L, 753266L, > 754421L, 762051L, 785811L, 786878L, 791087L, 793485L) > , width = c(38L, 170L, 61L, 274L, 830L, 738L, 1134L, 526L, 669L, 70L, > 720L, > 154L, 619L, 432L, 291L, 1101L, 215L, 176L, 52L, 98L) > , NAMES = NULL > , elementType = "integer" > , elementMetadata = NULL > , metadata = list() > ) > , strand = new("Rle" > , values = structure(3L, .Label = c("+", "-", "*"), class = "factor") > , lengths = 20L > , elementMetadata = NULL > , metadata = list() > ) > , elementMetadata = new("DataFrame" > , rownames = NULL > , nrows = 20L > , listData = structure(list(), .Names = character(0)) > , elementType = "ANY" > , elementMetadata = NULL > , metadata = list() > ) > , seqinfo = new("Seqinfo" > , seqnames = c("chr1", "chr10", "chr11", "chr12", "chr13", "chr14", > "chr15", > "chr16", "chr17", "chr18", "chr19", "chr2", "chr20", "chr21", > "chr22", "chr3", "chr4", "chr5", "chr6", "chr7", "chr8", "chr9", > "chrX", "chrY") > , seqlengths = c(NA_integer_, NA_integer_, NA_integer_, NA_integer_, > NA_integer_, > NA_integer_, NA_integer_, NA_integer_, NA_integer_, NA_integer_, > NA_integer_, NA_integer_, NA_integer_, NA_integer_, NA_integer_, > NA_integer_, NA_integer_, NA_integer_, NA_integer_, NA_integer_, > NA_integer_, NA_integer_, NA_integer_, NA_integer_) > , is_circular = c(NA, NA, NA, NA, NA, NA, NA, NA, NA, NA, NA, NA, NA, > NA, NA, > NA, NA, NA, NA, NA, NA, NA, NA, NA) > , genome = c(NA_character_, NA_character_, NA_character_, > NA_character_, > NA_character_, NA_character_, NA_character_, NA_character_, NA_character_, > NA_character_, NA_character_, NA_character_, NA_character_, NA_character_, > NA_character_, NA_character_, NA_character_, NA_character_, NA_character_, > NA_character_, NA_character_, NA_character_, NA_character_, NA_character_ > ) > ) > , metadata = list() > ) >> > > [[alternative HTML version deleted]] > > _______________________________________________ > Bioconductor mailing list > Bioconductor at r-project.org > https://stat.ethz.ch/mailman/listinfo/bioconductor > Search the archives: http://news.gmane.org/gmane.science.biology.informatics.conductor > -- Computational Biology / Fred Hutchinson Cancer Research Center 1100 Fairview Ave. N. PO Box 19024 Seattle, WA 98109 Location: Arnold Building M1 B861 Phone: (206) 667-2793

ADD COMMENT • link 13.1 years ago Martin Morgan 25k

0

Entering edit mode

Woops, didn't see this, sorry .... this is the ticket, though :-) On Sunday, December 23, 2012, Martin Morgan wrote: > On 12/23/2012 08:42 AM, Hermann Norpois wrote: > >> I have data as GRange object >> >> test.gr >> GRanges with 20 ranges and 0 metadata columns: >> seqnames ranges strand >> <rle> <iranges> <rle> >> [1] chr1 [ 10246, 10283] * >> [2] chr1 [237718, 237887] * >> [3] chr1 [521553, 521613] * >> [4] chr1 [564413, 564686] * >> >> or as classical dataframe >> >> head (test) >> seqnames start end width strand >> 1 chr1 10246 10283 38 * >> 2 chr1 237718 237887 170 * >> 3 chr1 521553 521613 61 * >> 4 chr1 564413 564686 274 * >> >> and I wish to count how often do intervals of a certain range exist. I >> would be happy to get something like this: >> >> width occurence of the width sum of the occurence >> 0-50 2 2 >> 50-100 4 6 >> 100-150 5 11 >> >> I would prefer to do this using GRange objects directly but I would be >> happy to know a way how to do it via test (classical dataframe). >> Could you please give me a hint. >> > > Hi Hermann -- > > Maybe > > > w = widthtest.gr) > > as.data.frame(table(cut(w, seq(0, max(w), by=50)))) > Var1 Freq > 1 (0,50] 1 > 2 (50,100] 4 > 3 (100,150] 0 > 4 (150,200] 3 > ... > > or a little more 'raw' > > > tabulate(cut(w, seq(0, max(w), by=50))) > [1] 1 4 0 3 1 2 0 0 1 0 1 0 1 1 2 0 1 > > cumsum(tabulate(cut(w, seq(0, max(w), by=50)))) > [1] 1 5 5 8 9 11 11 11 12 12 13 13 14 15 17 17 18 > > ? Martin > > > > > > Thanks Hermann > > > P.S.: My data: > > dput (test) > > structure(list(seqnames = structure(c(1L, 1L, 1L, 1L, 1L, 1L, > 1L, 1L, 1L, 1L, 1L, 1L, 1L, 1L, 1L, 1L, 1L, 1L, 1L, 1L), .Label = c("chr1", > "chr10", "chr11", "chr12", "chr13", "chr14", "chr15", "chr16", > "chr17", "chr18", "chr19", "chr2", "chr20", "chr21", "chr22", > "chr3", "chr4", "chr5", "chr6", "chr7", "chr8", "chr9", "chrX", > "chrY"), class = "factor"), start = c(10246L, 237718L, 521553L, > 564413L, 565252L, 566538L, 567450L, 568777L, 569457L, 669190L, > 713884L, 740270L, 752164L, 753266L, 754421L, 762051L, 785811L, > 786878L, 791087L, 793485L), end = c(10283L, 237887L, 521613L, > 564686L, 566081L, 567275L, 568583L, 569302L, 570125L, 669259L, > 714603L, 740423L, 752782L, 753697L, 754711L, 763151L, 786025L, > 787053L, 791138L, 793582L), width = c(38L, 170L, 61L, 274L, 830L, > 738L, 1134L, 526L, 669L, 70L, 720L, 154L, 619L, 432L, 291L, 1101L, > 215L, 176L, 52L, 98L), strand = structure(c(3L, 3L, 3L, 3L, 3L, > 3L, 3L, 3L, 3L, 3L, 3L, 3L, 3L, 3L, 3L, 3L, 3L, 3L, 3L, 3L), .Label = > c("+", > "-", "*"), class = "factor")), .Names = c("seqnames", "start", > "end", "width", "strand"), row.names = c(NA, 20L), class = "data.frame") > > dput test.gr) > > new("GRanges" > , seqnames = new("Rle" > , values = structure(1L, .Label = c("chr1", "chr10", "chr11", "chr12", > "chr13", > "chr14", "chr15", "chr16", "chr17", "chr18", "chr19", "chr2", > "chr20", "chr21", "chr22", "chr3", "chr4", "chr5", "chr6", "chr7", > "chr8", "chr9", "chrX", "chrY"), class = "factor") > , lengths = 20L > , elementMetadata = NULL > , metadata = list() > ) > , ranges = new("IRanges" > , start = c(10246L, 237718L, 521553L, 564413L, 565252L, 566538L, > 567450L, > 568777L, 569457L, 669190L, 713884L, 740270L, 752164L, 753266L, > 754421L, 762051L, 785811L, 786878L, 791087L, 793485L) > , width = c(38L, 170L, 61L, 274L, 830L, 738L, 1134L, 526L, 669L, 70L, > 720L, > 154L, 619L, 432L, 291L, 1101L, 215L, 176L, 52L, 98L) > , NAMES = NULL > , elementType = "integer" > , elementMetadata = NULL > , metadata = list() > ) > , strand = new("Rle" > , values = structure(3L, .Label = c("+", "-", "*"), class = "factor") > , lengths = 20L > , elementMetadata = NULL > , metadata = list() > ) > , elementMetadata = new("DataFrame" > , rownames = NULL > , nrows = 20L > , listData = structure(list(), .Names = character(0)) > , elementType = "ANY" > , elementMetadata = NULL > , metadata = list() > ) > , seqinfo = new("Seqinfo" > , seqnames = c("chr1", "chr10", "chr11", "chr12", "chr13", "chr14", > "chr15", > "chr16", "chr17", "chr18", "chr19", "chr2", "chr20", "chr21", > "chr22", "chr3", "chr4", "chr5", "chr6", "chr7", "chr8", "chr9", > "chrX", "chrY") > , seqlengths = c(NA_integer_, NA_integer_, NA_integer_, NA_integer_, > NA_integer_, > NA_integer_, NA_integer_, NA_integer_, NA_integer_, NA_integer_, > NA_integer_, NA_integer_, NA_integer_, NA_integer_, NA_integer_, > NA_integer_, NA_integer_, NA_integer_, NA_integer_, NA_integer_, > NA_integer_, NA_integer_, NA_integer_, NA_integer_) > , is_circular = c(NA, NA, NA, NA, NA, NA, NA, NA, NA, NA, NA, NA, NA, > NA, NA, > NA, NA, NA, NA, NA, NA, NA, NA, NA) > , genome = c(NA_character_, NA_character_, NA_character_, > NA_character_, > NA_character_, NA_character_, NA_character_, NA_character_, NA_character_, > NA_character_, NA_character_, NA_character_, NA_character_, NA_character_, > NA_character_, NA_character_, NA_character_, NA_characte > > -- > Computational Biology / Fred Hutchinson Cancer Research Center > 1100 Fairview Ave. N. > PO Box 19024 Seattle, WA 98109 > > Location: Arnold Building M1 B861 > Phone: (206) 667-2793 > > ______________________________**_________________ > Bioconductor mailing list > Bioconductor@r-project.org > https://stat.ethz.ch/mailman/**listinfo/bioconductor<https: stat.et="" hz.ch="" mailman="" listinfo="" bioconductor=""> > Search the archives: http://news.gmane.org/gmane.** > science.biology.informatics.**conductor<http: news.gmane.org="" gmane.="" science.biology.informatics.conductor=""> > -- Steve Lianoglou Graduate Student: Computational Systems Biology | Memorial Sloan-Kettering Cancer Center | Weill Medical College of Cornell University Contact Info: http://cbio.mskcc.org/~lianos/contact [[alternative HTML version deleted]]

ADD REPLY • link 13.1 years ago Steve Lianoglou ★ 13k

Login before adding your answer.