GRanges - coercion from dataframe
Fahim Md ▴ 250
@fahim-md-4018
Hi,

I generated a data frame from a GRanges object with:

    data = as.data.frame(grObj, row.names = NULL)  # grObj is a GRanges object with 'seqnames' as one of its fields
    save(data, file = ..)

Now I want to read this data file back in as a GRanges object.

Is there any built-in method to do that? (The 'as' method for GRanges is currently not able to do it.)

Alternatively, I tried to convert the data frame into a RangedData first and then into a GRanges. But the conversion into RangedData treated 'seqnames' as metadata and by default inserted a 'space' with value '1' next to the IRanges. This causes a problem for the GRanges coercion: 'seqnames' now sits in the metadata, where it is a reserved column name, and so the coercion produces an error.

    > library(IRanges)
    > library(GenomicRanges)
    > load('/home/fahim/hg19/GeneName.RData')
    > head(data)
      seqnames start   end width strand         name
    1     Chr1 10954 11507   554      + LOC100506145
    2     Chr1 12190 13639  1450      + LOC100652771
    3     Chr1 14362 29370 15009      -       WASH7P
    4     Chr1 30366 30503   138      +    MIR1302-2
    5     Chr1 34611 36081  1471      -      FAM138A
    6     Chr1 52453 53396   944      +       OR4G4P
    > rd1 = as(data, "RangedData")
    > rd1
    RangedData with 36590 rows and 3 value columns across 1 space
             space               ranges | seqnames   strand         name
          <factor>            <iranges> | <factor> <factor>     <factor>
    1            1 [   10954,    11507] |     Chr1        + LOC100506145
    2            1 [   12190,    13639] |     Chr1        + LOC100652771
    3            1 [   14362,    29370] |     Chr1        -       WASH7P
    4            1 [   30366,    30503] |     Chr1        +    MIR1302-2
    5            1 [   34611,    36081] |     Chr1        -      FAM138A
    6            1 [   52453,    53396] |     Chr1        +       OR4G4P
    7            1 [   63016,    63885] |     Chr1        +      OR4G11P
    8            1 [   69091,    70008] |     Chr1        +        OR4F5
    9            1 [  131125,   135677] |     Chr1        + LOC100420257
    ...        ...                  ... |      ...      ...          ...
    36582        1 [59100457, 59115123] |     ChrY        +        SPRY3
    36583        1 [59160762, 59162330] |     ChrY        -        AMDP1
    36584        1 [59213949, 59276439] |     ChrY        +        VAMP7
    36585        1 [59311663, 59311996] |     ChrY        -     TCEB1P24
    36586        1 [59318017, 59318918] |     ChrY        -       TRPC6P
    36587        1 [59330252, 59343488] |     ChrY        +         IL9R
    36588        1 [59354329, 59358343] |     ChrY        +       WASH6P
    36589        1 [59358332, 59360854] |     ChrY        -     DDX11L16
    36590        1 [59361222, 59361778] |     ChrY        - LOC100507426
    > rd2 = as(rd1, "GRanges")
    Error in validObject(.Object) :
      invalid class "GRanges" object: slot 'elementMetadata' cannot use "seqnames", "ranges", "strand", "seqlevels", "seqlengths", "isCircular", "genome", "start", "end", "width", "element" as column names
convert IRanges
Fahim Md ▴ 250
@fahim-md-4018
A brute-force method to do the same:

    load('/home/fahim/rugit/rangeData/hg19/GeneName.RData')
    head(data)
    fldslen = length(names(data))  # how many fields
    gr = GRanges(seqnames = Rle(data$seqnames),
                 ranges = IRanges(data$start, data$end),
                 strand = Rle(as.character(data$strand)),
                 name = data$name)
    if (fldslen > 6)
    {
      # columns beyond the six standard ones are carried over as extra metadata
      restfldnames = setdiff(names(data), c('seqnames', 'start', 'end', 'strand', 'width', 'name'))
      datarest = DataFrame(data[, restfldnames, drop = FALSE])
      elementMetadata(gr) = cbind(elementMetadata(gr), datarest)
    }

2011/11/26 Fahim Mohammad <fahim.md@gmail.com>
> Hi
>
> I generated a data frame from GRanges object by performing
> [...]

It's difficult in general to have a coercion method from a less structured data type (like a data frame) to one more structured/constrained (like a GRanges). There's just no conventions as to how data is stored in the data frame. The RangedData coercion is not biology-aware, so it is not going to interact well with something output from GenomicRanges. I would recommend the approach you took. It can be made a little cleaner via with().

I've been messing around with an import.trackTable in rtracklayer. It allows the client to specify the positions of the seqnames, start and end in a table. Currently it is internal though, in support of import of generic tabix data.

Michael

On Sat, Nov 26, 2011 at 1:40 PM, Fahim Mohammad <fahim.md@gmail.com> wrote:
> Brute force method to do the same.
> [...]
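The with() suggestion can be sketched roughly as follows. This is only an illustration, not code from the thread: the toy data frame copies three rows from the head(data) output in the question, and it assumes the columns are named seqnames, start, end, strand and name, as as.data.frame() on a GRanges produces them.

```r
library(GenomicRanges)

# Toy data frame in the layout produced by as.data.frame() on a GRanges
# (values copied from the head(data) output in the question).
data <- data.frame(
  seqnames = c("Chr1", "Chr1", "Chr1"),
  start    = c(10954, 12190, 14362),
  end      = c(11507, 13639, 29370),
  width    = c(554, 1450, 15009),
  strand   = c("+", "+", "-"),
  name     = c("LOC100506145", "LOC100652771", "WASH7P"),
  stringsAsFactors = FALSE
)

# with() evaluates the constructor call inside the data frame, so the
# columns can be referenced without the repeated data$ prefix.
# The width column is not used: start and end already determine it.
gr <- with(data, GRanges(seqnames = seqnames,
                         ranges   = IRanges(start, end),
                         strand   = strand,
                         name     = name))
gr
```

Any extra argument to GRanges(), such as name here, ends up as a metadata column, which matches what the brute-force version does with data$name.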
Hi,

On Sat, Nov 26, 2011 at 6:56 PM, Michael Lawrence <lawrence.michael at gene.com> wrote:
> It's difficult in general to have a coercion method from a less structured
> data type (like a data frame) to one more structured/constrained (like a
> GRanges). There's just no conventions as to how data is stored in the data
> frame. The RangedData coercion is not biology-aware, so it is not going to
> interact well with something output from GenomicRanges. I would recommend
> the approach you took. It can be made a little cleaner via with().

While all of what Michael says is true, sometimes you just want to shoot from GRanges <--> data.frame and back again rather easily.

I've defined my own setAs methods for this purpose which you can use here:

https://github.com/lianos/seqtools/blob/master/R/pkg/R/conversions.R

There is really no error checking going on between the conversions, but if you have columns of "the right" names in a data.frame, it'll convert your data.frame to a GRanges object with no fuss.

Note that if your data.frame has a start, end, and width column it will ignore the width and use start/end. All other "non-GRanges" columns will be converted into a DataFrame object and stuffed into the values() slot of the GRanges object returned.

The usual "use at your own risk" disclaimer applies ...

Enjoy,
-steve

--
Steve Lianoglou
Graduate Student: Computational Systems Biology
 | Memorial Sloan-Kettering Cancer Center
 | Weill Medical College of Cornell University
Contact Info: http://cbio.mskcc.org/~lianos/contact
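The behaviour described above (use start/end, ignore width, push everything else into the metadata) can be sketched as below. This is not Steve's code from the linked file: df_to_granges is a hypothetical name, it is written as a plain function rather than a setAs method so it cannot clash with any coercion methods an installed package may already define, and it assumes the sequence column is called seqnames.

```r
library(GenomicRanges)

# Hypothetical helper (not the code from the linked repository): rebuild a
# GRanges from a data frame with seqnames/start/end and optional strand/width.
df_to_granges <- function(df) {
  core <- c("seqnames", "start", "end", "width", "strand")
  stopifnot(all(c("seqnames", "start", "end") %in% names(df)))
  # A missing strand column is treated as unstranded ("*").
  strand <- if ("strand" %in% names(df)) as.character(df$strand) else rep("*", nrow(df))
  # width is ignored: start and end already determine it.
  gr <- GRanges(seqnames = df$seqnames,
                ranges   = IRanges(start = df$start, end = df$end),
                strand   = strand)
  # Every remaining column becomes a metadata column.
  extra <- setdiff(names(df), core)
  if (length(extra) > 0)
    mcols(gr) <- DataFrame(df[, extra, drop = FALSE])
  gr
}

# Round trip: GRanges -> data.frame -> GRanges
gr0 <- GRanges(seqnames = c("Chr1", "Chr1", "ChrY"),
               ranges   = IRanges(start = c(10954, 14362, 59100457),
                                  end   = c(11507, 29370, 59115123)),
               strand   = c("+", "-", "+"),
               name     = c("LOC100506145", "WASH7P", "SPRY3"))
df  <- as.data.frame(gr0, row.names = NULL)
gr1 <- df_to_granges(df)
gr1
```

Like the original, the sketch does essentially no validation, so the same "use at your own risk" caveat applies; sequence lengths and other seqinfo are not carried through the data frame.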
For completeness, here is my data.frame2GRanges function. I only ever translate from data frames to GRanges and I almost never bother to keep anything but the location (no metadata). I also have an option for taking a stranded data.frame and turning it into an unstranded GRanges.

I discussed conversion functions like this on the list with Martin Morgan a long time ago and he thought it would be better to leave them out of GRanges. Having said that, the data.frame2GRanges function below (with its default options) is flat out one of the most used functions in my R scripts. It is amazing how much I use this simple function.

keepColumns indicates whether additional data frame columns should be put into the GRanges. ignoreStrand makes the strand of the constructed GRanges equal to *. It assumes the input data.frame has columns chr/seqnames, start, end.

    library(GenomicRanges)

    data.frame2GRanges <- function(df, keepColumns = FALSE, ignoreStrand = FALSE) {
        stopifnot(class(df) == "data.frame")
        stopifnot(all(c("start", "end") %in% names(df)))
        stopifnot(any(c("chr", "seqnames") %in% names(df)))
        # accept either name for the sequence column; use "chr" internally
        if("seqnames" %in% names(df))
            names(df)[names(df) == "seqnames"] <- "chr"
        if(!ignoreStrand && "strand" %in% names(df)) {
            # numeric strands: 1 becomes "+", -1 becomes "-", anything else "*"
            if(is.numeric(df$strand)) {
                strand <- ifelse(df$strand == 1, "+", "*")
                strand[df$strand == -1] <- "-"
                df$strand <- strand
            }
            gr <- GRanges(seqnames = df$chr,
                          ranges = IRanges(start = df$start, end = df$end),
                          strand = df$strand)
        } else {
            gr <- GRanges(seqnames = df$chr,
                          ranges = IRanges(start = df$start, end = df$end))
        }
        if(keepColumns) {
            # drop = FALSE keeps a data frame even when only one column is left
            dt <- as(df[, setdiff(names(df), c("chr", "start", "end", "strand")), drop = FALSE],
                     "DataFrame")
            elementMetadata(gr) <- dt
        }
        names(gr) <- rownames(df)
        gr
    }

Kasper

On Mon, Nov 28, 2011 at 1:10 AM, Steve Lianoglou <mailinglist.honeypot at gmail.com> wrote:
> Hi,
> [...]
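For readers arriving later: GenomicRanges releases made after this thread ship a built-in makeGRangesFromDataFrame(), which covers much the same ground as the hand-written converters above. The sketch below assumes such a release is installed; the toy data frame is again copied from the head(data) output in the question.

```r
library(GenomicRanges)

# Toy data frame in the layout produced by as.data.frame() on a GRanges.
data <- data.frame(
  seqnames = c("Chr1", "Chr1", "Chr1"),
  start    = c(10954, 12190, 14362),
  end      = c(11507, 13639, 29370),
  width    = c(554, 1450, 15009),
  strand   = c("+", "+", "-"),
  name     = c("LOC100506145", "LOC100652771", "WASH7P"),
  stringsAsFactors = FALSE
)

# keep.extra.columns = TRUE carries columns that are not part of the
# genomic range (here: name) over as metadata columns.
gr <- makeGRangesFromDataFrame(data, keep.extra.columns = TRUE)

# ignore.strand = TRUE discards the strand column and sets every range to "*".
gr_unstranded <- makeGRangesFromDataFrame(data, ignore.strand = TRUE)

gr
gr_unstranded
```

The two arguments play the same roles as keepColumns and ignoreStrand in data.frame2GRanges; unlike that function, the built-in does not copy the data frame's row names onto the result in this default call.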