extracting regions of consecutive values from dataframe
@niels-hgslund-2825
Last seen 11.3 years ago
Hi,

I have a lot of data frames looking like this (SNP chromosome position and a local state ID):

   Position State
1   3088998     0
2   4215064     6
3   5034491     6
4   5211912     6
5   5697261     6
6   5809727     0
7   6818872    NA
8   6867391     0
9   7346904     1
10  7347824     1
11  7358232     1
12  7833686     1
13  8295795     0
14 10755448     0
15 10919778    NA
16 11217061     3
17 12463350     3
18 13678626     0
19 13892992     0
20 13965452     0
21 13969222     0
........

Now, I want to collapse or summarize consecutive occurrences of a state into a region with a start+end position, i.e. something like this:

   Position State
2   4215064     6
5   5697261     6
9   7346904     1
12  7833686     1
16 11217061     3
17 12463350     3

Can anyone help me with this?

Thanks in advance.....

Niels Høgslund
BiRC - Bioinformatics Research Center
Høegh-Guldbergs Gade 10
DK-8000 Århus C
Denmark
phone: +45 89423100
mail: nj at birc.au.dk
@sean-davis-490
Last seen 9 months ago
United States
On Fri, May 30, 2008 at 6:35 AM, Niels Høgslund <nj at birc.au.dk> wrote:
> Can anyone help me with this?

The rle() function is one way to do this. You will need to write a little wrapper function to do exactly what you want, but rle() should get you going.

Sean
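For anyone landing here later, a minimal sketch of the kind of wrapper around rle() suggested above. The names runs_to_regions and snp are made up for illustration, and dropping the NA rows before calling rle() is an assumption, not something the question specifies. With that choice, two runs of the same state separated only by an NA row are merged into one region.

```r
# Hypothetical helper (not from the thread): one output row per run of
# identical State values, with the first and last Position of the run.
runs_to_regions <- function(df) {
  # Assumption: rows with a missing state are simply dropped.
  df <- df[!is.na(df$State), ]
  r <- rle(df$State)
  end_idx <- cumsum(r$lengths)            # row index where each run ends
  start_idx <- end_idx - r$lengths + 1    # row index where each run starts
  data.frame(start = df$Position[start_idx],
             end   = df$Position[end_idx],
             state = r$values)
}

# First 12 rows of the example in the question
snp <- data.frame(
  Position = c(3088998, 4215064, 5034491, 5211912, 5697261, 5809727,
               6818872, 6867391, 7346904, 7347824, 7358232, 7833686),
  State    = c(0, 6, 6, 6, 6, 0, NA, 0, 1, 1, 1, 1)
)
regions <- runs_to_regions(snp)
regions
```

The desired output in the question appears to leave out state 0; if that is the intent, something like regions[regions$state != 0, ] would drop those rows afterwards.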
On Fri, 30 May 2008, Sean Davis wrote:
> The rle() function is one way to do this. You will need to write a
> little wrapper function to do exactly what you want, but rle() should
> get you going.
>
> Sean

Indeed. It seems that the combination of rle and split will do the job -- but split reorders the data, so we have a subroutine split.preserveord in s2reg below. The following dputs the example and the code, with the print of the result. There must be a better way.
> dput(y)
structure(list(Position = c(3088998L, 4215064L, 5034491L, 5211912L,
5697261L, 5809727L, 6867391L, 7346904L, 7347824L, 7358232L, 7833686L,
8295795L, 10755448L, 11217061L, 12463350L, 13678626L, 13892992L,
13965452L, 13969222L), State = c(0L, 6L, 6L, 6L, 6L, 0L, 0L, 1L,
1L, 1L, 1L, 0L, 0L, 3L, 3L, 0L, 0L, 0L, 0L)), .Names = c("Position",
"State"), row.names = c(1L, 2L, 3L, 4L, 5L, 6L, 9L, 10L, 11L,
12L, 13L, 14L, 15L, 17L, 18L, 19L, 20L, 21L, 22L), class = "data.frame",
na.action = structure(c(7L, 8L, 16L), .Names = c("7", "8", "16"),
class = "omit"))
> dput(s2reg)
function (df)
{
    # assumes df[,1] is position, df[,2] is state
    # and contiguous rows sharing value of state are to be grouped
    # first a tweak of split() that keeps groups in order of first appearance
    split.preserveord = function(x, disc) {
        tmpn <- unique(disc)
        ansscr <- split(x, disc)
        ord <- match(tmpn, names(ansscr))
        ansscr[ord]
    }
    # now get rle of state
    rr = rle(df[, 2])
    # give each run a unique tag, then split the rows by run
    tags = make.unique(as.character(rr$values))
    ftags = rep(tags, rr$lengths)
    sdf = split.preserveord(df, ftags)
    # first and last position within each run
    ra = sapply(sdf, function(x) range(x[, 1]))
    rownames(ra) = c("start", "end")
    cbind(t(ra), state = rr$values)
}
> s2reg(y)
       start      end state
0    3088998  3088998     0
6    4215064  5697261     6
0.1  5809727  6867391     0
1    7346904  7833686     1
0.2  8295795 10755448     0
3   11217061 12463350     3
0.3 13678626 13969222     0

> _______________________________________________
> Bioconductor mailing list
> Bioconductor at stat.math.ethz.ch
> https://stat.ethz.ch/mailman/listinfo/bioconductor
> Search the archives: http://news.gmane.org/gmane.science.biology.informatics.conductor
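On the reordering issue: split() orders its groups by factor level, and a character vector is turned into a factor with sorted levels. One possible alternative to a helper like split.preserveord is to build the factor yourself with levels in order of first appearance. The sketch below uses made-up tags and positions (tags, pos and by_run are illustrative names, not part of the code above).

```r
# Made-up run labels in data order, as make.unique() would produce them
tags <- c("0", "6", "6", "0.1", "1", "1", "0.2")
pos  <- c(10, 20, 30, 40, 50, 60, 70)   # made-up positions

# Fixing the levels to first-appearance order keeps split() in data order
by_run <- split(pos, factor(tags, levels = unique(tags)))
names(by_run)

# Start and end position of each run, one column per run
sapply(by_run, range)
```

Whether this is actually tidier than the match() approach is a matter of taste; both rely on the run tags being unique.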
@herve-pages-1542
Last seen 2 days ago
Seattle, WA, United States
Hi Niels,

You can do this:

df0 <- data.frame(
    Position=c(2, 5, 8, 9, 15, 17, 20, 21, 24, 25),
    State=as.character(c(0, 6, 6, 6, 1, 1, 0, 3, 3, 2))
)
x <- split(df0$Position, df0$State)
df1 <- data.frame(start=sapply(x, min), end=sapply(x, max), State=names(x))

Now 'df1' contains one row per state with the 'start' and 'end' positions for this state:

> df1
  start end State
0     2  20     0
1    15  17     1
2    25  25     2
3    21  24     3
6     5   9     6

Note that state 0 seems to be special in your data because the positions at which it occurs are interlaced with the positions at which other states occur.

Cheers,
H.
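If the interlaced occurrences of a state should stay separate regions, one possible variation on the split() idea is to split by a run ID instead of by the state itself. This is only a sketch on the same toy data; run and df2 are names introduced here, and as.character() is applied defensively because rle() does not accept a factor, which is what the State column would be in R versions where data.frame() converts strings to factors.

```r
df0 <- data.frame(
  Position = c(2, 5, 8, 9, 15, 17, 20, 21, 24, 25),
  State    = as.character(c(0, 6, 6, 6, 1, 1, 0, 3, 3, 2))
)

r   <- rle(as.character(df0$State))
run <- rep(seq_along(r$lengths), r$lengths)   # one integer ID per consecutive run
x   <- split(df0$Position, run)               # integer IDs keep the data order

df2 <- data.frame(start = sapply(x, min),
                  end   = sapply(x, max),
                  State = r$values)
df2
```

Here state 0 ends up with two rows (positions 2 to 2 and 20 to 20) rather than a single row spanning 2 to 20.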