extracting regions of consecutive values from dataframe


Niels Høgslund ▴ 20


Hi,

I have a lot of data frames looking like this (SNP chromosome position and a local state ID):

       Position State
    1   3088998     0
    2   4215064     6
    3   5034491     6
    4   5211912     6
    5   5697261     6
    6   5809727     0
    7   6818872    NA
    8   6867391     0
    9   7346904     1
    10  7347824     1
    11  7358232     1
    12  7833686     1
    13  8295795     0
    14 10755448     0
    15 10919778    NA
    16 11217061     3
    17 12463350     3
    18 13678626     0
    19 13892992     0
    20 13965452     0
    21 13969222     0
    ........

Now I want to collapse or summarize consecutive occurrences of a state into a region with a start and end position, i.e. something like this:

       Position State
    2   4215064     6
    5   5697261     6
    9   7346904     1
    12  7833686     1
    16 11217061     3
    17 12463350     3

Can anyone help me with this?

Thanks in advance,

Niels Høgslund
BiRC - Bioinformatics Research Center
Høegh-Guldbergs Gade 10
DK-8000 Århus C
Denmark
phone: +45 89423100
mail: nj at birc.au.dk


ADD COMMENT • link updated 17.5 years ago by Hervé Pagès 16k • written 17.5 years ago by Niels Høgslund ▴ 20


Sean Davis 21k


On Fri, May 30, 2008 at 6:35 AM, Niels Høgslund <nj at birc.au.dk> wrote:

> [...]
> Now, I want to collapse or summarize consecutive occurrences of a state into
> a region with a start+end position,
> [...]
> Can anyone help me with this?

The rle() function is one way to do this. You will need to write a little wrapper function to do exactly what you want, but rle() should get you going.

Sean
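One possible shape for such a wrapper is sketched below. It is only an illustration under stated assumptions: the function name `runs_to_regions` and its `drop_states` argument are hypothetical, the data frame is assumed to be sorted by `Position`, and `State` is assumed to be a plain numeric or character vector (rle() does not accept factors). The run lengths returned by rle() are turned into row indices with cumsum(), and runs that are NA or belong to an unwanted state (0 by default, matching the desired output in the question) are dropped.

```r
# Hypothetical wrapper around rle(); name and arguments are illustrative only.
# Assumes df is sorted by Position and has columns Position and State.
runs_to_regions <- function(df, drop_states = 0) {
    r <- rle(df$State)
    end_idx <- cumsum(r$lengths)           # row index of the last row of each run
    start_idx <- end_idx - r$lengths + 1   # row index of the first row of each run
    regions <- data.frame(
        start = df$Position[start_idx],
        end   = df$Position[end_idx],
        state = r$values
    )
    # rle() places every NA in a run of its own; drop those and the unwanted states
    keep <- !is.na(regions$state) & !(regions$state %in% drop_states)
    regions <- regions[keep, , drop = FALSE]
    rownames(regions) <- NULL
    regions
}

# The example data from the question
snp <- data.frame(
    Position = c(3088998, 4215064, 5034491, 5211912, 5697261, 5809727,
                 6818872, 6867391, 7346904, 7347824, 7358232, 7833686,
                 8295795, 10755448, 10919778, 11217061, 12463350,
                 13678626, 13892992, 13965452, 13969222),
    State = c(0, 6, 6, 6, 6, 0, NA, 0, 1, 1, 1, 1, 0, 0, NA, 3, 3, 0, 0, 0, 0)
)
regions <- runs_to_regions(snp)
```

On the example data this should yield three regions, one each for states 6, 1 and 3, with one row per region rather than separate start and end rows. Passing `drop_states = NULL` keeps the state-0 runs as well.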

ADD COMMENT • link 17.5 years ago Sean Davis 21k


On Fri, 30 May 2008, Sean Davis wrote:

> On Fri, May 30, 2008 at 6:35 AM, Niels Høgslund <nj at birc.au.dk> wrote:
> > [...]
> > Can anyone help me with this?
>
> The rle() function is one way to do this. You will need to write a
> little wrapper function to do exactly what you want, but rle() should
> get you going.
>
> Sean

Indeed. It seems that the combination of rle and split will do the job, but split reorders the data, so we have a subroutine split.preserveord in s2reg below. The following dputs the example and the code, with the print of the result. There must be a better way.

    > dput(y)
    structure(list(Position = c(3088998L, 4215064L, 5034491L, 5211912L,
    5697261L, 5809727L, 6867391L, 7346904L, 7347824L, 7358232L, 7833686L,
    8295795L, 10755448L, 11217061L, 12463350L, 13678626L, 13892992L,
    13965452L, 13969222L), State = c(0L, 6L, 6L, 6L, 6L, 0L, 0L, 1L,
    1L, 1L, 1L, 0L, 0L, 3L, 3L, 0L, 0L, 0L, 0L)), .Names = c("Position",
    "State"), row.names = c(1L, 2L, 3L, 4L, 5L, 6L, 9L, 10L, 11L, 12L,
    13L, 14L, 15L, 17L, 18L, 19L, 20L, 21L, 22L), class = "data.frame",
    na.action = structure(c(7L, 8L, 16L), .Names = c("7", "8", "16"),
    class = "omit"))

    > dput(s2reg)
    function (df)
    {
        # assumes df[,1] is position, df[,2] is state
        # and contiguous rows sharing value of state are to be grouped
        # first a tweak of split()
        split.preserveord = function(x, disc) {
            tmpn <- unique(disc)
            ansscr <- split(x, disc)
            ord <- match(tmpn, names(ansscr))
            ansscr[ord]
        }
        # now get rle of state
        rr = rle(df[, 2])
        # etc.
        tags = make.unique(as.character(rr$value))
        ftags = rep(tags, rr$len)
        sdf = split.preserveord(df, ftags)
        ra = sapply(sdf, function(x) range(x[, 1]))
        rownames(ra) = c("start", "end")
        cbind(t(ra), state = rr$val)
    }

    > s2reg(y)
           start      end state
    0    3088998  3088998     0
    6    4215064  5697261     6
    0.1  5809727  6867391     0
    1    7346904  7833686     1
    0.2  8295795 10755448     0
    3   11217061 12463350     3
    0.3 13678626 13969222     0

> _______________________________________________
> Bioconductor mailing list
> Bioconductor at stat.math.ethz.ch
> https://stat.ethz.ch/mailman/listinfo/bioconductor
> Search the archives: http://news.gmane.org/gmane.science.biology.informatics.conductor
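The reordering mentioned above arises because split() converts its grouping vector to a factor, and factor levels are sorted by default. One alternative to a helper like split.preserveord, sketched here on toy data (the values of `pos` and `tags` are made up for illustration), is to pass a factor whose levels are already in order of first appearance:

```r
# Toy data, made up for illustration
pos  <- c(10, 20, 30, 40, 50)
tags <- c("6", "6", "0", "1", "1")

# Default behaviour: groups come back in sorted level order
default_split <- split(pos, tags)
names(default_split)

# Levels fixed to the order of first appearance: groups keep the data order
by_appearance <- split(pos, factor(tags, levels = unique(tags)))
names(by_appearance)
```

The same idea should carry over to splitting a data frame, since split() on a data frame also orders the result by the levels of the grouping factor.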

ADD REPLY • link 17.5 years ago Vincent J. Carey, Jr. 6.7k


Hervé Pagès 16k


Hi Niels,

You can do this:

    df0 <- data.frame(
        Position=c(2, 5, 8, 9, 15, 17, 20, 21, 24, 25),
        State=as.character(c(0, 6, 6, 6, 1, 1, 0, 3, 3, 2))
    )
    x <- split(df0$Position, df0$State)
    df1 <- data.frame(start=sapply(x, min), end=sapply(x, max), State=names(x))

Now 'df1' contains one row per state with the 'start' and 'end' positions for this state:

    > df1
      start end State
    0     2  20     0
    1    15  17     1
    2    25  25     2
    3    21  24     3
    6     5   9     6

Note that state 0 seems to be special in your data because the positions at which it occurs are interlaced with the positions at which other states occur.

Cheers,
H.

Niels Høgslund wrote:

> [...]
> Can anyone help me with this?
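Because splitting on the state alone merges every occurrence of a state into one row, a variant that treats each consecutive run separately may be closer to what the question asks for. The sketch below reuses the same toy `df0`; the names `run_id` and `df2` are hypothetical, and it assumes `State` contains no NA values and that rows are sorted by `Position`. A run identifier is built by counting the places where the state changes, and the split is done on that identifier instead of on the state.

```r
df0 <- data.frame(
    Position = c(2, 5, 8, 9, 15, 17, 20, 21, 24, 25),
    State = as.character(c(0, 6, 6, 6, 1, 1, 0, 3, 3, 2))
)

n <- nrow(df0)
# TRUE at the first row and wherever the state differs from the previous row;
# the cumulative sum then numbers the runs 1, 2, 3, ...
run_id <- cumsum(c(TRUE, df0$State[-1] != df0$State[-n]))

x <- split(df0$Position, run_id)
df2 <- data.frame(
    start = sapply(x, min),
    end   = sapply(x, max),
    State = df0$State[!duplicated(run_id)]   # state of the first row of each run
)
```

Here state 0 appears in two separate rows of `df2` (positions 2 to 2 and 20 to 20) instead of one row spanning 2 to 20. Rows for an uninteresting state can then be filtered out, for example with `df2[df2$State != "0", ]`.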

ADD COMMENT • link 17.5 years ago Hervé Pagès 16k
