Easy way to convert CharacterList to character, collapsing each element?

Ryan C. Thompson ★ 7.9k

Hi all,

I have some annotation data in a DataFrame, and of course since annotations are not one-to-one, some of the columns are CharacterList or similar classes. I would like to know if there is an efficient way to collapse a CharacterList to a character vector of the same length, such that for elements of length > 1, those elements are collapsed with a given separator. The following is what I came up with, but it is very slow for large CharacterLists:

    library(stringr)
    library(plyr)
    flatten.CharacterList <- function(x, sep=",") {
      if (is.list(x)) {
        x[!is.na(x)] <- laply(x[!is.na(x)], str_c, collapse=sep,
                              .parallel=TRUE)
        x <- as(x, "character")
      }
      x
    }

-Ryan
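As a point of reference, the operation being asked for can be written in a few lines of base R. This is only a minimal sketch: it assumes the CharacterList has first been converted to an ordinary list of character vectors (for example with as.list()), the function name collapse_elements is hypothetical, and it loops over the elements with vapply(), so it is a correctness baseline rather than a fast solution.

```r
# Baseline sketch: collapse each element of a plain list of character
# vectors into a single string. 'collapse_elements' is a hypothetical name.
collapse_elements <- function(x, sep = ",") {
  stopifnot(is.list(x), is.character(sep), length(sep) == 1L)
  # paste(character(0), collapse = sep) gives "", so empty elements
  # become empty strings and the result keeps the length of 'x'.
  vapply(x, paste, character(1), collapse = sep)
}

# Illustrative data, written as a plain list instead of a CharacterList.
x <- list(A = c("id35", "id2", "id18"), B = character(0),
          C = "id4", D = c("id2", "id4"))
res <- collapse_elements(x)
print(res)
```

Any faster implementation should return the same strings as this baseline, which makes it useful for checking results on a small subset of the data.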

Written 10.4 years ago by Ryan C. Thompson • updated 10.4 years ago by Hervé Pagès

Hervé Pagès 16k

Hi Ryan,

Here is one way to do this using Biostrings:

    library(Biostrings)

    strunsplit <- function(x, sep=",")
    {
      if (!is(x, "XStringSetList"))
        x <- Biostrings:::XStringSetList("B", x)
      if (!isSingleString(sep))
        stop("'sep' must be a single character string")

      ## unlist twice.
      unlisted_x <- unlist(x, use.names=FALSE)
      unlisted_ans0 <- unlist(unlisted_x, use.names=FALSE)

      ## insert 'sep'.
      unlisted_x_width <- width(unlisted_x)
      x_partitioning <- PartitioningByEnd(x)
      at <- cumsum(unlisted_x_width)[-end(x_partitioning)] + 1L
      unlisted_ans <- replaceAt(unlisted_ans0, at, value=sep)

      ## relist.
      ans_width <- sum(relist(unlisted_x_width, x_partitioning))
      x_eltlens <- width(x_partitioning)
      idx <- which(x_eltlens >= 2L)
      ans_width[idx] <- ans_width[idx] + (x_eltlens[idx] - 1L) * nchar(sep)
      relist(unlisted_ans, PartitioningByWidth(ans_width))
    }

Then:

    > x <- CharacterList(A=c("id35", "id2", "id18"), B=NULL, C="id4",
                         D=c("id2", "id4"))
    > strunsplit(x)
      A BStringSet instance of length 4
        width seq                names
    [1]    13 id35,id2,id18      A
    [2]     0                    B
    [3]     3 id4                C
    [4]     7 id2,id4            D

I'll add this to Biostrings.

Cheers,
H.

On 12/16/2013 03:04 PM, Ryan C. Thompson wrote:
> Hi all,
>
> I have some annotation data in a DataFrame, and of course since
> annotations are not one-to-one, some of the columns are CharacterList or
> similar classes.
> [...]

--
Hervé Pagès

Program in Computational Biology
Division of Public Health Sciences
Fred Hutchinson Cancer Research Center
1100 Fairview Ave. N, M1-B514
P.O. Box 19024
Seattle, WA 98109-1024

E-mail: hpages at fhcrc.org
Phone: (206) 667-5791
Fax: (206) 667-1319
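The position and width bookkeeping inside strunsplit() can be traced by hand on the example data. The following is a base R sketch of that arithmetic only, not of the Biostrings implementation: it uses a plain list in place of the CharacterList, cumsum(lengths(x)) in place of end(PartitioningByEnd(x)), and variable names that are chosen here for illustration.

```r
# Trace the bookkeeping of the unlist/relist approach with base R only.
x <- list(A = c("id35", "id2", "id18"), B = character(0),
          C = "id4", D = c("id2", "id4"))
sep <- ","

elt_lens   <- lengths(x)                   # strings per element: 3 0 1 2
flat       <- unlist(x, use.names = FALSE) # all strings in one vector
flat_width <- nchar(flat)                  # 4 3 4 3 3 3
part_end   <- cumsum(elt_lens)             # last string of each element: 3 3 4 6

# Separator insertion points in the fully concatenated string: one right
# after every string that is not the last string of its element.
at <- cumsum(flat_width)[-part_end] + 1L

# Width of each collapsed element: its characters plus its separators.
chars_per_elt <- vapply(x, function(e) sum(nchar(e)), integer(1))
ans_width <- chars_per_elt + pmax(elt_lens - 1L, 0L) * nchar(sep)

print(at)
print(ans_width)
```

The resulting widths agree with the width column of the BStringSet shown in the example output, and the three insertion points correspond to the three commas in it.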

Answer by Hervé Pagès, 10.4 years ago

Forgot to say that the solution below only works with BioC-devel.

H.

On 12/16/2013 04:16 PM, Hervé Pagès wrote:
> Hi Ryan,
>
> Here is one way to do this using Biostrings:
> [...]

Reply by Hervé Pagès, 10.4 years ago

Thanks! I look forward to seeing this in the next release.

On 12/16/2013 04:16 PM, Hervé Pagès wrote:
> Hi Ryan,
>
> Here is one way to do this using Biostrings:
> [...]
>
> I'll add this to Biostrings.

Reply by Ryan C. Thompson, 10.4 years ago

There is a function in rtracklayer called pasteCollapse. It is hidden behind the namespace, but it does exactly what you want: just use ":::" to reach it, i.e. rtracklayer:::pasteCollapse. It is implemented in C for speed, and is arguably simpler than the R one suggested in this thread. It just yields a character vector, not a Biostrings container, so maybe it could be pushed down into IRanges?

Michael

On Mon, Dec 16, 2013 at 4:21 PM, Ryan C. Thompson wrote:
> Thanks! I look forward to seeing this in the next release.

Reply by Michael Lawrence, 10.4 years ago

Hi Michael,

On 12/16/2013 05:15 PM, Michael Lawrence wrote:
> There is a function in rtracklayer called pasteCollapse. It is hidden
> behind the namespace but it does exactly what you want. Just use ":::".
> Implemented in C for speed, and arguably simpler than the R one
> suggested in this thread. It just yields a character vector, not a
> Biostrings container, so maybe it could be pushed down into IRanges?

Or we could make strunsplit() a generic function and have 2 methods:

- One for CharacterList objects that returns a character vector. Would be in IRanges and would use the pasteCollapse C code (after we move it to IRanges).

- One for XStringSetList objects that returns an XStringSet object. Would be in Biostrings. With the implementation I gave earlier (based on the unlist/relist trick) it's almost as fast as pasteCollapse but it would be easy to implement it in C to make it even faster.

The mapping between the input and output types of strunsplit() is the same as with unlist() or [[.

H.
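The two-method design proposed above can be sketched with base R's S4 system. This is an illustration of the dispatch pattern only, under stated assumptions: the generic name collapseEach and the classes StringSetListToy and StringSetToy are hypothetical stand-ins (for strunsplit(), XStringSetList and XStringSet respectively), and the list method uses a plain vapply() loop rather than the pasteCollapse C code.

```r
library(methods)

# Hypothetical generic; the name is chosen so it cannot be mistaken for
# a function from IRanges or Biostrings.
setGeneric("collapseEach",
           function(x, sep = ",") standardGeneric("collapseEach"))

# Method 1: a plain list of character vectors gives a character vector.
setMethod("collapseEach", "list", function(x, sep = ",") {
  vapply(x, paste, character(1), collapse = sep)
})

# Stand-in container classes, playing the roles of XStringSetList
# (input) and XStringSet (output) in the proposal.
setClass("StringSetListToy", representation(elements = "list"))
setClass("StringSetToy", representation(strings = "character"))

# Method 2: the container input gives back the matching container type.
setMethod("collapseEach", "StringSetListToy", function(x, sep = ",") {
  new("StringSetToy", strings = collapseEach(x@elements, sep = sep))
})

plain <- collapseEach(list(a = c("x", "y"), b = character(0)))
boxed <- collapseEach(new("StringSetListToy",
                          elements = list(c("p", "q"), "r")), sep = ";")
print(plain)
print(boxed@strings)
```

As with unlist() or [[, the output type follows from the input type: the list method returns a bare character vector, while the container method returns an object one nesting level shallower than its input.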

Reply by Hervé Pagès, 10.4 years ago

The generic is a good idea.

On Mon, Dec 16, 2013 at 10:35 PM, Hervé Pagès wrote:
> [...]
> Or we could make strunsplit() a generic function and have 2
> methods:
> [...]

Reply by Michael Lawrence, 10.4 years ago
