Easy way to convert CharacterList to character, collapsing each element?
@ryan-c-thompson-5618
Last seen 9 months ago
Scripps Research, La Jolla, CA
Hi all,

I have some annotation data in a DataFrame, and of course since annotations are not one-to-one, some of the columns are CharacterList or similar classes. I would like to know if there is an efficient way to collapse a CharacterList to a character vector of the same length, such that for elements of length > 1, those elements are collapsed with a given separator. The following is what I came up with, but it is very slow for large CharacterLists:

    library(stringr)
    library(plyr)
    flatten.CharacterList <- function(x, sep=",") {
      if (is.list(x)) {
        x[!is.na(x)] <- laply(x[!is.na(x)], str_c, collapse=sep,
                              .parallel=TRUE)
        x <- as(x, "character")
      }
      x
    }

-Ryan
@herve-pages-1542
Last seen 3 days ago
Seattle, WA, United States
Hi Ryan,

Here is one way to do this using Biostrings:

    library(Biostrings)

    strunsplit <- function(x, sep=",")
    {
        if (!is(x, "XStringSetList"))
            x <- Biostrings:::XStringSetList("B", x)
        if (!isSingleString(sep))
            stop("'sep' must be a single character string")

        ## unlist twice.
        unlisted_x <- unlist(x, use.names=FALSE)
        unlisted_ans0 <- unlist(unlisted_x, use.names=FALSE)

        ## insert 'sep'.
        unlisted_x_width <- width(unlisted_x)
        x_partitioning <- PartitioningByEnd(x)
        at <- cumsum(unlisted_x_width)[-end(x_partitioning)] + 1L
        unlisted_ans <- replaceAt(unlisted_ans0, at, value=sep)

        ## relist.
        ans_width <- sum(relist(unlisted_x_width, x_partitioning))
        x_eltlens <- width(x_partitioning)
        idx <- which(x_eltlens >= 2L)
        ans_width[idx] <- ans_width[idx] + (x_eltlens[idx] - 1L) * nchar(sep)
        relist(unlisted_ans, PartitioningByWidth(ans_width))
    }

Then:

    > x <- CharacterList(A=c("id35", "id2", "id18"), B=NULL, C="id4",
                         D=c("id2", "id4"))
    > strunsplit(x)
      A BStringSet instance of length 4
        width seq              names
    [1]    13 id35,id2,id18    A
    [2]     0                  B
    [3]     3 id4              C
    [4]     7 id2,id4          D

I'll add this to Biostrings.

Cheers,
H.

On 12/16/2013 03:04 PM, Ryan C. Thompson wrote:
> Hi all,
>
> I have some annotation data in a DataFrame, and of course since
> annotations are not one-to-one, some of the columns are CharacterList or
> similar classes. [...]
>
> _______________________________________________
> Bioconductor mailing list
> Bioconductor at r-project.org
> https://stat.ethz.ch/mailman/listinfo/bioconductor
> Search the archives:
> http://news.gmane.org/gmane.science.biology.informatics.conductor

--
Hervé Pagès

Program in Computational Biology
Division of Public Health Sciences
Fred Hutchinson Cancer Research Center
1100 Fairview Ave. N, M1-B514
P.O. Box 19024
Seattle, WA 98109-1024

E-mail: hpages at fhcrc.org
Phone: (206) 667-5791
Fax: (206) 667-1319
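For readers without BioC-devel at hand, the unlist / insert separator / relist idea can be tried out in base R. What follows is only a rough sketch, not the Biostrings implementation: collapse_by_group is a made-up name, x is a plain list of character vectors standing in for a CharacterList, and the "relist" step is done with substring() on one big string instead of PartitioningByWidth.

```r
## Base-R sketch of the unlist / insert 'sep' / relist idea.
## 'collapse_by_group' is a hypothetical helper, not part of any package.
collapse_by_group <- function(x, sep = ",") {
    stopifnot(is.character(sep), length(sep) == 1L, !is.na(sep))
    if (length(x) == 0L)
        return(character(0))
    eltlens <- lengths(x)                      # number of strings per element
    ## unlist once: one flat character vector (NA would become "NA" below).
    flat <- as.character(unlist(x, use.names = FALSE))
    ## insert 'sep': every string except the last of its group gets a
    ## trailing separator.
    ends <- cumsum(eltlens)
    is_last <- seq_along(flat) %in% ends
    pieces <- paste0(flat, ifelse(is_last, "", sep))
    ## unlist "twice": a single big string holding everything.
    big <- paste(pieces, collapse = "")
    ## relist by width: cut the big string back into one piece per element.
    cum_width <- c(0L, cumsum(nchar(pieces)))
    ans_end <- cum_width[ends + 1L]
    ans_start <- c(0L, ans_end[-length(ans_end)]) + 1L
    ans <- substring(big, ans_start, ans_end)  # start > end yields ""
    names(ans) <- names(x)
    ans
}

x <- list(A = c("id35", "id2", "id18"), B = NULL, C = "id4",
          D = c("id2", "id4"))
res <- collapse_by_group(x)
print(res)
print(nchar(res))   # widths 13, 0, 3, 7, as in the BStringSet shown above
```

The widths follow the same arithmetic as ans_width in the function above: the summed widths of a group plus (group length - 1) * nchar(sep), e.g. 4 + 3 + 4 + 2 = 13 for element A.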
Forgot to say that the solution below only works with BioC-devel.

H.

On 12/16/2013 04:16 PM, Hervé Pagès wrote:
> Hi Ryan,
>
> Here is one way to do this using Biostrings:
> [...]
Thanks! I look forward to seeing this in the next release.

On 12/16/2013 04:16 PM, Hervé Pagès wrote:
> Hi Ryan,
>
> Here is one way to do this using Biostrings:
> [...]
There is a function in rtracklayer called pasteCollapse. It is not exported from the namespace, but it does exactly what you want. Just use ":::". It is implemented in C for speed, and arguably simpler than the R one suggested in this thread. It just yields a character vector, not a Biostrings container, so maybe it could be pushed down into IRanges?

Michael

On Mon, Dec 16, 2013 at 4:21 PM, Ryan C. Thompson <rct@thompsonclan.org> wrote:
> Thanks! I look forward to seeing this in the next release.
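To make "yields a character vector" concrete, here is a plain R reference version that one could use to check the output of any faster implementation. It is only a sketch: paste_collapse_ref is a hypothetical name, it is not the rtracklayer C code, and it assumes a plain list of character vectors (for a CharacterList, calling as.list() on it first should work).

```r
## Slow but simple reference: one paste() call per list element.
## 'paste_collapse_ref' is a made-up name for illustration only.
paste_collapse_ref <- function(x, sep = ",") {
    stopifnot(is.list(x), is.character(sep), length(sep) == 1L)
    ## paste(NULL, collapse = sep) gives "", so empty elements map to "".
    vapply(x, paste, character(1), collapse = sep)
}

x <- list(A = c("id35", "id2", "id18"), B = NULL, C = "id4",
          D = c("id2", "id4"))
ref <- paste_collapse_ref(x)
print(ref)
```

This loops at the R level, so it will be slow on large inputs; its only purpose is to pin down the expected result (a character vector of the same length and names as the input).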
Hi Michael,

On 12/16/2013 05:15 PM, Michael Lawrence wrote:
> There is a function in rtracklayer called pasteCollapse. It is hidden
> behind the namespace but it does exactly what you want. Just use ":::".
> Implemented in C for speed, and arguably simpler than the R one
> suggested in this thread. It just yields a character vector, not a
> Biostrings container, so maybe it could be pushed down into IRanges?

Or we could make strunsplit() a generic function and have 2 methods:

- One for CharacterList objects that returns a character vector. Would be in IRanges and would use the pasteCollapse C code (after we move it to IRanges).

- One for XStringSetList objects that returns an XStringSet object. Would be in Biostrings. With the implementation I gave earlier (based on the unlist/relist trick) it's almost as fast as pasteCollapse but it would be easy to implement it in C to make it even faster.

The mapping between the input and output types of strunsplit() is the same as with unlist() or [[.

H.
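The two-method design can be illustrated without any Bioconductor package. The sketch below is a toy under stated assumptions: strunsplit_sketch, ToySet and ToySetList are invented names, with ToySet / ToySetList merely standing in for XStringSet / XStringSetList, and a plain list standing in for CharacterList. As far as I know, current Bioconductor releases expose this functionality as unstrsplit(); check the documentation of your installed version for the actual interface.

```r
library(methods)

## Hypothetical generic, named so it cannot clash with a real function.
setGeneric("strunsplit_sketch",
           function(x, sep = ",") standardGeneric("strunsplit_sketch"))

## Method 1: plain list of character vectors -> character vector.
setMethod("strunsplit_sketch", "list", function(x, sep = ",") {
    vapply(x, paste, character(1), collapse = sep)
})

## Toy container classes (stand-ins only, not the Biostrings classes).
setClass("ToySet", representation(strings = "character"))
setClass("ToySetList", representation(elts = "list"))

## Method 2: list-of-containers in, container out.
setMethod("strunsplit_sketch", "ToySetList", function(x, sep = ",") {
    collapsed <- vapply(x@elts,
                        function(elt) paste(elt@strings, collapse = sep),
                        character(1))
    new("ToySet", strings = collapsed)
})

x <- list(A = c("id35", "id2", "id18"), B = character(0), C = "id4")
out_chr <- strunsplit_sketch(x)                 # a character vector

y <- new("ToySetList",
         elts = lapply(x, function(s) new("ToySet", strings = s)))
out_set <- strunsplit_sketch(y)                 # a ToySet object
print(out_chr)
print(out_set@strings)
```

Dispatching on the input class keeps the input/output mapping described above: the list method hands back the element type of the list (character), and the container method hands back the container's element class.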
The generic is a good idea.

On Mon, Dec 16, 2013 at 10:35 PM, Hervé Pagès <hpages@fhcrc.org> wrote:
> Or we could make strunsplit() a generic function and have 2
> methods:
> [...]