Question

Comparing DNAStringSetLists

0

Entering edit mode

Vince S. Buffalo ▴ 140

@vince-s-buffalo-4618

Last seen 9.7 years ago

United States

Hi All, I have two vectors of alleles stored as DNAStringSetLists. For each element in both lists, I need to find the length of the intersecting set. Using mapply() and intersect() take too long, as does sapply(dna.set.list, as.character) (and then using mclapply or lapply to find intersect on characters). Is there a fast way to do this? I have vectors ~12 million rows long. Vince -- Vince Buffalo Ross-Ibarra Lab www.rilab.org) Plant Sciences, UC Davis [[alternative HTML version deleted]]

• 1.1k views

ADD COMMENT • link updated 10.6 years ago by Hervé Pagès 16k • written 10.6 years ago by Vince S. Buffalo ▴ 140

score 0 · Answer 1 · 2013-10-15

On 10/15/2013 04:16 PM, Vince S. Buffalo wrote: > Hi All, > > I have two vectors of alleles stored as DNAStringSetLists. For each element > in both lists, I need to find the length of the intersecting set. Using > mapply() and intersect() take too long, as does sapply(dna.set.list, > as.character) (and then using mclapply or lapply to find intersect on > characters). Is there a fast way to do this? I have vectors ~12 million > rows long. For a couple of hacky solutions, maybe create an index i1 into one of the lists l1 i1 <- rep(seq_along(l1), elementLengths(l1)) then create artificial alleles that are tagged by the element id x1 <- paste0(unlist(l1), i1) x2 <- paste0(unlist(l2), rep(seq_along(l2), elementLengths(l2))) and count how many of x1 are in x2, grouped by i1 tabulate(i1[x1 %in% x2]) This seems to be faster than sum(as(l1, "CharacterList"), %in% as(l2, "CharacterList")) (the x1 or as() could be surrounded by unique() if the elements are not already). Martin > > Vince > -- Computational Biology / Fred Hutchinson Cancer Research Center 1100 Fairview Ave. N. PO Box 19024 Seattle, WA 98109 Location: Arnold Building M1 B861 Phone: (206) 667-2793

score 0 · Answer 2 · 2013-10-15

Hi Vince, On Tue, Oct 15, 2013 at 4:16 PM, Vince S. Buffalo <vsbuffalo at="" gmail.com=""> wrote: > Hi All, > > I have two vectors of alleles stored as DNAStringSetLists. For each element > in both lists, I need to find the length of the intersecting set. Using > mapply() and intersect() take too long, as does sapply(dna.set.list, > as.character) (and then using mclapply or lapply to find intersect on > characters). Is there a fast way to do this? I have vectors ~12 million > rows long. Perhaps data.table can help here, but I'm having a hard time understanding the data you have and the output you want. Could you provide small test-set sized dataset that we can poke at? Thanks, -steve -- Steve Lianoglou Computational Biologist Bioinformatics and Computational Biology Genentech

score 0 · Answer 3 · 2013-10-15

0

Entering edit mode

Hervé Pagès 16k

@herve-pages-1542

Last seen 1 day ago

Seattle, WA, United States

Hi Vince, Sounds like maybe you have a use case for a pintersect method for DNAStringSetList objects: setMethod("pintersect", c("DNAStringSetList", "DNAStringSetList"), function(x, y, ...) { if (length(x) != length(y)) stop("'x' and 'y' must have the same length") ux <- unlist(x, use.name=FALSE) uy <- unlist(y, use.name=FALSE) string_id <- selfmatch(c(ux, uy)) ux_id <- string_id[seq_along(ux)] uy_id <- string_id[seq_along(uy) + length(ux)] ux_group <- rep.int(seq_along(x), elementLengths(x)) uy_group <- rep.int(seq_along(y), elementLengths(y)) m <- IRanges:::matchIntegerPairs(ux_group, ux_id, uy_group, uy_id) keep_idx <- which(!is.na(m)) ux <- ux[keep_idx] ux_group <- ux_group[keep_idx] ux_id <- ux_id[keep_idx] sm <- IRanges:::selfmatchIntegerPairs(ux_group, ux_id) keep_idx <- sm == seq_along(sm) ux <- ux[keep_idx] ux_group <- ux_group[keep_idx] ans_skeleton <- PartitioningByEnd(ux_group, NG=length(x)) relist(ux, ans_skeleton) } ) Then: > alleles1 <- DNAStringSetList("A", c("C", "A"), c("G", "A", "T"), c("T", "G")) > alleles2 <- DNAStringSetList(c("T", "A", "G"), c("A", "G"), "C", c("G", "T")) > pintersect(alleles1, alleles2) DNAStringSetList of length 4 [[1]] A [[2]] A [[3]] A DNAStringSet instance of length 0 [[4]] T G Should take about 20 sec. on your 12-million long vectors. Then use elementLengths() on this result to get the lengths of the intersecting sets. HTH, H. On 10/15/2013 04:16 PM, Vince S. Buffalo wrote: > Hi All, > > I have two vectors of alleles stored as DNAStringSetLists. For each element > in both lists, I need to find the length of the intersecting set. Using > mapply() and intersect() take too long, as does sapply(dna.set.list, > as.character) (and then using mclapply or lapply to find intersect on > characters). Is there a fast way to do this? I have vectors ~12 million > rows long. > > Vince > -- Hervé Pagès Program in Computational Biology Division of Public Health Sciences Fred Hutchinson Cancer Research Center 1100 Fairview Ave. N, M1-B514 P.O. Box 19024 Seattle, WA 98109-1024 E-mail: hpages at fhcrc.org Phone: (206) 667-5791 Fax: (206) 667-1319

ADD COMMENT • link 10.6 years ago Hervé Pagès 16k

0

Entering edit mode

Neat. I was just going to suggest using matchIntegerPairs(). You could have used togroup(x) for this right? ux_group <- rep.int(seq_along(x), elementLengths(x)) On Tue, Oct 15, 2013 at 10:45 PM, Hervé Pagès <hpages@fhcrc.org> wrote: > Hi Vince, > > Sounds like maybe you have a use case for a pintersect method for > DNAStringSetList objects: > > setMethod("pintersect", c("DNAStringSetList", "DNAStringSetList"), > function(x, y, ...) > { > if (length(x) != length(y)) > stop("'x' and 'y' must have the same length") > > ux <- unlist(x, use.name=FALSE) > uy <- unlist(y, use.name=FALSE) > string_id <- selfmatch(c(ux, uy)) > ux_id <- string_id[seq_along(ux)] > uy_id <- string_id[seq_along(uy) + length(ux)] > ux_group <- rep.int(seq_along(x), elementLengths(x)) > uy_group <- rep.int(seq_along(y), elementLengths(y)) > m <- IRanges:::matchIntegerPairs(**ux_group, ux_id, uy_group, > uy_id) > keep_idx <- which(!is.na(m)) > ux <- ux[keep_idx] > ux_group <- ux_group[keep_idx] > ux_id <- ux_id[keep_idx] > sm <- IRanges:::**selfmatchIntegerPairs(ux_**group, ux_id) > keep_idx <- sm == seq_along(sm) > ux <- ux[keep_idx] > ux_group <- ux_group[keep_idx] > ans_skeleton <- PartitioningByEnd(ux_group, NG=length(x)) > relist(ux, ans_skeleton) > } > ) > > Then: > > > alleles1 <- DNAStringSetList("A", c("C", "A"), c("G", "A", "T"), > c("T", "G")) > > > alleles2 <- DNAStringSetList(c("T", "A", "G"), c("A", "G"), "C", > c("G", "T")) > > > pintersect(alleles1, alleles2) > DNAStringSetList of length 4 > [[1]] A > [[2]] A > [[3]] A DNAStringSet instance of length 0 > [[4]] T G > > Should take about 20 sec. on your 12-million long vectors. > > Then use elementLengths() on this result to get the lengths of the > intersecting sets. > > HTH, > H. > > > > On 10/15/2013 04:16 PM, Vince S. Buffalo wrote: > >> Hi All, >> >> I have two vectors of alleles stored as DNAStringSetLists. For each >> element >> in both lists, I need to find the length of the intersecting set. Using >> mapply() and intersect() take too long, as does sapply(dna.set.list, >> as.character) (and then using mclapply or lapply to find intersect on >> characters). Is there a fast way to do this? I have vectors ~12 million >> rows long. >> >> Vince >> >> > -- > Hervé Pagès > > Program in Computational Biology > Division of Public Health Sciences > > Fred Hutchinson Cancer Research Center > 1100 Fairview Ave. N, M1-B514 > P.O. Box 19024 > Seattle, WA 98109-1024 > > E-mail: hpages@fhcrc.org > Phone: (206) 667-5791 > Fax: (206) 667-1319 > > > ______________________________**_________________ > Bioconductor mailing list > Bioconductor@r-project.org > https://stat.ethz.ch/mailman/**listinfo/bioconductor<https: stat.et="" hz.ch="" mailman="" listinfo="" bioconductor=""> > Search the archives: http://news.gmane.org/gmane.** > science.biology.informatics.**conductor<http: news.gmane.org="" gmane.="" science.biology.informatics.conductor=""> > [[alternative HTML version deleted]]

ADD REPLY • link 10.6 years ago Michael Lawrence ★ 11k

0

Entering edit mode

And I guess this also works for CharacterList, and a bit easier for IntegerList. On Wed, Oct 16, 2013 at 6:27 AM, Michael Lawrence <michafla@gene.com> wrote: > Neat. I was just going to suggest using matchIntegerPairs(). > > You could have used togroup(x) for this right? > ux_group <- rep.int(seq_along(x), elementLengths(x)) > > > On Tue, Oct 15, 2013 at 10:45 PM, Hervé Pagès <hpages@fhcrc.org> wrote: > >> Hi Vince, >> >> Sounds like maybe you have a use case for a pintersect method for >> DNAStringSetList objects: >> >> setMethod("pintersect", c("DNAStringSetList", "DNAStringSetList"), >> function(x, y, ...) >> { >> if (length(x) != length(y)) >> stop("'x' and 'y' must have the same length") >> >> ux <- unlist(x, use.name=FALSE) >> uy <- unlist(y, use.name=FALSE) >> string_id <- selfmatch(c(ux, uy)) >> ux_id <- string_id[seq_along(ux)] >> uy_id <- string_id[seq_along(uy) + length(ux)] >> ux_group <- rep.int(seq_along(x), elementLengths(x)) >> uy_group <- rep.int(seq_along(y), elementLengths(y)) >> m <- IRanges:::matchIntegerPairs(**ux_group, ux_id, uy_group, >> uy_id) >> keep_idx <- which(!is.na(m)) >> ux <- ux[keep_idx] >> ux_group <- ux_group[keep_idx] >> ux_id <- ux_id[keep_idx] >> sm <- IRanges:::**selfmatchIntegerPairs(ux_**group, ux_id) >> keep_idx <- sm == seq_along(sm) >> ux <- ux[keep_idx] >> ux_group <- ux_group[keep_idx] >> ans_skeleton <- PartitioningByEnd(ux_group, NG=length(x)) >> relist(ux, ans_skeleton) >> } >> ) >> >> Then: >> >> > alleles1 <- DNAStringSetList("A", c("C", "A"), c("G", "A", "T"), >> c("T", "G")) >> >> > alleles2 <- DNAStringSetList(c("T", "A", "G"), c("A", "G"), "C", >> c("G", "T")) >> >> > pintersect(alleles1, alleles2) >> DNAStringSetList of length 4 >> [[1]] A >> [[2]] A >> [[3]] A DNAStringSet instance of length 0 >> [[4]] T G >> >> Should take about 20 sec. on your 12-million long vectors. >> >> Then use elementLengths() on this result to get the lengths of the >> intersecting sets. >> >> HTH, >> H. >> >> >> >> On 10/15/2013 04:16 PM, Vince S. Buffalo wrote: >> >>> Hi All, >>> >>> I have two vectors of alleles stored as DNAStringSetLists. For each >>> element >>> in both lists, I need to find the length of the intersecting set. Using >>> mapply() and intersect() take too long, as does sapply(dna.set.list, >>> as.character) (and then using mclapply or lapply to find intersect on >>> characters). Is there a fast way to do this? I have vectors ~12 million >>> rows long. >>> >>> Vince >>> >>> >> -- >> Hervé Pagès >> >> Program in Computational Biology >> Division of Public Health Sciences >> >> Fred Hutchinson Cancer Research Center >> 1100 Fairview Ave. N, M1-B514 >> P.O. Box 19024 >> Seattle, WA 98109-1024 >> >> E-mail: hpages@fhcrc.org >> Phone: (206) 667-5791 >> Fax: (206) 667-1319 >> >> >> ______________________________**_________________ >> Bioconductor mailing list >> Bioconductor@r-project.org >> https://stat.ethz.ch/mailman/**listinfo/bioconductor<https: stat.e="" thz.ch="" mailman="" listinfo="" bioconductor=""> >> Search the archives: http://news.gmane.org/gmane.** >> science.biology.informatics.**conductor<http: news.gmane.org="" gmane="" .science.biology.informatics.conductor=""> >> > > [[alternative HTML version deleted]]

ADD REPLY • link 10.6 years ago Michael Lawrence ★ 11k

0

Entering edit mode

Right, this is actually quite generic and would work on any CompressedList derivative with an elementType that supports c(), [, and match() (by default sefmatch() will just call match()). So it should go to IRanges rather than Biostrings. The other parallel set operations (punion, psetdiff) could be implemented in the same manner. Added to my TODO list. Yes, togroup(x) does rep.int(seq_along(x), elementLengths(x)) but it's cleaner to use the former. H. On 10/16/2013 06:31 AM, Michael Lawrence wrote: > And I guess this also works for CharacterList, and a bit easier for > IntegerList. > > > On Wed, Oct 16, 2013 at 6:27 AM, Michael Lawrence <michafla at="" gene.com=""> <mailto:michafla at="" gene.com="">> wrote: > > Neat. I was just going to suggest using matchIntegerPairs(). > > You could have used togroup(x) for this right? > ux_group <- rep.int <http: rep.int="">(seq_along(x), elementLengths(x)) > > > On Tue, Oct 15, 2013 at 10:45 PM, Hervé Pagès <hpages at="" fhcrc.org=""> <mailto:hpages at="" fhcrc.org="">> wrote: > > Hi Vince, > > Sounds like maybe you have a use case for a pintersect method for > DNAStringSetList objects: > > setMethod("pintersect", c("DNAStringSetList", > "DNAStringSetList"), > function(x, y, ...) > { > if (length(x) != length(y)) > stop("'x' and 'y' must have the same length") > > ux <- unlist(x, use.name <http: use.name="">=FALSE) > uy <- unlist(y, use.name <http: use.name="">=FALSE) > string_id <- selfmatch(c(ux, uy)) > ux_id <- string_id[seq_along(ux)] > uy_id <- string_id[seq_along(uy) + length(ux)] > ux_group <- rep.int <http: rep.int="">(seq_along(x), > elementLengths(x)) > uy_group <- rep.int <http: rep.int="">(seq_along(y), > elementLengths(y)) > m <- IRanges:::matchIntegerPairs(__ux_group, ux_id, > uy_group, uy_id) > keep_idx <- which(!is.na <http: is.na="">(m)) > ux <- ux[keep_idx] > ux_group <- ux_group[keep_idx] > ux_id <- ux_id[keep_idx] > sm <- IRanges:::__selfmatchIntegerPairs(ux___group, ux_id) > keep_idx <- sm == seq_along(sm) > ux <- ux[keep_idx] > ux_group <- ux_group[keep_idx] > ans_skeleton <- PartitioningByEnd(ux_group, NG=length(x)) > relist(ux, ans_skeleton) > } > ) > > Then: > > > alleles1 <- DNAStringSetList("A", c("C", "A"), c("G", "A", > "T"), c("T", "G")) > > > alleles2 <- DNAStringSetList(c("T", "A", "G"), c("A", "G"), > "C", c("G", "T")) > > > pintersect(alleles1, alleles2) > DNAStringSetList of length 4 > [[1]] A > [[2]] A > [[3]] A DNAStringSet instance of length 0 > [[4]] T G > > Should take about 20 sec. on your 12-million long vectors. > > Then use elementLengths() on this result to get the lengths of the > intersecting sets. > > HTH, > H. > > > > On 10/15/2013 04:16 PM, Vince S. Buffalo wrote: > > Hi All, > > I have two vectors of alleles stored as DNAStringSetLists. > For each element > in both lists, I need to find the length of the intersecting > set. Using > mapply() and intersect() take too long, as does > sapply(dna.set.list, > as.character) (and then using mclapply or lapply to find > intersect on > characters). Is there a fast way to do this? I have vectors > ~12 million > rows long. > > Vince > > > -- > Hervé Pagès > > Program in Computational Biology > Division of Public Health Sciences > > Fred Hutchinson Cancer Research Center > 1100 Fairview Ave. N, M1-B514 > P.O. Box 19024 > Seattle, WA 98109-1024 > > E-mail: hpages at fhcrc.org <mailto:hpages at="" fhcrc.org=""> > Phone: (206) 667-5791 <tel:%28206%29%20667-5791> > Fax: (206) 667-1319 <tel:%28206%29%20667-1319> > > > _________________________________________________ > Bioconductor mailing list > Bioconductor at r-project.org <mailto:bioconductor at="" r-project.org=""> > https://stat.ethz.ch/mailman/__listinfo/bioconductor > <https: stat.ethz.ch="" mailman="" listinfo="" bioconductor=""> > Search the archives: > http://news.gmane.org/gmane.__science.biology.informatics.__conductor > <http: news.gmane.org="" gmane.science.biology.informatics.conductor=""> > > > -- Hervé Pagès Program in Computational Biology Division of Public Health Sciences Fred Hutchinson Cancer Research Center 1100 Fairview Ave. N, M1-B514 P.O. Box 19024 Seattle, WA 98109-1024 E-mail: hpages at fhcrc.org Phone: (206) 667-5791 Fax: (206) 667-1319

ADD REPLY • link 10.6 years ago Hervé Pagès 16k

0

Entering edit mode

Hi All, Thank you all so much for your excellent and prompt replies. I've used Hervé's solution and it worked wonderfully it took less than 1 minute. Bioconductor's power and speed never ceases to amaze me. Thanks, Vince On Wed, Oct 16, 2013 at 9:20 AM, Hervé Pagès <hpages@fhcrc.org> wrote: > Right, this is actually quite generic and would work on any > CompressedList derivative with an elementType that supports c(), [, > and match() (by default sefmatch() will just call match()). So it > should go to IRanges rather than Biostrings. The other parallel set > operations (punion, psetdiff) could be implemented in the same manner. > Added to my TODO list. > > Yes, togroup(x) does rep.int(seq_along(x), elementLengths(x)) but > it's cleaner to use the former. > > H. > > > On 10/16/2013 06:31 AM, Michael Lawrence wrote: > >> And I guess this also works for CharacterList, and a bit easier for >> IntegerList. >> >> >> On Wed, Oct 16, 2013 at 6:27 AM, Michael Lawrence <michafla@gene.com>> <mailto:michafla@gene.com>> wrote: >> >> Neat. I was just going to suggest using matchIntegerPairs(). >> >> You could have used togroup(x) for this right? >> ux_group <- rep.int <http: rep.int="">(seq_along(x), elementLengths(x)) >> >> >> >> On Tue, Oct 15, 2013 at 10:45 PM, Hervé Pagès <hpages@fhcrc.org>> <mailto:hpages@fhcrc.org>> wrote: >> >> Hi Vince, >> >> Sounds like maybe you have a use case for a pintersect method for >> DNAStringSetList objects: >> >> setMethod("pintersect", c("DNAStringSetList", >> "DNAStringSetList"), >> function(x, y, ...) >> { >> if (length(x) != length(y)) >> stop("'x' and 'y' must have the same length") >> >> ux <- unlist(x, use.name <http: use.name="">=FALSE) >> uy <- unlist(y, use.name <http: use.name="">=FALSE) >> >> string_id <- selfmatch(c(ux, uy)) >> ux_id <- string_id[seq_along(ux)] >> uy_id <- string_id[seq_along(uy) + length(ux)] >> ux_group <- rep.int <http: rep.int="">(seq_along(x), >> elementLengths(x)) >> uy_group <- rep.int <http: rep.int="">(seq_along(y), >> elementLengths(y)) >> m <- IRanges:::matchIntegerPairs(__**ux_group, ux_id, >> uy_group, uy_id) >> keep_idx <- which(!is.na <http: is.na="">(m)) >> >> ux <- ux[keep_idx] >> ux_group <- ux_group[keep_idx] >> ux_id <- ux_id[keep_idx] >> sm <- IRanges:::__**selfmatchIntegerPairs(ux___**group, >> ux_id) >> >> keep_idx <- sm == seq_along(sm) >> ux <- ux[keep_idx] >> ux_group <- ux_group[keep_idx] >> ans_skeleton <- PartitioningByEnd(ux_group, NG=length(x)) >> relist(ux, ans_skeleton) >> } >> ) >> >> Then: >> >> > alleles1 <- DNAStringSetList("A", c("C", "A"), c("G", "A", >> "T"), c("T", "G")) >> >> > alleles2 <- DNAStringSetList(c("T", "A", "G"), c("A", "G"), >> "C", c("G", "T")) >> >> > pintersect(alleles1, alleles2) >> DNAStringSetList of length 4 >> [[1]] A >> [[2]] A >> [[3]] A DNAStringSet instance of length 0 >> [[4]] T G >> >> Should take about 20 sec. on your 12-million long vectors. >> >> Then use elementLengths() on this result to get the lengths of the >> intersecting sets. >> >> HTH, >> H. >> >> >> >> On 10/15/2013 04:16 PM, Vince S. Buffalo wrote: >> >> Hi All, >> >> I have two vectors of alleles stored as DNAStringSetLists. >> For each element >> in both lists, I need to find the length of the intersecting >> set. Using >> mapply() and intersect() take too long, as does >> sapply(dna.set.list, >> as.character) (and then using mclapply or lapply to find >> intersect on >> characters). Is there a fast way to do this? I have vectors >> ~12 million >> rows long. >> >> Vince >> >> >> -- >> Hervé Pagès >> >> Program in Computational Biology >> Division of Public Health Sciences >> >> Fred Hutchinson Cancer Research Center >> 1100 Fairview Ave. N, M1-B514 >> P.O. Box 19024 >> Seattle, WA 98109-1024 >> >> E-mail: hpages@fhcrc.org <mailto:hpages@fhcrc.org> >> Phone: (206) 667-5791 <tel:%28206%29%20667-5791> >> Fax: (206) 667-1319 <tel:%28206%29%20667-1319> >> >> >> ______________________________**___________________ >> Bioconductor mailing list >> Bioconductor@r-project.org <mailto:bioconductor@r-**project.org<bioconductor@r-project.org> >> > >> https://stat.ethz.ch/mailman/_**_listinfo/bioconductor<http s:="" stat.ethz.ch="" mailman="" __listinfo="" bioconductor=""> >> >> <https: stat.ethz.ch="" mailman="" **listinfo="" bioconductor<https="" :="" stat.ethz.ch="" mailman="" listinfo="" bioconductor=""> >> > >> Search the archives: >> http://news.gmane.org/gmane.__**science.biology.informatics.__** >> conductor<http: news.gmane.org="" gmane.__science.biology.informatics="" .__conductor=""> >> <http: news.gmane.org="" gmane.**science.biology.informatics.**="">> conductor<http: news.gmane.org="" gmane.science.biology.informatics.c="" onductor=""> >> > >> >> >> >> > -- > Hervé Pagès > > Program in Computational Biology > Division of Public Health Sciences > Fred Hutchinson Cancer Research Center > 1100 Fairview Ave. N, M1-B514 > P.O. Box 19024 > Seattle, WA 98109-1024 > > E-mail: hpages@fhcrc.org > Phone: (206) 667-5791 > Fax: (206) 667-1319 > -- Vince Buffalo Ross-Ibarra Lab www.rilab.org) Plant Sciences, UC Davis [[alternative HTML version deleted]]

ADD REPLY • link 10.6 years ago Vince S. Buffalo ▴ 140