Why does a call to "unique" remove a DNAStringSet's names?


Valerie Obenchain ★ 6.8k


Hi Nico,

Sorry it's taken awhile to get back to you. I wanted to ask about what behavior you'd expect from a call to unique() on a DNAStringSet, i.e., what is your use case?

unique() on a named character vector drops names:

chr <- c(a="A", c="C", aa="A", c="CC")
> unique(chr)
[1] "A"  "C"  "CC"

Same for a named list:

lst <- list(a="A", c="C", aa="A", c="CC")
> unique(lst)
[[1]]
[1] "A"

[[2]]
[1] "C"

[[3]]
[1] "CC"

unique() on a DNAStringSet was patterned after this behavior. If names were kept, would it be useful to retain only the name of the first duplicate? In the data above there are two "A"'s. Would you want 'a' kept and 'aa' dropped?

Valerie

On 07/26/2012 08:36 AM, Nicolas Delhomme wrote:
> Hi,
>
> I've just realized that a call to unique on a DNAStringSet would result in the names slot to disappear. There's nothing about this in the documentation, but if that's the desired effect, warning about it would be good :-)
>
> Here is how to reproduce it:
>
> library(Biostrings)
> dset<-DNAStringSet(c("A","C"))
> names(dset)<- c("a","a")
> dset
> unique(dset)
>
> It gives:
>
> > dset
>   A DNAStringSet instance of length 2
>     width seq names
> [1]     1 A   a
> [2]     1 C   a
> > unique(dset)
>   A DNAStringSet instance of length 2
>     width seq
> [1]     1 A
> [2]     1 C
>
> My sessionInfo():
>
> R version 2.15.1 (2012-06-22)
> Platform: x86_64-apple-darwin9.8.0/x86_64 (64-bit)
>
> locale:
> [1] C/UTF-8/C/C/C/C
>
> attached base packages:
> [1] stats graphics grDevices utils datasets methods base
>
> other attached packages:
> [1] Biostrings_2.25.8 IRanges_1.15.24 BiocGenerics_0.3.0
>
> loaded via a namespace (and not attached):
> [1] stats4_2.15.1 tools_2.15.1
>
> Cheers,
>
> Nico
>
> ---------------------------------------------------------------
> Nicolas Delhomme
>
> Nathaniel Street Lab
> Department of Plant Physiology
> Umeå Plant Science Center
>
> Tel: +46 90 786 7989
> Email: nicolas.delhomme at plantphys.umu.se
> SLU - Umeå universitet
> Umeå S-901 87 Sweden
>
> _______________________________________________
> Bioconductor mailing list
> Bioconductor at r-project.org
> https://stat.ethz.ch/mailman/listinfo/bioconductor
> Search the archives: http://news.gmane.org/gmane.science.biology.informatics.conductor
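The question of which name survives can be made concrete in base R. The following is a minimal sketch on the same named character vector, not a description of how Biostrings implements unique(): subsetting with !duplicated() keeps the name of the first occurrence of each value, whereas unique() discards names altogether. The variable names `dropped` and `kept` are illustrative only.

```r
# Same data as in the post above; note the name "c" is used twice.
chr <- c(a = "A", c = "C", aa = "A", c = "CC")

# base::unique() returns the distinct values without names.
dropped <- unique(chr)
print(names(dropped))   # NULL

# Logical subsetting keeps names; duplicated() flags every repeat
# after the first occurrence, so 'a' is kept and 'aa' is dropped.
kept <- chr[!duplicated(chr)]
print(kept)             # values "A" "C" "CC" with names a, c, c
```

With this idiom the answer to "keep 'a', drop 'aa'?" is yes by construction, because duplicated() only marks the later copies.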


updated 11.5 years ago by Hervé Pagès 16k • written 11.5 years ago by Valerie Obenchain ★ 6.8k


Hervé Pagès 16k


Hi Nico, Val,

Yes sorry for taking so long Nico, I didn't notice your email before.

2 additional issues I didn't realize:

(1) unique() does not drop the metadata columns of a DNAStringSet:

> dset
  A DNAStringSet instance of length 2
    width seq names
[1]     1 A   a
[2]     1 C   a
> mcols(dset)
DataFrame with 2 rows and 1 column
      score
  <integer>
1         1
2         2
> mcols(unique(dset))
DataFrame with 2 rows and 1 column
      score
  <integer>
1         1
2         2

(2) unique() doesn't treat DNAStringSet consistently with GRanges:

> gr
GRanges with 5 ranges and 2 metadata columns:
    seqnames    ranges strand |     score                GC
       <Rle> <IRanges>  <Rle> | <integer>         <numeric>
  a     chr1   [1, 10]      - |         1                 1
  b     chr2   [2, 10]      + |         2 0.888888888888889
  c     chr2   [3, 10]      + |         3 0.777777777777778
  d     chr2   [2, 10]      + |         2 0.888888888888889
  e     chr2   [4, 10]      * |         4 0.666666666666667
  ---
  seqlengths:
   chr1 chr2 chr3
   1000 2000 1500
> unique(gr)
GRanges with 4 ranges and 2 metadata columns:
    seqnames    ranges strand |     score                GC
       <Rle> <IRanges>  <Rle> | <integer>         <numeric>
  a     chr1   [1, 10]      - |         1                 1
  b     chr2   [2, 10]      + |         2 0.888888888888889
  c     chr2   [3, 10]      + |         3 0.777777777777778
  e     chr2   [4, 10]      * |         4 0.666666666666667
  ---
  seqlengths:
   chr1 chr2 chr3
   1000 2000 1500

On a GRanges, it just does x[!duplicated(x)], so not only the names are propagated but also the metadata columns.

So the choices are:

(a) we do the same for DNAStringSet, even if that's not what base::unique() does,

(b) we choose to have unique() drop the names and metadata columns of any Vector object (DNAStringSet, GRanges, etc...),

(c) we add the 'use.names' and 'use.mcols' args to unique(), with defaults to FALSE? or to TRUE?

(d) ?

I have a small preference for (a) even though I'm not really sure what the use cases are.

Whatever we do, we should have unique() behave consistently on any member of the Vector family and also treat names and metadata columns the same way.

Thanks,
H.

On 11/15/2012 09:11 AM, Valerie Obenchain wrote:
> [...]

--
Hervé Pagès

Program in Computational Biology
Division of Public Health Sciences
Fred Hutchinson Cancer Research Center
1100 Fairview Ave. N, M1-B514
P.O. Box 19024
Seattle, WA 98109-1024

E-mail: hpages at fhcrc.org
Phone: (206) 667-5791
Fax: (206) 667-1319
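For readers who need names and metadata columns kept regardless of how unique() behaves in their installed version, option (a) can be written out by hand. This is a hedged sketch, assuming Biostrings is installed and that duplicated() and `[` work on a DNAStringSet the way the post describes for GRanges. The object `dset` below and its `score` column are made-up illustration data, not the objects from the thread.

```r
library(Biostrings)

# Illustrative data (an assumption, not from the thread):
# two identical sequences "A" under different names.
dset <- DNAStringSet(c(a = "A", c = "C", aa = "A", cc = "CC"))
mcols(dset) <- DataFrame(score = 1:4)

# Option (a) spelled out: keep the first occurrence of each sequence.
# Subsetting carries names and metadata columns along with the sequences.
uniq <- dset[!duplicated(dset)]

print(names(uniq))        # "a" "c" "cc"
print(mcols(uniq)$score)  # 1 2 4
```

Writing the subsetting explicitly makes the outcome independent of whichever of the options (a) to (d) a given release adopts.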

11.5 years ago • Hervé Pagès 16k
