Why does a call to "unique" removes a DNAStringSet names?

0

Entering edit mode

Hervé Pagès 16k

@herve-pages-1542

Last seen 12 hours ago

Seattle, WA, United States

Hi Nico, Val, Just to let you know that this is addressed in the latest IRanges/Biostrings (devel): dna <- DNAStringSet(c(a="AAA", b="CC", c="ATTA", d="CC", e="CG")) mcols(dna) <- DataFrame(score=1:5) Then: > unique(dna) A DNAStringSet instance of length 4 width seq names [1] 3 AAA a [2] 2 CC b [3] 4 ATTA c [4] 2 CG e > mcols(unique(dna)) DataFrame with 4 rows and 1 column score <integer> 1 1 2 2 3 3 4 5 This new behavior is consistent with what unique() does on a GRanges object or a Vector object in general. Cheers, H. On 11/15/2012 03:27 PM, Hervé Pagès wrote: > Hi Nico, Val, > > Yes sorry for taking so long Nico, I didn't notice your email > before. > > 2 additional issues I didn't realize: > > (1) unique() does not drop the metadata columns of a DNAStringSet: > > > dset > A DNAStringSet instance of length 2 > width seq names > [1] 1 A a > [2] 1 C a > > mcols(dset) > DataFrame with 2 rows and 1 column > score > <integer> > 1 1 > 2 2 > > > mcols(unique(dset)) > DataFrame with 2 rows and 1 column > score > <integer> > 1 1 > 2 2 > > (2) unique() doesn't treat DNAStringSet consistently with GRanges: > > > gr > GRanges with 5 ranges and 2 metadata columns: > seqnames ranges strand | score GC > <rle> <iranges> <rle> | <integer> <numeric> > a chr1 [1, 10] - | 1 1 > b chr2 [2, 10] + | 2 0.888888888888889 > c chr2 [3, 10] + | 3 0.777777777777778 > d chr2 [2, 10] + | 2 0.888888888888889 > e chr2 [4, 10] * | 4 0.666666666666667 > --- > seqlengths: > chr1 chr2 chr3 > 1000 2000 1500 > > unique(gr) > GRanges with 4 ranges and 2 metadata columns: > seqnames ranges strand | score GC > <rle> <iranges> <rle> | <integer> <numeric> > a chr1 [1, 10] - | 1 1 > b chr2 [2, 10] + | 2 0.888888888888889 > c chr2 [3, 10] + | 3 0.777777777777778 > e chr2 [4, 10] * | 4 0.666666666666667 > --- > seqlengths: > chr1 chr2 chr3 > 1000 2000 1500 > > On a GRanges, it just does x[!duplicated(x)], so not only the names > are propagated but also the metadata columns. > > So the choices are: > (a) we do the same for DNAStringSet, even if that's not what > base::unique() does, > (b) we choose to have unique() drop the names and metadata columns > of any Vector object (DNAStringSet, GRanges, etc...), > (c) we add the 'use.names' and 'use.mcols' args to unique(), with > defaults to FALSE? or to TRUE? > (d) ? > > I have a small preference for (a) even though I'm not really sure what > the use cases are. Whatever we do, we should have unique() behave > consistently on any member of the Vector family and also treat > names and metadata columns the same way. > > Thanks, > H. > > > On 11/15/2012 09:11 AM, Valerie Obenchain wrote: >> Hi Nico, >> >> Sorry it's taken awhile to get back to you. I wanted to ask about what >> behavior you'd expect from a call to unique() on a DNAStringSet, i.e., >> what is your use case? >> >> >> unique() on a named character vector drops names: >> chr <- c(a="A", c="C", aa="A", c="CC") >> > unique(chr) >> [1] "A" "C" "CC" >> >> >> Same for a named list: >> lst <- list(a="A", c="C", aa="A", c="CC") >> > unique(lst) >> [[1]] >> [1] "A" >> >> [[2]] >> [1] "C" >> >> [[3]] >> [1] "CC" >> >> >> unique() on a DNAStringSet was patterned after this behavior. If names >> were kept, would it be useful to retain only the name of the first >> duplicate? In the data above there are two "A"'s. Would you want 'a' >> kept and 'aa' dropped? >> >> Valerie >> >> >> >> >> On 07/26/2012 08:36 AM, Nicolas Delhomme wrote: >>> Hi, >>> >>> I've just realized that a call to unique on a DNAStringSet would >>> result in the names slot to disappear. There's nothing about this in >>> the documentation, but if that's the desired effect, warning about it >>> would be good :-) >>> >>> Here is how to reproduce it: >>> >>> library(Biostrings) >>> dset<-DNAStringSet(c("A","C")) >>> names(dset)<- c("a","a") >>> dset >>> unique(dset) >>> >>> >>> It gives: >>> >>>> dset >>> A DNAStringSet instance of length 2 >>> width seq names >>> [1] 1 A a >>> [2] 1 C a >>>> unique(dset) >>> A DNAStringSet instance of length 2 >>> width seq >>> [1] 1 A >>> [2] 1 C >>> >>> My sessionInfo(): >>> >>> R version 2.15.1 (2012-06-22) >>> Platform: x86_64-apple-darwin9.8.0/x86_64 (64-bit) >>> >>> locale: >>> [1] C/UTF-8/C/C/C/C >>> >>> attached base packages: >>> [1] stats graphics grDevices utils datasets methods base >>> >>> other attached packages: >>> [1] Biostrings_2.25.8 IRanges_1.15.24 BiocGenerics_0.3.0 >>> >>> loaded via a namespace (and not attached): >>> [1] stats4_2.15.1 tools_2.15.1 >>> >>> Cheers, >>> >>> Nico >>> >>> --------------------------------------------------------------- >>> Nicolas Delhomme >>> >>> Nathaniel Street Lab >>> Department of Plant Physiology >>> Ume? Plant Science Center >>> >>> Tel: +46 90 786 7989 >>> Email: nicolas.delhomme at plantphys.umu.se >>> SLU - Ume? universitet >>> Ume? S-901 87 Sweden >>> >>> _______________________________________________ >>> Bioconductor mailing list >>> Bioconductor at r-project.org >>> https://stat.ethz.ch/mailman/listinfo/bioconductor >>> Search the archives: >>> http://news.gmane.org/gmane.science.biology.informatics.conductor >> >> _______________________________________________ >> Bioconductor mailing list >> Bioconductor at r-project.org >> https://stat.ethz.ch/mailman/listinfo/bioconductor >> Search the archives: >> http://news.gmane.org/gmane.science.biology.informatics.conductor > -- Hervé Pagès Program in Computational Biology Division of Public Health Sciences Fred Hutchinson Cancer Research Center 1100 Fairview Ave. N, M1-B514 P.O. Box 19024 Seattle, WA 98109-1024 E-mail: hpages at fhcrc.org Phone: (206) 667-5791 Fax: (206) 667-1319

Cancer Cancer • 1.7k views

ADD COMMENT • link 13.1 years ago Hervé Pagès 16k

Login before adding your answer.