Deleting object rows while looping - II
@herve-pages-1542
An alternative to sorting the data.frame and creating views on it is to
use a DataFrame and split it:

    dat <- DataFrame(chrpos, seqs)
    splitted_dat <- split(dat, dat$chrpos)

Unlike splitting a data.frame, which is generally very inefficient
(especially if the split factor has a lot of levels -- don't try this
with > 50000 levels or you'll blow up your machine), splitting a
DataFrame is very fast and very efficient. If the split factor has
> 20000 levels, it will typically be hundreds or thousands of times
faster than with a data.frame and use hundreds of times less memory.
Thanks Michael!

The result of splitting a DataFrame is a SplitDataFrameList object,
which is a particular type of CompressedList object:

    > is(splitted_dat, "CompressedList")
    [1] TRUE

A SplitDataFrameList, like any CompressedList object, can be unlisted
in the blink of an eye:

    > system.time(dat2 <- unlist(splitted_dat))
       user  system elapsed
      0.000   0.000   0.001

So what was the point of splitting and then unlisting? In 'dat2', all
the rows that correspond to the same locus (i.e. the same dat2$chrpos)
are grouped together, as in Martin's sorted data.frame, and the indices
of the starting and ending row of each group (Martin's views) can be
obtained by calling PartitioningByEnd() on 'splitted_dat':

    views <- PartitioningByEnd(splitted_dat)

    > views
    PartitioningByEnd of length 1237342
                 start    end width        names
            [1]      1      3     3       chr1 1
            [2]      4     14    11      chr1 10
            [3]     15     20     6     chr1 100
            [4]     21     27     7    chr1 1000
            ...    ...    ...   ...          ...
      [1237339] 201767 201770     4 chrY 1854205
      [1237340] 201771 201773     3 chrY 1854206
      [1237341] 201774 201784    11 chrY 1854207
      [1237342] 201785 201796    12 chrY 1854208

Then get the starts/ends of the views with:

    start(views)
    end(views)

HTH,
H.

On 04/30/2013 01:53 PM, Martin Morgan wrote:
> On 04/29/2013 11:52 PM, Daniel Berner wrote:
>> Martin, Steve
>>
>> Thanks for your suggestions, much appreciated! My core problem is
>> that the operations inside the loop are too complex for me to allow
>> implementing any function() and apply() combination (there are
>> multiple interleaved if() lines etc.), and I was not aware of the
>> data.table package, but this also looks hard to me, given the many
>> steps performed within the loop.
>>
>> To be more specific, I want to reduce a number of Illumina short
>> reads at genomic loci to diploid or haploid (depending on coverage)
>> consensus genotypes, by evaluating several test criteria. Below I am
>> copying the full code. The input is an alignment in BAM format.
>>
>> Any suggestion to make this faster is most welcome, as the code
>> takes more than a week to run for each individual, and I have to
>> process 96 of them (I have multicore, but still)! It seems to me
>> that the rate-limiting step is the subsetting; the other
>> manipulations are fast enough to work in such a primitive loop.
>>
>> cheers,
>> Daniel Berner
>> Zoological Institute
>> University of Basel
>>
>> #########################
>> library(Rsamtools)
>> rm(list=ls())
>> M<-"GAAGC"  ### specify MID (individual)
>> lib<-25     ### specify library
>>
>> # genotyping settings:
>> foldcov<-3    # a RAD locus is not genotyped if its total seq coverage is more
>>               # than foldcov times the overall mean coverage
>>               # (=length(posi)/length(u.chrpos)); eliminates repeats
>> prop.1.2<-0.7 # a RAD locus is not genotyped if the proportion of the two
>>               # dominant haplotypes together is <= prop.1.2, relative to the
>>               # total coverage of the locus; eliminates repeats
>> sum.1.2<-15   # a RAD locus is not genotyped as DIPLOID if the sum of the
>>               # coverage of the two dominant haplotypes (or simply the
>>               # coverage for monomorphic loci) is less than sum.1.2
>> het<-1/3      # a diploid locus is called heterozygous if the ratio
>>               # count.htp.2nd/count.htp.dominant is greater than het
>> min<-2        # a haploid locus is called when coverage is at least min
>>               # (but lower than sum.1.2)
>>
>> # infile upload:
>> #folder<-'/Users/dani/Documents/genome scan/novocraft'  # here are the BAMs
>> folder<-'C:\\Documents and Settings\\daniel\\Desktop\\temp'  # here are the BAMs
>> infile<-paste(folder, "/novAl_lib_", lib, "_", M, ".bam", sep="")  # infile path
>> outfile<-paste(folder, '/', 'lib_', lib, '_', M, ".consensus.txt", sep="")  # outfile path
>> #outfile<-paste("/Users/dani/Documents/genome\ scan/R_bioconductor/",
>> #               'lib_', lib, '/', 'lib_', lib, '_', M, ".consensus.txt", sep="")
>> param<-ScanBamParam(flag=scanBamFlag(isUnmappedQuery=FALSE),
>>                     what=c("rname", "pos", "seq", "strand"),
>>                     tag="Z3", reverseComplement=TRUE)
>> f<-scanBam(infile, param=param)[[1]]  # upload for a c. 1GB BAM is c. 1 minute
>>
>> # extract the relevant data:
>> chr<-c(as.character(f$rname[grep("-",f$strand,invert=T)]),
>>        as.character(f$rname[grep("-",f$strand,invert=F)]))
>> posi<-c(f$pos[grep("-",f$strand,invert=T)],
>>         f$tag[[1]][grep("-",f$strand,invert=F)])
>
> I was wondering if that 'f$tag' was a typo, it seems to break the
> symmetry of these commands...?
>
>> seqs<-as.character(c(narrow(f$seq,start=6,end=94)[grep("-",f$strand,invert=T)],
>>                      narrow(f$seq,start=6,end=94)[grep("-",f$strand,invert=F)]))
>>
>> chrpos<-paste(chr, posi, sep=' ')
>> u.chrpos<-sort(unique(chrpos))
>> length(posi)                   # total number of alignments
>> length(u.chrpos)               # this gives the number of unique loci
>> length(posi)/length(u.chrpos)  # mean coverage per locus
>> c.thr<-as.integer(length(posi)/length(u.chrpos)*foldcov)  # relative coverage threshold based on average coverage
>>
>> # derive the data object to process:
>> dat<-data.frame(cbind(chrpos, seqs))
>
> probably want data.frame(<...>, stringsAsFactors=FALSE)
>
>> # clean the environment and release memory:
>> rm(f, chr, posi, seqs, chrpos)  # no longer used
>> gc()
>>
>> # process the dat object while writing to 'cons'
>> # outfile data structure: chrpos / coverage / X-Y (needed for SNP calling) /
>> # diploid_homoz(N)-OR-diploid_heteroz(Y)-OR-haploid_low(L) / seq
>> cons<-NULL
>
> A variant of Herve's suggestion to use append=TRUE (his idea is
> better!) is
>
>     con <- file(outfile, "w")
>
> which opens the file, and then instead of accumulating the results in
> memory you could, inside your loop,
>
>     write.table(<intermediate result>, con)
>
> and then when done remember to
>
>     close(con)
>
> you'll need to make sure not to write headers in the write.table()
> statement. But I think this is mostly not relevant in your revised
> work flow...
>
>> for (i in 1:length(u.chrpos)){
>
> My suggestion is a little complicated, but you're really interested
> in turning this loop into a vector operation. I think you can do this
> by sorting your data frame
>
>     dat <- dat[order(dat$chrpos, dat$seq),]
>
> It's helpful to have a 'view' onto dat, where the view is a
> data.frame with indices marking the beginning and end of each
> 'chrpos' group. I created some helper functions
>
>     ends <- function(x)
>         c(x[-length(x)] != x[-1], TRUE)
>     starts <- function(x)
>         c(TRUE, ends(x)[-length(x)])
>
> which create a logical vector of the same length as x, with a TRUE
> wherever a run of identical chrpos values starts or ends (something
> similar could be done with the rle() or Rle() functions). So my view
> is then
>
>     view <- data.frame(start = which(starts(dat$chrpos)),
>                        end = which(ends(dat$chrpos)))
>
>> stack<-DNAStringSet(as.character(subset(dat, chrpos==u.chrpos[i])[,2]))
>> stack.le<-length(stack)
>> if(stack.le<=c.thr){
>
> for this if statement we need to know how many distinct sequences are
> in each view. I did this by numbering the sequences sequentially
>
>     seqid <- cumsum(starts(dat$seqs))
>
> and then finding the difference between the id of the last and first
> sequence
>
>     view$n_seq <- seqid[view$end] - seqid[view$start] + 1L
>
>> #monomorphic:
>> if (length(unique(stack))==1){
>
> The 'view' onto the data representing the monomorphic locations is
>
>     mono <- view[view$n_seq == 1, , drop=FALSE]
>
>> #well covered:
>> if(stack.le>=sum.1.2){
>
> and the 'well-covered' locations are
>
>     idx <- (mono$end - mono$start + 1L) >= sum.1.2
>     ok <- mono[idx, , drop=FALSE]
>
>> gtp.X<-paste(u.chrpos[i], stack.le, "X", "N", as.character(stack[1]), sep=" ")
>
> and you're after the information
>
>     gtp.X <- paste(dat[ok$start, "chrpos"], ok$n_seq, "X", "N",
>                    dat[ok$start, "seqs"])
>
> (although I think you're really just adding columns and entries to
> 'view'...). To emphasize, this processes _all_ the monomorphic,
> well-covered sites, without any looping.
>
> And so on through the rest of your code.
>
> This type of operation is well-suited to data.table, though I'm not
> sure enough of the syntax and implementation to know whether Steve's
>
>     dat <- data.table(chr=chr, pos=posi, seqs=seqs, key=c('chr', 'pos'))
>     result <- dat[, {
>         list(n.reads=.N, n.unique=length(unique(seqs)))
>     }, by=c('chr', 'pos')]
>
> is implemented efficiently -- I'm sure the .N is; just not whether
> clever thinking is used behind the scenes to avoid looping through
> function(x) length(unique(x)). The syntax is certainly clearer than
> my 'view' approach.
>
> Martin
>
>> gtp.Y<-paste(u.chrpos[i], stack.le, "Y", "N", as.character(stack[1]), sep=" ")
>> cons<-rbind(cons, gtp.X, gtp.Y)
>> }
>> #poorly covered:
>> if((stack.le<sum.1.2) && (stack.le>=min)){
>> gtp.X<-paste(u.chrpos[i], stack.le, "X", "L", as.character(stack[1]), sep=" ")
>> cons<-rbind(cons, gtp.X)
>> }
>> }
>> #polymorphic:
>> if(length(unique(stack))>1){
>> cnts<-sort(table(as.character(stack)), decreasing=T)[1:2]
>> #no repeat issue:
>> if((sum(cnts)/stack.le)>prop.1.2){
>> #heterozygote:
>> if((sum(cnts)>=sum.1.2) && ((cnts[2]/cnts[1])>het)){
>> gtp.X<-paste(u.chrpos[i], stack.le, "X", "Y", as.character(names(cnts)[1]), sep=" ")
>> gtp.Y<-paste(u.chrpos[i], stack.le, "Y", "Y", as.character(names(cnts)[2]), sep=" ")
>> cons<-rbind(cons, gtp.X, gtp.Y)
>> }
>> #homozygote (minor variant too rare, artifact):
>> if((sum(cnts)>=sum.1.2) && ((cnts[2]/cnts[1])<=het)){
>> # minor variant too rare, artifact; hence locus diploid homozygote
>> gtp.X<-paste(u.chrpos[i], stack.le, "X", "N", as.character(names(cnts)[1]), sep=" ")
>> gtp.Y<-paste(u.chrpos[i], stack.le, "Y", "N", as.character(names(cnts)[1]), sep=" ")
>> cons<-rbind(cons, gtp.X, gtp.Y)
>> }
>> #coverage poor, gtp unclear, haploid call:
>> if((sum(cnts)<sum.1.2) && (cnts[1]>=min)){
>> gtp.X<-paste(u.chrpos[i], stack.le, "X", "L", as.character(names(cnts)[1]), sep=" ")
>> cons<-rbind(cons, gtp.X)
>> }
>> }
>> }
>> }
>> }
>> write.table(cons, outfile, row.names=F, col.names=F, quote=F)  # write to individual file
>> ### END

--
Hervé Pagès
Program in Computational Biology
Division of Public Health Sciences
Fred Hutchinson Cancer Research Center
1100 Fairview Ave. N, M1-B514
P.O. Box 19024
Seattle, WA 98109-1024
E-mail: hpages at fhcrc.org
Phone: (206) 667-5791
Fax: (206) 667-1319
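To make the suggestions above easy to try, here is a minimal,
self-contained sketch (toy data of my own, not code from the thread;
the value 15 stands in for Daniel's sum.1.2 threshold):

    suppressPackageStartupMessages(library(IRanges))  # DataFrame, PartitioningByEnd

    ## Toy stand-ins for the thread's 'chrpos' and 'seqs' (normally
    ## extracted from the BAM file with scanBam()):
    chrpos <- c("chr1 1", "chr1 1", "chr1 1", "chr1 10", "chr1 10", "chr2 5")
    seqs   <- c("ACGT",   "ACGT",   "ACGT",   "TTTT",    "TGTT",    "GGGG")

    dat       <- DataFrame(chrpos, seqs)
    split_dat <- split(dat, dat$chrpos)    # SplitDataFrameList (a CompressedList)
    dat2      <- unlist(split_dat)         # rows now grouped by locus
    views     <- PartitioningByEnd(split_dat)

    ## Martin's 'view' data.frame, recovered directly from the partitioning:
    view <- data.frame(start = start(views), end = end(views))

    ## Per-locus quantities, with no explicit loop over loci:
    coverage <- width(views)                                # reads per locus
    n_seq    <- lengths(unique(splitAsList(seqs, chrpos)))  # distinct haplotypes

    ## e.g. the "monomorphic and well covered" test, vectorized over all loci:
    mono_ok <- n_seq == 1L & coverage >= 15

split(), splitAsList() and PartitioningByEnd() all order the groups by
the sorted levels of the split factor, so 'coverage', 'n_seq' and the
rows of 'view' line up locus for locus.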
@herve-pages-1542
On 04/30/2013 08:12 PM, Hervé Pagès wrote:
> An alternative to sorting the data.frame and creating views on it is
> to use a DataFrame and split it:
>
>     dat <- DataFrame(chrpos, seqs)
>     splitted_dat <- split(dat, dat$chrpos)

splitted? mmmh, looks like I should use split_dat instead of
splitted_dat...

H.
@steve-lianoglou-2771
Hi folks,

Wow -- that's a lot of great suggestions coming straight from the
bioc-wizards themselves.

I'm still not going to jump into the details on the logic of what
Daniel *really* wants, just want to make a comment on Martin's last
point:

> This type of operation is well-suited to data.table, though I'm not
> sure enough of the syntax and implementation to know whether Steve's
>
>     dat <- data.table(chr=chr, pos=posi, seqs=seqs, key=c('chr', 'pos'))
>     result <- dat[, {
>         list(n.reads=.N, n.unique=length(unique(seqs)))
>     }, by=c('chr', 'pos')]
>
> is implemented efficiently -- I'm sure the .N is; just not whether
> clever thinking is used behind the scenes to avoid looping through
> function(x) length(unique(x)). The syntax is certainly clearer than
> my 'view' approach.

I only really used this as a pedagogical example to show that one
could access subsets of the columns directly by name within the `j`
expression of the `[.data.table` function.

The .N is essentially a no-op to call as it is already computed for
you, but repeatedly calling a function within each grouped subset will
incur the overhead of a function call within each subgroup.

Still, I think the OP would notice a significant boost in performance
by simply naively translating his code using data.table -- if you
really wanted to eke out the last bit of performance (which isn't
really necessary if you're just doing things once, but if you're
building a pipeline, feel free), that'd be another convo ...

Anyway, it looks like there's a lot of good stuff in this thread
already. I'd be curious to hear back from Daniel when he tries a few
of these things. Also, wasn't aware of the new(?) `SplitDataFrameList`
mojo -- very nice stuff.

-steve

--
Steve Lianoglou
Computational Biologist
Department of Bioinformatics and Computational Biology
Genentech
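For readers who want to try the data.table route, here is a small
runnable sketch of the grouping discussed above (toy data of my own;
uniqueN() is a data.table function added after this thread that covers
the length(unique(x)) case without the per-group function-call overhead
Steve mentions):

    library(data.table)

    ## Toy stand-ins for the thread's chr/posi/seqs (normally from scanBam()):
    dat <- data.table(chr  = c("chr1", "chr1", "chr1", "chr2"),
                      pos  = c(1L, 1L, 10L, 5L),
                      seqs = c("ACGT", "ACGT", "TTTT", "GGGG"),
                      key  = c("chr", "pos"))

    ## .N is precomputed per group; uniqueN(seqs) is the optimized
    ## counterpart of length(unique(seqs)):
    result <- dat[, .(n.reads = .N, n.unique = uniqueN(seqs)), by = .(chr, pos)]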
Hi Steve,

On 05/01/2013 09:50 AM, Steve Lianoglou wrote:
[...]
> Anyway, it looks like there's a lot of good stuff in this thread
> already. I'd be curious to hear back from Daniel when he tries a few
> of these things. Also, wasn't aware of the new(?) `SplitDataFrameList`
> mojo -- very nice stuff.

This has been in IRanges since the beginning of the package (2008).
IIRC when Michael added DataFrame/SplitDataFrameList to the package,
the primary use case was to store the values of a RangedData object
(the "values" slot of RangedData is a SplitDataFrameList).

If *you* were not aware of SplitDataFrameList, then that means that
*we* are failing to properly communicate/document/expose/advertise the
IRanges/GenomicRanges infrastructure :-/

H.