Search
Question: Deleting object rows while looping - II
0
5.1 years ago by
Hervé Pagès ♦♦ 13k
United States
Hervé Pagès ♦♦ 13k wrote:
An alternative to sorting the data.frame and creating views on it is to use a DataFrame and split it: dat <- DataFrame(chrpos, seqs) splitted_dat <- split(dat, dat$chrpos) Unlike splitting a data.frame, which is generally very inefficient (especially if the split factor has a lot of levels -- don't try this with > 50000 levels or you'll blow your machine), splitting a DataFrame is very fast and very efficient. If the split factor has > 20000 levels, it will typically be hundreds or thousands times faster than with a data.frame and use hundreds times less memory. Thanks Michael! The result of splitting a DataFrame is a SplitDataFrameList object, which is a particular type of CompressedList object: > is(splitted_dat, "CompressedList") [1] TRUE A SplitDataFrameList, like any CompressedList object, can be unlisted in the blink of an eye: > system.time(dat2 <- unlist(splitted_dat)) user system elapsed 0.000 0.000 0.001 So what was the point of splitting and now unlisting? In 'dat2', all the rows that correspond to the same locus (i.e. same dat2$chrpos) are grouped together, like in Martin's sorted data.frame, and the indices of the starting and ending row of each group (Martin's views) can be obtained by doing PartitioningByEnd() on 'splitted_dat': views <- PartitioningByEnd(splitted_dat) > views PartitioningByEnd of length 1237342 start end width names [1] 1 3 3 chr1 1 [2] 4 14 11 chr1 10 [3] 15 20 6 chr1 100 [4] 21 27 7 chr1 1000 ... ... ... ... ... [1237339] 201767 201770 4 chrY 1854205 [1237340] 201771 201773 3 chrY 1854206 [1237341] 201774 201784 11 chrY 1854207 [1237342] 201785 201796 12 chrY 1854208 Then get the starts/ends of the views with: start(views) end(views) HTH, H. On 04/30/2013 01:53 PM, Martin Morgan wrote: > On 04/29/2013 11:52 PM, Daniel Berner wrote: >> Martin, Steve >> >> Thanks for your suggestions, much appreciated! My core problem is that >> the >> operations inside the loop are too complex for me to allow >> implementing any >> function() and apply() combination (there are multiple interleaved >> if() lines >> etc). and I was not aware of the data.table package, but this also looks >> hard to me, given the many steps performed within the loop. >> >> to be more specific, I want to reduce a number of Illumina short reads at >> genomic loci to diploid or haploid (depending on coverage) consensus >> genotypes, by evaluating several test criteria. Below I am copying the >> full >> code. the input is an alignment in BAM format. >> >> Any suggestion to make this faster is most welcome, as the code takes >> more >> than a week to run for each individual, and I have to process 96 of >> them (I >> have multicore, but still)! It seems to me that the rate-limiting step >> is the >> subsetting; the other manipulations are fast enough to work in such a >> primitive loop. >> >> cheers, >> Daniel Berner >> Zoological Institute >> University of Basel >> >> ######################### >> library(Rsamtools) >> rm(list=ls()) >> M<-"GAAGC" ### specify MID (individual) >> lib<-25 ### specify library >> >> # genotyping settings: >> foldcov<-3 # a RAD locus is not gennotyped if its total seq coverage >> is more than foldcov times the overall mean coverage >> (=length(posi)/length(u.chrpos)). eliminates repeats >> prop.1.2<-0.7 # a RAD locus is not genotyped if the proportion of the >> two dominant haplotypes together is <= prop.1.2, relative to the total >> coverage of the locus. eliminates repeats >> sum.1.2<-15 # a RAD locus in not genotyped as DIPLOID if the sum of >> the coverage of the two dominant haplotypes (or simply the coverage >> for monomorphic loci) is less than sum.1.2 >> het<-1/3 # a diploid locus is called heterozygous if the ratio >> count.htp.2nd/count.htp.dominant is greater than het >> min<-2 # a haploid locus is called when coverage is at least min (but >> lower than sum.1.2) >> >> # infile upload: >> #folder<-'/Users/dani/Documents/genome scan/novocraft' #here are the >> BAMs >> folder<-'C:\\Documents and Settings\\daniel\\Desktop\\temp' #here are >> the BAMs >> infile<-paste(folder, "/novAl_lib_", lib,"_", M, ".bam", sep="") # >> infile path >> outfile<-paste(folder, '/', 'lib_', lib, '_', M, ".consensus.txt", >> sep="") # outfile path >> #outfile<-paste("/Users/dani/Documents/genome\ >> scan/R_bioconductor/", 'lib_', lib, '/', 'lib_', lib, '_', M, >> ".consensus.txt",sep="") >> param >> <-ScanBamParam(flag=scanBamFlag(isUnmappedQuery=FALSE),what=c("rname", >> "pos", "seq", "strand"), tag="Z3", reverseComplement=TRUE) >> f<-scanBam(infile, param=param)[[1]] #upload for a c. 1GB BAM is c. 1 >> minute >> >> # extract the relevant data: >> chr<-c(as.character(f$rname[grep("-",f$strand,invert=T)]),as.charac ter(f$rname[grep("-",f$strand,invert=F)])) >> >> posi<-c(f$pos[grep("-",f$strand,invert=T)],f$tag[[1]][grep("-",f$st rand,invert=F)]) >> > > I was wondering if that 'f$tag' was a typo, it seems to break the > symmetry of these commands...? > >> seqs<-as.character(c(narrow(f$seq,start=6,end=94)[grep("-",f$strand ,invert=T)],narrow(f$seq,start=6,end=94)[grep("-",fstrand,invert=F)]) ) >> >> chrpos<-paste(chr, posi, sep=' ') >> u.chrpos<-sort(unique(chrpos)) >> length(posi) #total number of alignments >> length(u.chrpos) #this gives the number of unique loci >> length(posi)/length(u.chrpos) #mean coverage per locus >> c.thr<-as.integer(length(posi)/length(u.chrpos)*foldcov) # relative >> coverage threshold based on average coverage >> >> # derive the data object to process: >> dat<-data.frame(cbind(chrpos, seqs)) > > probably want data.frame(<...>, stringsAsFactors=FALSE) > >> >> # clean the environment and release memory: >> rm(f, chr, posi, seqs, chrpos) #no longer used >> gc() >> >> # process the dat object while writing to 'cons' >> # outfile data structure: chrpos / coverage / X-Y (needed for SNP >> calling) / diploid_homoz(N)-OR-diploid_heteroz(Y)-OR-haploid_low(L) / seq >> cons<-NULL > > A variant of Herve's suggestion to use append=TRUE (his idea is better!) is > > con <- file(outfile, "w") > > which opens the file, and then instead of accumulating the results in > memory you could, inside your loop, > > write.table(<intermediate result="">, con) > > and then when done remember to > > close(con) > > you'll need to make sure not to write headers in the write.table() > statement. But I think this is mostly not relevant in your revised work > flow... > >> for (i in 1:length(u.chrpos)){ > > My suggestion is a little complicated, but you're really interested in > turning this loop into a vector operation. I think you can do this by > sorting your data frame > > dat <- dat[order(datchrpos, dat$seq),] > > It's helpful to have a 'view' onto dat, where the view is a data.frame > with indices marking the beginning and end of each 'chrpos' group. I > created some helper functions > > ends <- function(x) > c([-length(x)] != x[-1], TRUE) > starts <- function(x) > c(TRUE, ends(x)[-length(x)]) > > which creates a logical vector of the same length as x, with a TRUE > whenever a run of identical chrpos start or end (something similar could > be done with the rle() or Rle() functions). So my view is then > > view <- data.frame(start = which(starts(dat$chrpos)), > end = which(ends(dat$chrpos))) > >> stack<-DNAStringSet(as.character(subset(dat, >> chrpos==u.chrpos[i])[,2])) >> stack.le<-length(stack) >> if(stack.le<=c.thr){ > > for this if statement we need to know how many distinct sequences are in > each view. I did this by numbering the sequences sequentially > > seqid <- cumsum(starts(dat$seqs)) > > and then finding the difference between the id of the last and first > sequence > > view$n_seq <- seqid[view$end] - seqid[view$start] + 1L > >> #monomorphic: >> if (length(unique(stack))==1){ > > The 'view' onto the data representing the monomorphic locations is > > mono <- view[view$n_seq == 1, , drop=FALSE] > >> #well covered: >> if(stack.le>=sum.1.2){ > > and the 'well-covered' locations are > > idx <- (mono$end - mono$start + 1L) >= sum.1.2 > ok <- mono[idx, , drop=FALSE] > >> gtp.X<-paste(u.chrpos[i], stack.le, "X", "N", >> as.character(stack[1]), sep=" ") > > and you're after the information > > gtp.X <- paste(dat[ok$start, "chrpos"], ok$n_seq, "X", "N", > dat[ok$start, "seqs"]) > > (although I think you're really just adding columns and entries to > 'view'...). To emphasize, this processes _all_ the monomorphic, > well-covered sites, without any looping. > > And so on through the rest of your code. > > This type of operation is well-suited to data.table, though I'm not sure > enough of the syntax and implementation to know whether Steve's > > dat <- data.table(chr=chr, pos=posi, seqs=seqs, key=c('chr', 'pos')) > result <- dat[, { > list(n.reads=.N, n.unique=length(unique(seqs))) > }, by=c('chr', 'pos')] > > is implemented efficiently -- I'm sure the .N is; just not whether > clever thinking is used behind the scenes to avoid looping through > function(x) length(unique(x)). The syntax is certainly clearer than my > 'view' approach. > > Martin > > >> gtp.Y<-paste(u.chrpos[i], stack.le, "Y", "N", >> as.character(stack[1]), sep=" ") >> cons<-rbind(cons, gtp.X, gtp.Y) >> } >> #poorly covered: >> if((stack.le<sum.1.2) &&="" (stack.le="">=min)){ >> gtp.X<-paste(u.chrpos[i], stack.le, "X", "L", >> as.character(stack[1]), sep=" ") >> cons<-rbind(cons, gtp.X) >> } >> } >> #polymorphic: >> if(length(unique(stack))>1){ >> cnts<-sort(table(as.character(stack)), decreasing=T)[1:2] >> #no repeat issue: >> if((sum(cnts)/stack.le)>prop.1.2){ >> #heterozygote: >> if((sum(cnts)>=sum.1.2) && ((cnts[2]/cnts[1])>het)){ >> gtp.X<-paste(u.chrpos[i], stack.le, "X", "Y", >> as.character(names(cnts)[1]), sep=" ") >> gtp.Y<-paste(u.chrpos[i], stack.le, "Y", "Y", >> as.character(names(cnts)[2]), sep=" ") >> cons<-rbind(cons, gtp.X, gtp.Y) >> } >> #homozygote (minor variant too rare, artifact): >> if((sum(cnts)>=sum.1.2) && ((cnts[2]/cnts[1])<=het)){ >> # minor variant too rare, artifact; hence locus diploid homozygote >> gtp.X<-paste(u.chrpos[i], stack.le, "X", "N", >> as.character(names(cnts)[1]), sep=" ") >> gtp.Y<-paste(u.chrpos[i], stack.le, "Y", "N", >> as.character(names(cnts)[1]), sep=" ") >> cons<-rbind(cons, gtp.X, gtp.Y) >> } >> #coverage poor, gtp unclear, haploid call: >> if((sum(cnts)<sum.1.2) &&="" (cnts[1]="">=min)){ >> gtp.X<-paste(u.chrpos[i], stack.le, "X", "L", >> as.character(names(cnts)[1]), sep=" ") >> cons<-rbind(cons, gtp.X) >> } >> } >> } >> } >> } >> write.table(cons, outfile, row.names=F, col.names=F, quote=F) #write >> to individual file >> ### END >> >> >> [[alternative HTML version deleted]] >> >> _______________________________________________ >> Bioconductor mailing list >> Bioconductor at r-project.org >> https://stat.ethz.ch/mailman/listinfo/bioconductor >> Search the archives: >> http://news.gmane.org/gmane.science.biology.informatics.conductor >> > > -- Hervé Pagès Program in Computational Biology Division of Public Health Sciences Fred Hutchinson Cancer Research Center 1100 Fairview Ave. N, M1-B514 P.O. Box 19024 Seattle, WA 98109-1024 E-mail: hpages at fhcrc.org Phone: (206) 667-5791 Fax: (206) 667-1319 ADD COMMENTlink modified 5.1 years ago by Steve Lianoglou12k • written 5.1 years ago by Hervé Pagès ♦♦ 13k 0 5.1 years ago by Hervé Pagès ♦♦ 13k United States Hervé Pagès ♦♦ 13k wrote: On 04/30/2013 08:12 PM, Hervé Pagès wrote: > An alternative to sorting the data.frame and creating views on it is to > use a DataFrame and split it: > > dat <- DataFrame(chrpos, seqs) > splitted_dat <- split(dat, dat$chrpos) splitted? mmmh, looks like I should use split_dat instead of splitted_dat... H. > > Unlike splitting a data.frame, which is generally very inefficient > (especially if the split factor has a lot of levels -- don't try this > with > 50000 levels or you'll blow your machine), splitting a DataFrame > is very fast and very efficient. If the split factor has > 20000 levels, > it will typically be hundreds or thousands times faster than with a > data.frame and use hundreds times less memory. Thanks Michael! > > The result of splitting a DataFrame is a SplitDataFrameList object, > which is a particular type of CompressedList object: > > > is(splitted_dat, "CompressedList") > [1] TRUE > > A SplitDataFrameList, like any CompressedList object, can be unlisted > in the blink of an eye: > > > system.time(dat2 <- unlist(splitted_dat)) > user system elapsed > 0.000 0.000 0.001 > > So what was the point of splitting and now unlisting? In 'dat2', all > the rows that correspond to the same locus (i.e. same dat2chrpos) are > grouped together, like in Martin's sorted data.frame, and the indices > of the starting and ending row of each group (Martin's views) can be > obtained by doing PartitioningByEnd() on 'splitted_dat': > > views <- PartitioningByEnd(splitted_dat) > > > views > PartitioningByEnd of length 1237342 > start end width names > [1] 1 3 3 chr1 1 > [2] 4 14 11 chr1 10 > [3] 15 20 6 chr1 100 > [4] 21 27 7 chr1 1000 > ... ... ... ... ... > [1237339] 201767 201770 4 chrY 1854205 > [1237340] 201771 201773 3 chrY 1854206 > [1237341] 201774 201784 11 chrY 1854207 > [1237342] 201785 201796 12 chrY 1854208 > > Then get the starts/ends of the views with: > > start(views) > end(views) > > HTH, > H. > > > On 04/30/2013 01:53 PM, Martin Morgan wrote: >> On 04/29/2013 11:52 PM, Daniel Berner wrote: >>> Martin, Steve >>> >>> Thanks for your suggestions, much appreciated! My core problem is that >>> the >>> operations inside the loop are too complex for me to allow >>> implementing any >>> function() and apply() combination (there are multiple interleaved >>> if() lines >>> etc). and I was not aware of the data.table package, but this also looks >>> hard to me, given the many steps performed within the loop. >>> >>> to be more specific, I want to reduce a number of Illumina short >>> reads at >>> genomic loci to diploid or haploid (depending on coverage) consensus >>> genotypes, by evaluating several test criteria. Below I am copying the >>> full >>> code. the input is an alignment in BAM format. >>> >>> Any suggestion to make this faster is most welcome, as the code takes >>> more >>> than a week to run for each individual, and I have to process 96 of >>> them (I >>> have multicore, but still)! It seems to me that the rate-limiting step >>> is the >>> subsetting; the other manipulations are fast enough to work in such a >>> primitive loop. >>> >>> cheers, >>> Daniel Berner >>> Zoological Institute >>> University of Basel >>> >>> ######################### >>> library(Rsamtools) >>> rm(list=ls()) >>> M<-"GAAGC" ### specify MID (individual) >>> lib<-25 ### specify library >>> >>> # genotyping settings: >>> foldcov<-3 # a RAD locus is not gennotyped if its total seq coverage >>> is more than foldcov times the overall mean coverage >>> (=length(posi)/length(u.chrpos)). eliminates repeats >>> prop.1.2<-0.7 # a RAD locus is not genotyped if the proportion of the >>> two dominant haplotypes together is <= prop.1.2, relative to the total >>> coverage of the locus. eliminates repeats >>> sum.1.2<-15 # a RAD locus in not genotyped as DIPLOID if the sum of >>> the coverage of the two dominant haplotypes (or simply the coverage >>> for monomorphic loci) is less than sum.1.2 >>> het<-1/3 # a diploid locus is called heterozygous if the ratio >>> count.htp.2nd/count.htp.dominant is greater than het >>> min<-2 # a haploid locus is called when coverage is at least min (but >>> lower than sum.1.2) >>> >>> # infile upload: >>> #folder<-'/Users/dani/Documents/genome scan/novocraft' #here are the >>> BAMs >>> folder<-'C:\\Documents and Settings\\daniel\\Desktop\\temp' #here are >>> the BAMs >>> infile<-paste(folder, "/novAl_lib_", lib,"_", M, ".bam", sep="") # >>> infile path >>> outfile<-paste(folder, '/', 'lib_', lib, '_', M, ".consensus.txt", >>> sep="") # outfile path >>> #outfile<-paste("/Users/dani/Documents/genome\ >>> scan/R_bioconductor/", 'lib_', lib, '/', 'lib_', lib, '_', M, >>> ".consensus.txt",sep="") >>> param >>> <-ScanBamParam(flag=scanBamFlag(isUnmappedQuery=FALSE),what=c("rname", >>> "pos", "seq", "strand"), tag="Z3", reverseComplement=TRUE) >>> f<-scanBam(infile, param=param)[[1]] #upload for a c. 1GB BAM is c. 1 >>> minute >>> >>> # extract the relevant data: >>> chr<-c(as.character(frname[grep("-",f$strand,invert=T)]),as.chara cter(f$rname[grep("-",f$strand,invert=F)])) >>> >>> >>> posi<-c(f$pos[grep("-",f$strand,invert=T)],f$tag[[1]][grep("-",f$s trand,invert=F)]) >>> >>> >> >> I was wondering if that 'f$tag' was a typo, it seems to break the >> symmetry of these commands...? >> >>> seqs<-as.character(c(narrow(f$seq,start=6,end=94)[grep("-",f$stran d,invert=T)],narrow(f$seq,start=6,end=94)[grep("-",f$strand,invert=F)] )) >>> >>> >>> chrpos<-paste(chr, posi, sep=' ') >>> u.chrpos<-sort(unique(chrpos)) >>> length(posi) #total number of alignments >>> length(u.chrpos) #this gives the number of unique loci >>> length(posi)/length(u.chrpos) #mean coverage per locus >>> c.thr<-as.integer(length(posi)/length(u.chrpos)*foldcov) # relative >>> coverage threshold based on average coverage >>> >>> # derive the data object to process: >>> dat<-data.frame(cbind(chrpos, seqs)) >> >> probably want data.frame(<...>, stringsAsFactors=FALSE) >> >>> >>> # clean the environment and release memory: >>> rm(f, chr, posi, seqs, chrpos) #no longer used >>> gc() >>> >>> # process the dat object while writing to 'cons' >>> # outfile data structure: chrpos / coverage / X-Y (needed for SNP >>> calling) / diploid_homoz(N)-OR-diploid_heteroz(Y)-OR- haploid_low(L) / >>> seq >>> cons<-NULL >> >> A variant of Herve's suggestion to use append=TRUE (his idea is >> better!) is >> >> con <- file(outfile, "w") >> >> which opens the file, and then instead of accumulating the results in >> memory you could, inside your loop, >> >> write.table(<intermediate result="">, con) >> >> and then when done remember to >> >> close(con) >> >> you'll need to make sure not to write headers in the write.table() >> statement. But I think this is mostly not relevant in your revised work >> flow... >> >>> for (i in 1:length(u.chrpos)){ >> >> My suggestion is a little complicated, but you're really interested in >> turning this loop into a vector operation. I think you can do this by >> sorting your data frame >> >> dat <- dat[order(dat$chrpos, dat$seq),] >> >> It's helpful to have a 'view' onto dat, where the view is a data.frame >> with indices marking the beginning and end of each 'chrpos' group. I >> created some helper functions >> >> ends <- function(x) >> c([-length(x)] != x[-1], TRUE) >> starts <- function(x) >> c(TRUE, ends(x)[-length(x)]) >> >> which creates a logical vector of the same length as x, with a TRUE >> whenever a run of identical chrpos start or end (something similar could >> be done with the rle() or Rle() functions). So my view is then >> >> view <- data.frame(start = which(starts(dat$chrpos)), >> end = which(ends(dat$chrpos))) >> >>> stack<-DNAStringSet(as.character(subset(dat, >>> chrpos==u.chrpos[i])[,2])) >>> stack.le<-length(stack) >>> if(stack.le<=c.thr){ >> >> for this if statement we need to know how many distinct sequences are in >> each view. I did this by numbering the sequences sequentially >> >> seqid <- cumsum(starts(dat$seqs)) >> >> and then finding the difference between the id of the last and first >> sequence >> >> view$n_seq <- seqid[view$end] - seqid[view$start] + 1L >> >>> #monomorphic: >>> if (length(unique(stack))==1){ >> >> The 'view' onto the data representing the monomorphic locations is >> >> mono <- view[view$n_seq == 1, , drop=FALSE] >> >>> #well covered: >>> if(stack.le>=sum.1.2){ >> >> and the 'well-covered' locations are >> >> idx <- (mono$end - mono$start + 1L) >= sum.1.2 >> ok <- mono[idx, , drop=FALSE] >> >>> gtp.X<-paste(u.chrpos[i], stack.le, "X", "N", >>> as.character(stack[1]), sep=" ") >> >> and you're after the information >> >> gtp.X <- paste(dat[ok$start, "chrpos"], ok$n_seq, "X", "N", >> dat[ok$start, "seqs"]) >> >> (although I think you're really just adding columns and entries to >> 'view'...). To emphasize, this processes _all_ the monomorphic, >> well-covered sites, without any looping. >> >> And so on through the rest of your code. >> >> This type of operation is well-suited to data.table, though I'm not sure >> enough of the syntax and implementation to know whether Steve's >> >> dat <- data.table(chr=chr, pos=posi, seqs=seqs, key=c('chr', 'pos')) >> result <- dat[, { >> list(n.reads=.N, n.unique=length(unique(seqs))) >> }, by=c('chr', 'pos')] >> >> is implemented efficiently -- I'm sure the .N is; just not whether >> clever thinking is used behind the scenes to avoid looping through >> function(x) length(unique(x)). The syntax is certainly clearer than my >> 'view' approach. >> >> Martin >> >> >>> gtp.Y<-paste(u.chrpos[i], stack.le, "Y", "N", >>> as.character(stack[1]), sep=" ") >>> cons<-rbind(cons, gtp.X, gtp.Y) >>> } >>> #poorly covered: >>> if((stack.le<sum.1.2) &&="" (stack.le="">=min)){ >>> gtp.X<-paste(u.chrpos[i], stack.le, "X", "L", >>> as.character(stack[1]), sep=" ") >>> cons<-rbind(cons, gtp.X) >>> } >>> } >>> #polymorphic: >>> if(length(unique(stack))>1){ >>> cnts<-sort(table(as.character(stack)), decreasing=T)[1:2] >>> #no repeat issue: >>> if((sum(cnts)/stack.le)>prop.1.2){ >>> #heterozygote: >>> if((sum(cnts)>=sum.1.2) && ((cnts[2]/cnts[1])>het)){ >>> gtp.X<-paste(u.chrpos[i], stack.le, "X", "Y", >>> as.character(names(cnts)[1]), sep=" ") >>> gtp.Y<-paste(u.chrpos[i], stack.le, "Y", "Y", >>> as.character(names(cnts)[2]), sep=" ") >>> cons<-rbind(cons, gtp.X, gtp.Y) >>> } >>> #homozygote (minor variant too rare, artifact): >>> if((sum(cnts)>=sum.1.2) && ((cnts[2]/cnts[1])<=het)){ >>> # minor variant too rare, artifact; hence locus diploid homozygote >>> gtp.X<-paste(u.chrpos[i], stack.le, "X", "N", >>> as.character(names(cnts)[1]), sep=" ") >>> gtp.Y<-paste(u.chrpos[i], stack.le, "Y", "N", >>> as.character(names(cnts)[1]), sep=" ") >>> cons<-rbind(cons, gtp.X, gtp.Y) >>> } >>> #coverage poor, gtp unclear, haploid call: >>> if((sum(cnts)<sum.1.2) &&="" (cnts[1]="">=min)){ >>> gtp.X<-paste(u.chrpos[i], stack.le, "X", "L", >>> as.character(names(cnts)[1]), sep=" ") >>> cons<-rbind(cons, gtp.X) >>> } >>> } >>> } >>> } >>> } >>> write.table(cons, outfile, row.names=F, col.names=F, quote=F) #write >>> to individual file >>> ### END >>> >>> >>> [[alternative HTML version deleted]] >>> >>> _______________________________________________ >>> Bioconductor mailing list >>> Bioconductor at r-project.org >>> https://stat.ethz.ch/mailman/listinfo/bioconductor >>> Search the archives: >>> http://news.gmane.org/gmane.science.biology.informatics.conductor >>> >> >> > -- Hervé Pagès Program in Computational Biology Division of Public Health Sciences Fred Hutchinson Cancer Research Center 1100 Fairview Ave. N, M1-B514 P.O. Box 19024 Seattle, WA 98109-1024 E-mail: hpages at fhcrc.org Phone: (206) 667-5791 Fax: (206) 667-1319
0
5.1 years ago by
Denali
Steve Lianoglou12k wrote:
Hi folks, Wow -- that's a lot of great suggestions coming straight from the bioc-wizards themselves. I'm still not going to jump into the details on the logic of what Daniel *really* wants, just want to make a comment on Martin's last point: > This type of operation is well-suited to data.table, though I'm not sure > enough of the syntax and implementation to know whether Steve's > > > dat <- data.table(chr=chr, pos=posi, seqs=seqs, key=c('chr', 'pos')) > result <- dat[, { > list(n.reads=.N, n.unique=length(unique(seqs))) > }, by=c('chr', 'pos')] > > is implemented efficiently -- I'm sure the .N is; just not whether clever > thinking is used behind the scenes to avoid looping through function(x) > length(unique(x)). The syntax is certainly clearer than my 'view' approach. I only really used this as a pedagogical example to show that one could access subsets of the columns directly by name within the j expression of the [.data.table function. The .N is essentially a no-op to call as it is already computed for you, but repeatedly calling a function within each grouped subset will incur the overhead of a function call within each subgroup. Still, I think the OP would notice a significant boost in performance by simply naively translating his code using data.table -- if you really wanted to eek out the last bit of performance (which isn't really necessary if you're just doing things once, but if you're building a pipeline, fell free) that'd be another convo ... Anyway, it looks like there's a lot of good stuff in this thread already. I'd be curious to here back from Daniel when he tries a few of these things. Also, wasn't aware of the new(?) SplitDataFrame mojo -- very nice stuff. -steve -- Steve Lianoglou Computational Biologist Department of Bioinformatics and Computational Biology Genentech
Hi Steve, On 05/01/2013 09:50 AM, Steve Lianoglou wrote: > Hi folks, > > Wow -- that's a lot of great suggestions coming straight from the > bioc-wizards themselves. > > I'm still not going to jump into the details on the logic of what > Daniel *really* wants, just want to make a comment on Martin's last > point: > >> This type of operation is well-suited to data.table, though I'm not sure >> enough of the syntax and implementation to know whether Steve's >> >> >> dat <- data.table(chr=chr, pos=posi, seqs=seqs, key=c('chr', 'pos')) >> result <- dat[, { >> list(n.reads=.N, n.unique=length(unique(seqs))) >> }, by=c('chr', 'pos')] >> >> is implemented efficiently -- I'm sure the .N is; just not whether clever >> thinking is used behind the scenes to avoid looping through function(x) >> length(unique(x)). The syntax is certainly clearer than my 'view' approach. > > I only really used this as a pedagogical example to show that one > could access subsets of the columns directly by name within the j > expression of the [.data.table function. > > The .N is essentially a no-op to call as it is already computed for > you, but repeatedly calling a function within each grouped subset will > incur the overhead of a function call within each subgroup. > > Still, I think the OP would notice a significant boost in performance > by simply naively translating his code using data.table -- if you > really wanted to eek out the last bit of performance (which isn't > really necessary if you're just doing things once, but if you're > building a pipeline, fell free) that'd be another convo ... > > Anyway, it looks like there's a lot of good stuff in this thread > already. I'd be curious to here back from Daniel when he tries a few > of these things. Also, wasn't aware of the new(?) SplitDataFrame > mojo -- very nice stuff. This has been in IRanges since the beginning of the package (2008). IIRC when Michael added DataFrame/SplitDataFrameList to the package, the primary use case was to store the values of a RangedData object (the "values" slot of RangedData is SplitDataFrameList). If *you* were not aware of SplitDataFrameList, then that means that *we* are failing to properly communicate/document/expose/advertise the IRanges/GenomicRanges infrastructure :-/ H. > > -steve > > -- > Steve Lianoglou > Computational Biologist > Department of Bioinformatics and Computational Biology > Genentech > > _______________________________________________ > Bioconductor mailing list > Bioconductor at r-project.org > https://stat.ethz.ch/mailman/listinfo/bioconductor > Search the archives: http://news.gmane.org/gmane.science.biology.informatics.conductor > -- Hervé Pagès Program in Computational Biology Division of Public Health Sciences Fred Hutchinson Cancer Research Center 1100 Fairview Ave. N, M1-B514 P.O. Box 19024 Seattle, WA 98109-1024 E-mail: hpages at fhcrc.org Phone: (206) 667-5791 Fax: (206) 667-1319