Deleting object rows while looping
@danielbernerunibasch-4268
Hi,

Can someone help me with this question? I have a large data frame (say `dat`) with 2 columns: one is genomic loci (chromosome-by-position, e.g. `chr1_1253454`), the other is Illumina sequences. I want to perform some operations on each UNIQUE locus, so I first derive the unique loci:

```r
u.loc <- unique(dat[, 1])
```

and build a loop that gives me access to the relevant data for each unique locus, where I perform my operations:

```r
for (i in 1:length(u.loc)) {
  subdat <- subset(dat, dat[, 1] == u.loc[i])
  # now the relevant sequence data are accessible for my operations...
}
```

This works fine. But since `dat` has some 10 million rows, the `subset()` call takes time, so the whole loop is slow. I would therefore like to drop the rows already processed within the loop, which should speed up the code as it progresses. I thought about adding this as the last line inside the loop:

```r
dat <- dat[-as.integer(row.names(subdat)), ]
```

This should eliminate the processed rows and continuously shrink the `dat` object. However, the output I get when using this extra line is incorrect: it does not agree with the output I get without row deletion. It seems the deletion does not work correctly. Any idea why this is, and how I could do the row elimination properly?

Thanks!

Daniel Berner
Zoological Institute
University of Basel
Vesalgasse 1
4051 Basel
Switzerland
+41 (0)61 267 0328
daniel.berner@unibas.ch
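A note on why the deletion misbehaves: after rows are removed, the surviving rows of a data frame keep their ORIGINAL row-name labels, but negative indexing counts positions in the shrunken data frame, so `-as.integer(row.names(subdat))` points at the wrong rows from the second iteration on. A minimal sketch on toy data (the column names and values here are illustrative, not from the post) showing a label-based removal that stays correct:

```r
## Toy stand-in for 'dat': same two-column layout as in the question.
dat <- data.frame(locus = c("chr1_10", "chr1_10", "chr2_20", "chr2_20", "chr3_30"),
                  seq   = c("A", "B", "C", "D", "E"),
                  stringsAsFactors = FALSE)

u.loc <- unique(dat[, 1])
for (i in seq_along(u.loc)) {
  subdat <- dat[dat[, 1] == u.loc[i], ]
  ## ... per-locus operations on subdat would go here ...

  ## Match on row-name LABELS instead of converting them to positions;
  ## this stays correct even after earlier deletions have shifted
  ## the positional indices of the remaining rows.
  dat <- dat[!(rownames(dat) %in% rownames(subdat)), ]
}

## Every row has now been processed and removed exactly once.
nrow(dat)  # 0
```

Equivalently, `dat <- dat[dat[, 1] != u.loc[i], ]` drops the processed locus directly without touching row names at all.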
@martin-morgan-1513
On 04/29/2013 01:28 AM, Daniel Berner wrote (question quoted above):

Hi Daniel --

One possibility is to use `split()`:

```r
lst = split(dat[,2], dat[,1])
```

(it would be very expensive to create, say, 1 million data.frames in a call like `split(dat, dat[,1])`; stick with vectors only if possible) and then `lapply()`/`sapply()`/`mapply()`:

```r
result = lapply(lst, doWork)
```

but probably better is to think about how to implement `doWork` so that it operates on the entire vector of sequences, to avoid the cost of invoking `doWork` on each unique value of `dat[,1]`. Hints about what is in `doWork` might lead to some suggestions on how to make it vectorized (or to functions that already implement this efficiently).

Martin

--
Computational Biology / Fred Hutchinson Cancer Research Center
1100 Fairview Ave. N. PO Box 19024 Seattle, WA 98109
Location: Arnold Building M1 B861
Phone: (206) 667-2793
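Martin's split-then-apply suggestion can be sketched end-to-end on toy data. The `doWork` body below (counting sequences per locus) is a placeholder, not a function from the thread; the fully vectorized alternative via `tapply` illustrates his point about operating on whole vectors at once:

```r
## Toy stand-in for 'dat' (column names and values are illustrative).
dat <- data.frame(locus = c("chr1_10", "chr1_10", "chr2_20"),
                  seq   = c("ACGT", "GGTA", "TTAA"),
                  stringsAsFactors = FALSE)

## One character vector of sequences per unique locus, keyed by locus.
lst <- split(dat[, 2], dat[, 1])

## Placeholder per-locus operation: count the sequences at each locus.
doWork <- function(seqs) length(seqs)
result <- vapply(lst, doWork, integer(1))

## Vectorized alternative when the operation allows it: one grouped
## pass over all rows instead of one doWork() call per locus.
result2 <- tapply(dat[, 2], dat[, 1], length)
```

With ~10 million rows, the grouped/vectorized form avoids millions of per-locus function calls, which is where the original loop spent its time.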
Greetings,

On Mon, Apr 29, 2013 at 6:39 AM, Martin Morgan (mtmorgan at fhcrc.org) wrote (quoted above):

These are the situations where the data.table package really shines, if you'd like to give it a shot.

Say the colnames of your data.frame are "locus" and "sequence"; you would do something like so:

```r
dt <- data.table(dat, key="locus")
result <- dt[, {
  ## The sequences that are in the current subset are injected into
  ## the scope of this expression by the name of their column
  ## (`sequence`); say you wanted to count the number of GATACA
  ## motifs in this subset (or something):
  list(n.motifs = length(grep('GATACA', sequence)))
}, by='locus']
```

I'd probably store this data with chromosome and position split, i.e. with column names such as "chr", "pos", "sequence", then convert this to a data.table and set the key to be c("chr", "pos"), e.g.:

```r
dt <- data.table(dat, key=c("chr", "pos"))
result <- dt[, {
  list(n.motifs = length(grep('GATACA', sequence)))
}, by=c("chr", "pos")]
```

You should find a large difference (for the better) in terms of speed and memory use as compared to other approaches (split, ddply, etc.) given the size of your data -- of course sometimes ||-ization is the right way to go, but you can try both and see.

HTH,
-steve

--
Steve Lianoglou
Computational Biologist
Department of Bioinformatics and Computational Biology
Genentech
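For a self-contained run of the grouped-aggregation pattern above, here is a toy demonstration (it assumes the data.table package is installed; the column names and GATACA motif follow Steve's illustrative example):

```r
library(data.table)

## Toy keyed table: two loci, three sequences.
dt <- data.table(locus    = c("chr1_10", "chr1_10", "chr2_20"),
                 sequence = c("GATACAGG", "TTTT", "AAGATACA"),
                 key = "locus")

## One result row per locus; inside the j-expression, `sequence` is
## the vector of sequences belonging to the current group.
result <- dt[, list(n.motifs = length(grep("GATACA", sequence))), by = "locus"]

## result has one row per locus: chr1_10 and chr2_20 each contain
## one sequence matching the GATACA motif.
```

Because the table is keyed, grouping by the key is essentially a single ordered scan, which is what makes this so much cheaper than 10 million `subset()` calls.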
Sorry -- I'm not quite in full swing yet ... this new monitor is making me read things funny:

> given the size of your data -- of course sometimes ||-ization is the
> right way to go, but you can try both and see.

I briefly skimmed Martin's post and thought he suggested ||-ization, which he didn't -- although his approach, with the work implemented in `doWork`, does lend itself nicely to ||-ization in the future, if need be. But that's neither here nor there ... even if some of us are actually here, and others are there.

-steve

--
Steve Lianoglou
Computational Biologist
Department of Bioinformatics and Computational Biology
Genentech