finding and deleting repeated observations


Vincent J. Carey, Jr. 6.7k


United States

On Mon, May 31, 2010 at 9:41 AM, Mervi Kinnunen <mervi.kinnunen@wri.fi> wrote:

> Hi,
>
> Thanks for helping me out. However, I couldn't get the script to work.
> Below is the description. How does the t(sapply(...)) script select the
> minimum p-value? I understand that the split creates a list where each
> occurring geneSymbol is present in a separate data frame. How does the
> script then compare the p-values within each frame and merge the data
> back into a single data frame?
>
> -Mervi
>
> > dd <- read.table("Myfile", sep='\t', h=T, as.is=T,
>     colClasses=c("character","numeric","numeric","numeric"))
> > str(dd)
> 'data.frame': 6 obs. of 4 variables:
>  $ geneSymbol: chr "ABC1" "ABC1" "AB" "ABCD1" ...
>  $ A         : num 12 2 4 15 11 9
>  $ B         : num 44 32 55 25 27 18
>  $ pvalue    : num 1e-02 5e-02 2e-01 5e-03 2e-03 1e-04
> > bb <- dd
> > bbs <- split(bb,bb[,1])
> > d <- t(sapply(bbs, function(x)x[which.min(x$originalpvalue),]))

There is no column in dd called 'originalpvalue', so your variation must fail. Use 'pvalue'.

> > str(d)
> List of 12
>  $ : chr(0)
>  $ : chr(0)
>  $ : chr(0)
>  $ : num(0)
>  $ : num(0)
>  $ : num(0)
>  $ : num(0)
>  $ : num(0)
>  $ : num(0)
>  $ : num(0)
>  $ : num(0)
>  $ : num(0)
>  - attr(*, "dim")= int [1:2] 3 4
>  - attr(*, "dimnames")=List of 2
>   ..$ : chr [1:3] "AB" "ABC1" "ABCD1"
>   ..$ : chr [1:4] "geneSymbol" "A" "B" "pvalue"
> > head(d)
>       geneSymbol  A         B         pvalue
> AB    Character,0 Numeric,0 Numeric,0 Numeric,0
> ABC1  Character,0 Numeric,0 Numeric,0 Numeric,0
> ABCD1 Character,0 Numeric,0 Numeric,0 Numeric,0
>
> From: Vincent Carey [mailto:stvjc@channing.harvard.edu]
> Sent: May 29, 2010 0:25
> To: mervi.alanne@wri.fi
> Cc: bioconductor@stat.math.ethz.ch
> Subject: Re: [BioC] finding and deleting repeated observations
>
> suppose you save your data as in the email to a file b.txt -- i ignore
> niceties of delimiter choice
>
> there are many ways of doing it, but here is one possibility
>
> > bb = read.table("b.txt", h=TRUE, colClasses=c("character", "numeric",
>     "numeric", "numeric"))
> > bbs = split(bb, bb[,1])
> > t(sapply(bbs, function(x) x[which.min(x$pvalue),]))
>       GeneSymbol A  B  pvalue
> AB    "AB"       4  55 0.2
> ABC1  "ABC1"     12 44 0.01
> ABCD1 "ABCD1"    9  18 1e-04
>
> it does what you ask, but the solution you gave below doesn't seem right
> (picked wrong values of A and B for correct ABC1 candidate?)
>
> On Fri, May 28, 2010 at 1:27 PM, mervi.alanne@wri.fi <mervi.alanne@wri.fi>
> wrote:
>
> Dear all,
>
> I'm a novice with R and could use some help. How could I find repeated
> observations based on one column and select the one to keep based on
> another column?
>
> In more detail, this is the thing I want to achieve:
> -data.frame has 4 columns GeneSymbol, A, B, pvalue
> -data in column GeneSymbol may be repeated 1-6 times
> -data also contains unique observations
> -Of the repeated obs, keep the obs which has the lowest pvalue
> -Do not discard data from cols A and B
>
> Example input data:
> GeneSymbol A  B  pvalue
> ABC1       12 44 0.01
> ABC1       2  32 0.05
> AB         4  55 0.2
> ABCD1      15 25 0.005
> ABCD1      11 27 0.002
> ABCD1      9  18 0.0001
>
> I'd like the output to look like this:
> GeneSymbol A  B  pvalue
> ABC1       2  32 0.01
> AB         4  55 0.2
> ABCD1      9  18 0.0001
>
> Any suggestions?
>
> -Mervi
> Wihuri Research Institute
>
> _______________________________________________
> Bioconductor mailing list
> Bioconductor@stat.math.ethz.ch
> https://stat.ethz.ch/mailman/listinfo/bioconductor
> Search the archives:
> http://news.gmane.org/gmane.science.biology.informatics.conductor

[[alternative HTML version deleted]]
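On the question of how the one-liner picks the minimum p-value and puts the pieces back together, the steps can be sketched separately as below. This is only an illustration, under two assumptions: the example data are built inline rather than read from b.txt, and t(sapply(...)) is swapped for do.call(rbind, lapply(...)), which should give an ordinary data frame instead of a matrix whose cells are list elements. The names `best`, `result` and `empty` are made up for the sketch.

```r
# Example data from the thread, built inline (assumption: no input file).
bb <- data.frame(
  GeneSymbol = c("ABC1", "ABC1", "AB", "ABCD1", "ABCD1", "ABCD1"),
  A          = c(12, 2, 4, 15, 11, 9),
  B          = c(44, 32, 55, 25, 27, 18),
  pvalue     = c(0.01, 0.05, 0.2, 0.005, 0.002, 0.0001),
  stringsAsFactors = FALSE
)

# Step 1: split() returns a named list with one small data frame per symbol.
bbs <- split(bb, bb$GeneSymbol)

# Step 2: inside each piece, which.min() gives the position of the smallest
# p-value, and x[i, ] keeps that whole row, so A and B travel with it.
best <- lapply(bbs, function(x) x[which.min(x$pvalue), ])

# Step 3: rbind() stacks the one-row pieces back into a single data frame.
result <- do.call(rbind, best)
print(result)

# Why the 'originalpvalue' variant returned empty cells: a column that does
# not exist is NULL, and which.min(NULL) has length zero, so no row is kept.
empty <- which.min(bb$originalpvalue)
print(length(empty))
```

Note that no p-values are compared across groups: each group is handled on its own, and the only "merge" is the final row-binding of one row per gene symbol.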


Scott Ochsner ▴ 300


Hi Mervi,

One solution is to order your data frame by "pvalue" using the order function and then to remove duplicated "GeneSymbol" entries using !duplicated.

> A<-c(12,2,4,15,11,9)
> B<-c(44,32,55,25,27,18)
> pvalue<-c(.01,.05,.2,.005,.002,.0001)
> GeneSymbol<-c(rep("ABC1",2),"AB",rep("ABCD1",3))
> tmp<-as.data.frame(cbind(A,B,pvalue))
> tmp<-cbind(GeneSymbol,tmp)
> tmp
  GeneSymbol  A  B pvalue
1       ABC1 12 44  1e-02
2       ABC1  2 32  5e-02
3         AB  4 55  2e-01
4      ABCD1 15 25  5e-03
5      ABCD1 11 27  2e-03
6      ABCD1  9 18  1e-04

## reorder your data frame by pvalue
> tmp.ordered <- tmp[order(tmp$pvalue),]
> tmp.ordered
  GeneSymbol  A  B pvalue
6      ABCD1  9 18  1e-04
5      ABCD1 11 27  2e-03
4      ABCD1 15 25  5e-03
1       ABC1 12 44  1e-02
2       ABC1  2 32  5e-02
3         AB  4 55  2e-01

## select the first instance of a gene symbol and remove all others.
## Because you have ordered by pvalue, you will automatically select
## the row with the lowest pvalue for each gene symbol.
> tmp.sub<- tmp.ordered[!duplicated(tmp.ordered$GeneSymbol),]
> tmp.sub
  GeneSymbol  A  B pvalue
6      ABCD1  9 18  1e-04
1       ABC1 12 44  1e-02
3         AB  4 55  2e-01

## reorder your data frame as before using the rownames.
## (rownames are character, so convert to numeric; otherwise "10" sorts before "2")
> tmp.sub<-tmp.sub[order(as.numeric(rownames(tmp.sub))),]
> tmp.sub
  GeneSymbol  A  B pvalue
1       ABC1 12 44  1e-02
3         AB  4 55  2e-01
6      ABCD1  9 18  1e-04

Scott

Scott A. Ochsner, PhD
One Baylor Plaza BCM130, Houston, TX 77030
Voice: (713) 798-6227 Fax: (713) 790-1275
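The order-then-!duplicated idea can also be wrapped in a small helper. The sketch below is one possible way to do it, not part of the original answer: the function name `keep_min_pvalue` and its `key` and `score` arguments are hypothetical, and it tracks row positions instead of row names, which are only a reliable record of the original order when they are the default 1..n.

```r
# Hypothetical helper: keep, for each value of `key`, the row with the
# smallest `score`, returning rows in their original order.
keep_min_pvalue <- function(df, key = "GeneSymbol", score = "pvalue") {
  ord  <- order(df[[score]])                # row positions, smallest score first
  keep <- ord[!duplicated(df[[key]][ord])]  # first (lowest-score) position per key
  df[sort(keep), , drop = FALSE]            # sort positions to restore input order
}

# Example data from the thread.
tmp <- data.frame(
  GeneSymbol = c(rep("ABC1", 2), "AB", rep("ABCD1", 3)),
  A          = c(12, 2, 4, 15, 11, 9),
  B          = c(44, 32, 55, 25, 27, 18),
  pvalue     = c(.01, .05, .2, .005, .002, .0001),
  stringsAsFactors = FALSE
)

out <- keep_min_pvalue(tmp)
print(out)
```

If two rows of the same gene symbol tie on pvalue, this sketch keeps whichever comes first in the input, because order() preserves the input order of ties.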
