finding and deleting repeated observations
@vincent-j-carey-jr-4
Last seen 3 months ago
United States
On Mon, May 31, 2010 at 9:41 AM, Mervi Kinnunen <mervi.kinnunen@wri.fi> wrote:

> Hi,
>
> Thanks for helping me out. However, I couldn't get the script to work.
> Below is the description. How does the t(sapply(...)) script select the
> minimum p-value? I understand that the split creates a list where each
> occurring geneSymbol is present in a separate data frame. How does the
> script then compare the p-values within each frame and merge the data
> back into a single data frame?
>
> -Mervi
>
> > dd <- read.table("Myfile", sep='\t', h=T, as.is=T,
>     colClasses=c("character","numeric","numeric","numeric"))
> > str(dd)
> 'data.frame': 6 obs. of 4 variables:
>  $ geneSymbol: chr "ABC1" "ABC1" "AB" "ABCD1" ...
>  $ A         : num 12 2 4 15 11 9
>  $ B         : num 44 32 55 25 27 18
>  $ pvalue    : num 1e-02 5e-02 2e-01 5e-03 2e-03 1e-04
> > bb <- dd
> > bbs <- split(bb,bb[,1])
> > d <- t(sapply(bbs, function(x)x[which.min(x$originalpvalue),]))

There is no column in dd called 'originalpvalue', so your variation must fail. Use 'pvalue'.

> > str(d)
> List of 12
>  $ : chr(0)
>  $ : chr(0)
>  $ : chr(0)
>  $ : num(0)
>  $ : num(0)
>  $ : num(0)
>  $ : num(0)
>  $ : num(0)
>  $ : num(0)
>  $ : num(0)
>  $ : num(0)
>  $ : num(0)
>  - attr(*, "dim")= int [1:2] 3 4
>  - attr(*, "dimnames")=List of 2
>   ..$ : chr [1:3] "AB" "ABC1" "ABCD1"
>   ..$ : chr [1:4] "geneSymbol" "A" "B" "pvalue"
> > head(d)
>       geneSymbol  A         B         pvalue
> AB    Character,0 Numeric,0 Numeric,0 Numeric,0
> ABC1  Character,0 Numeric,0 Numeric,0 Numeric,0
> ABCD1 Character,0 Numeric,0 Numeric,0 Numeric,0
>
> From: Vincent Carey [mailto:stvjc@channing.harvard.edu]
> Sent: 29 May 2010 0:25
> To: mervi.alanne@wri.fi
> Cc: bioconductor@stat.math.ethz.ch
> Subject: Re: [BioC] finding and deleting repeated observations
>
> Suppose you save your data as in the email to a file b.txt -- I ignore
> niceties of delimiter choice.
>
> There are many ways of doing it, but here is one possibility:
>
> > bb = read.table("b.txt", h=TRUE, colClasses=c("character", "numeric",
>     "numeric", "numeric"))
> > bbs = split(bb, bb[,1])
> > t(sapply(bbs, function(x) x[which.min(x$pvalue),]))
>       GeneSymbol A  B  pvalue
> AB    "AB"       4  55 0.2
> ABC1  "ABC1"     12 44 0.01
> ABCD1 "ABCD1"    9  18 1e-04
>
> It does what you ask, but the solution you gave below doesn't seem right
> (picked wrong values of A and B for correct ABC1 candidate?)
>
> On Fri, May 28, 2010 at 1:27 PM, mervi.alanne@wri.fi <mervi.alanne@wri.fi> wrote:
>
> > Dear all,
> >
> > I'm a novice with R and could use some help. How could I find repeated
> > observations based on one column and select the one to keep based on
> > another column?
> >
> > In more detail, this is the thing I want to achieve:
> > - data.frame has 4 columns: GeneSymbol, A, B, pvalue
> > - data in column GeneSymbol may be repeated 1-6 times
> > - data also contains unique observations
> > - Of the repeated obs, keep the obs which has the lowest pvalue
> > - Do not discard data from cols A and B
> >
> > Example input data:
> > GeneSymbol A  B  pvalue
> > ABC1       12 44 0.01
> > ABC1       2  32 0.05
> > AB         4  55 0.2
> > ABCD1      15 25 0.005
> > ABCD1      11 27 0.002
> > ABCD1      9  18 0.0001
> >
> > I'd like the output to look like this:
> > GeneSymbol A  B  pvalue
> > ABC1       2  32 0.01
> > AB         4  55 0.2
> > ABCD1      9  18 0.0001
> >
> > Any suggestions?
> >
> > -Mervi
> > Wihuri Research Institute
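To make the three steps in the exchange above explicit (split by gene symbol, pick the row with the smallest p-value in each piece, stack the pieces back together), here is a minimal sketch. It builds the example data inline rather than reading a file, and it swaps t(sapply(...)) for do.call(rbind, lapply(...)), which is my substitution and not the code from the thread: the sapply form returns a list-mode matrix, whereas the rbind form should return an ordinary data frame. The variable names best and result are illustrative only.

```r
# Example data from the question, built inline (no input file assumed)
bb <- data.frame(
  GeneSymbol = c("ABC1", "ABC1", "AB", "ABCD1", "ABCD1", "ABCD1"),
  A          = c(12, 2, 4, 15, 11, 9),
  B          = c(44, 32, 55, 25, 27, 18),
  pvalue     = c(0.01, 0.05, 0.2, 0.005, 0.002, 0.0001),
  stringsAsFactors = FALSE
)

# Step 1: split() returns a named list with one data frame per gene symbol
bbs <- split(bb, bb$GeneSymbol)

# Step 2: within each piece, which.min() gives the position of the smallest
# p-value, and x[pos, ] keeps that whole row (so A and B travel with it)
best <- lapply(bbs, function(x) x[which.min(x$pvalue), ])

# Step 3: stack the one-row data frames back into a single data frame
result <- do.call(rbind, best)
print(result)

# Why the 'originalpvalue' variant produced empty cells: a missing column
# is NULL, and which.min(NULL) has length zero, so no row is selected
length(which.min(bb$originalpvalue))
```

The groups come back in the sorted order of the gene symbols, not in the order they first appear in the input.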
Scott Ochsner ▴ 300
@scott-ochsner-599
Last seen 10.3 years ago
Hi Mervi,

One solution is to order your data frame by "pvalue" using the order function and then to remove duplicate "GeneSymbol" entries using !duplicated.

> A<-c(12,2,4,15,11,9)
> B<-c(44,32,55,25,27,18)
> pvalue<-c(.01,.05,.2,.005,.002,.0001)
> GeneSymbol<-c(rep("ABC1",2),"AB",rep("ABCD1",3))
> tmp<-as.data.frame(cbind(A,B,pvalue))
> tmp<-cbind(GeneSymbol,tmp)
> tmp
  GeneSymbol  A  B pvalue
1       ABC1 12 44  1e-02
2       ABC1  2 32  5e-02
3         AB  4 55  2e-01
4      ABCD1 15 25  5e-03
5      ABCD1 11 27  2e-03
6      ABCD1  9 18  1e-04

## reorder your data frame by pvalue
> tmp.ordered <- tmp[order(tmp$pvalue),]
> tmp.ordered
  GeneSymbol  A  B pvalue
6      ABCD1  9 18  1e-04
5      ABCD1 11 27  2e-03
4      ABCD1 15 25  5e-03
1       ABC1 12 44  1e-02
2       ABC1  2 32  5e-02
3         AB  4 55  2e-01

## select the first instance of each gene symbol and remove all others.
## Because you have ordered by pvalue, you will automatically select the
## row with the lowest pvalue for each gene symbol.
> tmp.sub<- tmp.ordered[!duplicated(tmp.ordered$GeneSymbol),]
> tmp.sub
  GeneSymbol  A  B pvalue
6      ABCD1  9 18  1e-04
1       ABC1 12 44  1e-02
3         AB  4 55  2e-01

## reorder your data frame as before using the rownames.
## rownames are character, so convert them to numbers first; otherwise
## "10" would sort before "2" in a larger data frame.
> tmp.sub<-tmp.sub[order(as.numeric(rownames(tmp.sub))),]
> tmp.sub
  GeneSymbol  A  B pvalue
1       ABC1 12 44  1e-02
3         AB  4 55  2e-01
6      ABCD1  9 18  1e-04

Scott

Scott A. Ochsner, PhD
One Baylor Plaza BCM130, Houston, TX 77030
Voice: (713) 798-6227 Fax: (713) 790-1275
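The order-then-deduplicate idea from this answer can also be written on row positions instead of row names, which sidesteps sorting character row names altogether. The following is only a sketch under that assumption; the names o, keep and tmp.sub2 are illustrative and do not appear in the thread.

```r
# Example data from the question
tmp <- data.frame(
  GeneSymbol = c("ABC1", "ABC1", "AB", "ABCD1", "ABCD1", "ABCD1"),
  A          = c(12, 2, 4, 15, 11, 9),
  B          = c(44, 32, 55, 25, 27, 18),
  pvalue     = c(0.01, 0.05, 0.2, 0.005, 0.002, 0.0001),
  stringsAsFactors = FALSE
)

# Row positions sorted by increasing p-value
o <- order(tmp$pvalue)

# Walking the rows in that order, the first time a gene symbol is seen
# is its lowest p-value; duplicated() flags every later occurrence
keep <- o[!duplicated(tmp$GeneSymbol[o])]

# sort(keep) restores the original row order of the surviving rows
tmp.sub2 <- tmp[sort(keep), ]
print(tmp.sub2)
```

Because the positions are integers, sort(keep) orders them numerically, so the approach should behave the same for data frames with ten or more rows.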