finding and deleting repeated observations
Mervi Alanne ▴ 70
@mervi-alanne-3731
Dear all,

I'm a novice with R and could use some help. How could I find repeated observations based on one column and select the one to keep based on another column?

In more detail, this is the thing I want to achieve:
-data.frame has 4 columns: GeneSymbol, A, B, pvalue
-data in column GeneSymbol may be repeated 1-6 times
-data also contains unique observations
-Of the repeated obs, keep the obs which has the lowest pvalue
-Do not discard data from cols A and B

Example input data:
GeneSymbol  A   B   pvalue
ABC1        12  44  0.01
ABC1        2   32  0.05
AB          4   55  0.2
ABCD1       15  25  0.005
ABCD1       11  27  0.002
ABCD1       9   18  0.0001

I'd like the output to look like this:
GeneSymbol  A   B   pvalue
ABC1        2   32  0.01
AB          4   55  0.2
ABCD1       9   18  0.0001

Any suggestions?

-Mervi
Wihuri Research Institute
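For readers who want to try this without a file, the following is a minimal base-R sketch of one possible approach (sort by p-value, then drop later duplicates). The object names `dd`, `sorted` and `best` are made up for illustration, and the data frame simply re-types the example input above.

```r
# Example input from the question, built inline so the sketch is self-contained.
dd <- data.frame(
  GeneSymbol = c("ABC1", "ABC1", "AB", "ABCD1", "ABCD1", "ABCD1"),
  A = c(12, 2, 4, 15, 11, 9),
  B = c(44, 32, 55, 25, 27, 18),
  pvalue = c(0.01, 0.05, 0.2, 0.005, 0.002, 0.0001),
  stringsAsFactors = FALSE
)

# Sort by p-value so the smallest p-value of each symbol comes first,
# then keep only the first occurrence of every symbol.
sorted <- dd[order(dd$pvalue), ]
best <- sorted[!duplicated(sorted$GeneSymbol), ]

# Optional: restore the order in which the symbols first appear in the input.
best <- best[order(match(best$GeneSymbol, unique(dd$GeneSymbol))), ]
rownames(best) <- NULL
print(best)
```

Whole rows are kept, so A and B always come from the same observation as the retained p-value. For ABC1 that is the row 12, 44, 0.01, since the row with A = 2 and B = 32 has p-value 0.05 in the input.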
@vincent-j-carey-jr-4
United States
Suppose you save your data, as given in the email, to a file b.txt (I ignore niceties of delimiter choice). There are many ways of doing it, but here is one possibility:

> bb = read.table("b.txt", h=TRUE, colClasses=c("character", "numeric", "numeric", "numeric"))
> bbs = split(bb, bb[,1])
> t(sapply(bbs, function(x) x[which.min(x$pvalue),]))
      GeneSymbol A  B  pvalue
AB    "AB"       4  55 0.2
ABC1  "ABC1"     12 44 0.01
ABCD1 "ABCD1"    9  18 1e-04

It does what you ask, but the desired output you gave below doesn't seem right (it pairs the wrong values of A and B with the correct ABC1 p-value?).

On Fri, May 28, 2010 at 1:27 PM, mervi.alanne@wri.fi wrote:
> [...]
> I'd like the output to look like this:
> GeneSymbol A B pvalue
> ABC1 2 32 0.01
> AB 4 55 0.2
> ABCD1 9 18 0.0001
> [...]
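The three steps in this answer are: split() cuts the table into one small data frame per gene symbol, which.min() returns the position of the smallest p-value inside each piece, and the selected rows are then collected again. As a hedged variation (not the code posted above), the sketch below collects the rows with do.call(rbind, lapply(...)) so that the result is an ordinary data frame rather than the list-mode matrix that t(sapply(...)) produces. The inline data frame replaces the b.txt file and is an assumption made to keep the sketch self-contained.

```r
# Stand-in for read.table("b.txt", ...): the example data typed inline.
bb <- data.frame(
  GeneSymbol = c("ABC1", "ABC1", "AB", "ABCD1", "ABCD1", "ABCD1"),
  A = c(12, 2, 4, 15, 11, 9),
  B = c(44, 32, 55, 25, 27, 18),
  pvalue = c(0.01, 0.05, 0.2, 0.005, 0.002, 0.0001),
  stringsAsFactors = FALSE
)

# Step 1: one data frame per gene symbol, returned as a named list.
bbs <- split(bb, bb$GeneSymbol)

# Step 2: within each piece, keep the row at the position of the smallest p-value.
picked <- lapply(bbs, function(x) x[which.min(x$pvalue), ])

# Step 3: stack the one-row data frames back into a single data frame.
res <- do.call(rbind, picked)
rownames(res) <- NULL
str(res)
```

The groups come back in the sorted order of the symbols (AB, ABC1, ABCD1), as in the output shown in the answer, because split() treats the grouping column as a factor.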
Hi,

Thanks for helping me out. However, I couldn't get the script to work. Below is the description.

How does the t(sapply(...)) script select the minimum p-value? I understand that the split creates a list where each occurring geneSymbol is present in a separate data frame. How does the script then compare the p-values within each frame and merge the data back into a single data frame?

-Mervi

> dd <- read.table("Myfile", sep='\t', h=T, as.is=T, colClasses=c("character","numeric","numeric","numeric"))
> str(dd)
'data.frame': 6 obs. of 4 variables:
 $ geneSymbol: chr "ABC1" "ABC1" "AB" "ABCD1" ...
 $ A         : num 12 2 4 15 11 9
 $ B         : num 44 32 55 25 27 18
 $ pvalue    : num 1e-02 5e-02 2e-01 5e-03 2e-03 1e-04
> bb <- dd
> bbs <- split(bb,bb[,1])
> d <- t(sapply(bbs, function(x)x[which.min(x$originalpvalue),]))
> str(d)
List of 12
 $ : chr(0)
 $ : chr(0)
 $ : chr(0)
 $ : num(0)
 $ : num(0)
 $ : num(0)
 $ : num(0)
 $ : num(0)
 $ : num(0)
 $ : num(0)
 $ : num(0)
 $ : num(0)
 - attr(*, "dim")= int [1:2] 3 4
 - attr(*, "dimnames")=List of 2
  ..$ : chr [1:3] "AB" "ABC1" "ABCD1"
  ..$ : chr [1:4] "geneSymbol" "A" "B" "pvalue"
> head(d)
      geneSymbol  A         B         pvalue
AB    Character,0 Numeric,0 Numeric,0 Numeric,0
ABC1  Character,0 Numeric,0 Numeric,0 Numeric,0
ABCD1 Character,0 Numeric,0 Numeric,0 Numeric,0

From: Vincent Carey [mailto:stvjc@channing.harvard.edu]
Sent: 29 May 2010 0:25
To: mervi.alanne@wri.fi
Cc: bioconductor@stat.math.ethz.ch
Subject: Re: [BioC] finding and deleting repeated observations

[...]
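Judging only from the transcript above, a likely cause of the empty result is the column name: str(dd) lists a column called pvalue, while the function passed to sapply() reads x$originalpvalue. The sketch below, with the data typed inline instead of read from "Myfile" (an assumption made for self-containment), shows what R does with a column that does not exist and what the same pipeline returns once the existing name is used.

```r
# Same columns as in str(dd) above, typed inline instead of read from a file.
dd <- data.frame(
  geneSymbol = c("ABC1", "ABC1", "AB", "ABCD1", "ABCD1", "ABCD1"),
  A = c(12, 2, 4, 15, 11, 9),
  B = c(44, 32, 55, 25, 27, 18),
  pvalue = c(0.01, 0.05, 0.2, 0.005, 0.002, 0.0001),
  stringsAsFactors = FALSE
)

# `$` on a column that does not exist gives NULL, without an error.
missing_col <- dd$originalpvalue
print(is.null(missing_col))

# which.min(NULL) is an empty index, and an empty index selects zero rows.
idx <- which.min(missing_col)
print(length(idx))
print(nrow(dd[idx, ]))

# With the column name that exists, each group contributes exactly one row.
bbs <- split(dd, dd$geneSymbol)
d <- do.call(rbind, lapply(bbs, function(x) x[which.min(x$pvalue), ]))
print(d)
```

This also covers the "how" part of the question: the anonymous function runs once per data frame in the split list, which.min() picks the row position of the smallest p-value within that frame, and the selected rows are bound together again at the end (by sapply() plus t() in the posted script, by rbind() in this sketch).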