finding and deleting repeated observations


Mervi Alanne ▴ 70

@mervi-alanne-3731

Last seen 11.4 years ago

Dear all,

I'm a novice with R and could use some help. How could I find repeated observations based on one column and select the one to keep based on another column?

In more detail, this is the thing I want to achieve:
-data.frame has 4 columns GeneSymbol, A, B, pvalue
-data in column GeneSymbol may be repeated 1-6 times
-data also contains unique observations
-Of the repeated obs, keep the obs which has the lowest pvalue
-Do not discard data from cols A and B

Example input data:
GeneSymbol  A   B   pvalue
ABC1        12  44  0.01
ABC1        2   32  0.05
AB          4   55  0.2
ABCD1       15  25  0.005
ABCD1       11  27  0.002
ABCD1       9   18  0.0001

I'd like the output to look like this:
GeneSymbol  A   B   pvalue
ABC1        2   32  0.01
AB          4   55  0.2
ABCD1       9   18  0.0001

Any suggestions?

-Mervi
Wihuri Research Institute

• 1.1k views

ADD COMMENT • link updated 15.7 years ago by Vincent J. Carey, Jr. 6.7k • written 15.7 years ago by Mervi Alanne ▴ 70


Vincent J. Carey, Jr. 6.7k

@vincent-j-carey-jr-4

Last seen 1 day ago

United States

Suppose you save your data, as given in the email, to a file b.txt (I am ignoring the niceties of delimiter choice). There are many ways of doing this, but here is one possibility:

> bb = read.table("b.txt", h=TRUE, colClasses=c("character", "numeric", "numeric", "numeric"))
> bbs = split(bb, bb[,1])
> t(sapply(bbs, function(x) x[which.min(x$pvalue),]))
      GeneSymbol A  B  pvalue
AB    "AB"       4  55 0.2
ABC1  "ABC1"     12 44 0.01
ABCD1 "ABCD1"    9  18 1e-04

It does what you ask, but the desired output you gave in your message doesn't seem right: it appears to pick the wrong values of A and B for the correct ABC1 candidate (the ABC1 row with pvalue 0.01 has A = 12 and B = 44, not 2 and 32).

In reply to mervi.alanne@wri.fi, Fri, May 28, 2010 at 1:27 PM.
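For readers who want to try this without creating b.txt, here is a minimal sketch of the same split() / which.min() idea with the example data typed inline. It is not the code from the reply: it swaps t(sapply(...)) for do.call(rbind, lapply(...)), a substitution made here so that the result is an ordinary data frame rather than a matrix of list cells. The names best and result are illustrative only.

```r
# Example data from the question, typed inline
# (assumption: this stands in for reading b.txt)
bb <- data.frame(
  GeneSymbol = c("ABC1", "ABC1", "AB", "ABCD1", "ABCD1", "ABCD1"),
  A          = c(12, 2, 4, 15, 11, 9),
  B          = c(44, 32, 55, 25, 27, 18),
  pvalue     = c(0.01, 0.05, 0.2, 0.005, 0.002, 0.0001),
  stringsAsFactors = FALSE
)

# One data frame per distinct gene symbol
bbs <- split(bb, bb$GeneSymbol)

# Within each piece, keep the single row with the smallest p-value;
# the whole row is kept, so A and B are not discarded
best <- lapply(bbs, function(x) x[which.min(x$pvalue), ])

# Stack the one-row pieces back into a single data frame
result <- do.call(rbind, best)
print(result)
```

Because each element of best is a complete one-row data frame, the A and B values always come from the same row as the selected p-value.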

ADD COMMENT • link 15.7 years ago Vincent J. Carey, Jr. 6.7k

0

Entering edit mode

Hi,

Thanks for helping me out. However, I couldn't get the script to work. Below is the description.

How does the t(sapply(...)) script select the minimum p-value? I understand that split creates a list in which each occurring geneSymbol is present in a separate data frame. How does the script then compare the p-values within each data frame and merge the data back into a single data frame?

-Mervi

> dd <- read.table("Myfile", sep='\t', h=T, as.is=T, colClasses=c("character","numeric","numeric","numeric"))
> str(dd)
'data.frame': 6 obs. of 4 variables:
 $ geneSymbol: chr "ABC1" "ABC1" "AB" "ABCD1" ...
 $ A         : num 12 2 4 15 11 9
 $ B         : num 44 32 55 25 27 18
 $ pvalue    : num 1e-02 5e-02 2e-01 5e-03 2e-03 1e-04
> bb<- dd
> bbs <- split(bb,bb[,1])
> d<- t(sapply(bbs, function(x)x[which.min(x$originalpvalue),]))
> str(d)
List of 12
 $ : chr(0)
 $ : chr(0)
 $ : chr(0)
 $ : num(0)
 $ : num(0)
 $ : num(0)
 $ : num(0)
 $ : num(0)
 $ : num(0)
 $ : num(0)
 $ : num(0)
 $ : num(0)
 - attr(*, "dim")= int [1:2] 3 4
 - attr(*, "dimnames")=List of 2
  ..$ : chr [1:3] "AB" "ABC1" "ABCD1"
  ..$ : chr [1:4] "geneSymbol" "A" "B" "pvalue"
> head(d)
      geneSymbol  A         B         pvalue
AB    Character,0 Numeric,0 Numeric,0 Numeric,0
ABC1  Character,0 Numeric,0 Numeric,0 Numeric,0
ABCD1 Character,0 Numeric,0 Numeric,0 Numeric,0

In reply to Vincent Carey (stvjc@channing.harvard.edu), sent 29 May 2010 0:25, Subject: Re: [BioC] finding and deleting repeated observations.
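A likely cause of the empty result in the transcript above, offered as a sketch rather than a confirmed diagnosis: the anonymous function refers to x$originalpvalue, while str(dd) shows that the column is called pvalue. On the mechanics asked about: sapply() calls the function once per piece of the split list, which.min() returns the position of the smallest p-value in that piece, x[i, ] extracts that whole row, and t() turns the collected rows so that each gene is one row. The example below types the data inline (an assumption, standing in for "Myfile"), and the order()/duplicated() lines at the end are an alternative that does not come from the thread; sorted and keep are illustrative names.

```r
# Data as shown in the str(dd) output, typed inline
dd <- data.frame(
  geneSymbol = c("ABC1", "ABC1", "AB", "ABCD1", "ABCD1", "ABCD1"),
  A          = c(12, 2, 4, 15, 11, 9),
  B          = c(44, 32, 55, 25, 27, 18),
  pvalue     = c(1e-02, 5e-02, 2e-01, 5e-03, 2e-03, 1e-04),
  stringsAsFactors = FALSE
)

# There is no column named originalpvalue, so $ returns NULL
missing_col <- dd$originalpvalue
print(is.null(missing_col))

# which.min() of NULL is an empty index, and indexing rows with an
# empty index gives a zero-row data frame (hence chr(0) and num(0))
empty_idx <- which.min(missing_col)
print(nrow(dd[empty_idx, ]))

# With the real column name, which.min() returns a usable row position
print(which.min(dd$pvalue))

# Alternative without split(): sort by p-value, then keep only the
# first (smallest p-value) row for each gene symbol
sorted <- dd[order(dd$pvalue), ]
keep   <- sorted[!duplicated(sorted$geneSymbol), ]
print(keep)
```

Replacing x$originalpvalue with x$pvalue in the original t(sapply(...)) call should therefore give non-empty rows; the order()/duplicated() variant returns a plain data frame sorted by p-value.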

ADD REPLY • link 15.7 years ago Mervi Kinnunen ▴ 10
