take the average log fc for each gene

0

Entering edit mode

Helen Smith ▴ 100

@helen-smith-6087

Last seen 9.6 years ago

Hi All. Apologies as I'm just getting to grips with R. I have a set of genes and log fold changes. As the genes have been converted from Affymetrix probes, and up to 11 probes represent one gene, I have a range of different log fold changes for each gene. I would like to take the average log fc for each gene when duplicated in the list. I used the script below but as I'm a bit of a novice it isn't working too well and I get the error message stated at the bottom: dat<-read.table("test.txt") dim(dat) dat[1:664,1:2] Gene<-dat[,1] fc<-dat[,2] LogFC<-matrix(NA,664,1) for(i in 1:664){ for(j in 1:1){ LogFC[i]<-fc[i] } } fc[1:664,1] ####Take Average logfc of multiples of the same gene#### gid<-unique(Gene) length(gid) mGene<-matrix(NA,640,1) mGene for(i in 1:640){ rid<-which(Gene==gid[i]) for(j in 1:1){ mGene[i]<-mean(fc[rid,j]) } } ####I get the error message "Error in fc[rid, j] : incorrect number of dimensions" Any help would be much appriciated, Many thanks everyone! [[alternative HTML version deleted]]

• 1.2k views

ADD COMMENT • link updated 10.7 years ago by James W. MacDonald 65k • written 10.7 years ago by Helen Smith ▴ 100

0

Entering edit mode

James W. MacDonald 65k

@james-w-macdonald-5106

Last seen 9 hours ago

United States

Hi Helen, On 8/12/2013 2:55 PM, Helen Smith wrote: > Hi All. > > Apologies as I'm just getting to grips with R. > > I have a set of genes and log fold changes. > As the genes have been converted from Affymetrix probes, and up to 11 probes represent one gene, I have a range of different log fold changes for each gene. I would like to take the average log fc for each gene when duplicated in the list. This part isn't quite clear to me. I can't tell if you are just summarizing the data using RMA or something similar and are then trying to collapse the data over duplicated genes, or if you are doing something at the probe level. I think you are trying to do the latter, but will you please let us know before we talk about your code below? Best, Jim > > I used the script below but as I'm a bit of a novice it isn't working too well and I get the error message stated at the bottom: > > dat<-read.table("test.txt") > dim(dat) > dat[1:664,1:2] > > Gene<-dat[,1] > fc<-dat[,2] > > LogFC<-matrix(NA,664,1) > for(i in 1:664){ > for(j in 1:1){ > LogFC[i]<-fc[i] > } > } > fc[1:664,1] > > ####Take Average logfc of multiples of the same gene#### > gid<-unique(Gene) > length(gid) > mGene<-matrix(NA,640,1) > mGene > for(i in 1:640){ > rid<-which(Gene==gid[i]) > for(j in 1:1){ > mGene[i]<-mean(fc[rid,j]) > } > } > ####I get the error message "Error in fc[rid, j] : incorrect number of dimensions" > > Any help would be much appriciated, > > Many thanks everyone! > > [[alternative HTML version deleted]] > > _______________________________________________ > Bioconductor mailing list > Bioconductor at r-project.org > https://stat.ethz.ch/mailman/listinfo/bioconductor > Search the archives: http://news.gmane.org/gmane.science.biology.informatics.conductor -- James W. MacDonald, M.S. Biostatistician University of Washington Environmental and Occupational Health Sciences 4225 Roosevelt Way NE, # 100 Seattle WA 98105-6099

ADD COMMENT • link 10.7 years ago James W. MacDonald 65k

0

Entering edit mode

James W. MacDonald 65k

@james-w-macdonald-5106

Last seen 9 hours ago

United States

Hi Helen, Please don't take conversations off-list. We like to think of the list archives as a searchable repository of knowledge. On 8/13/2013 5:39 AM, Helen Smith wrote: > Hi James, > > I used RMA then annotated the probes with entrez IDs, to which I had several probes with different logfc's representing the same gene. Therefore I am now just trying to collapse the gene names and take the average fc. OK, that's what I thought. So for now we will set aside any concerns that the 'duplicate' probesets might in fact be intended to (or actually do) measure transcript variants, etc. For this case, I am a big fan of the list structure in R, because it makes fast simple iteration possible, and you don't have to do crazy loops and whatnot. The downside is the rather opaque way the various *apply functions work, and the fairly obscure help pages. When I was first figuring out how these things work it was a bit on the painful side. So an example. Let's take all your data and create a list in which each list item contains the data for a given gene. We can then iterate over that list and compute the fold change for each gene. We will then 'flatten' the list back out into a matrix-like structure. From your code below I am going to assume that the Entrez Gene ID is in the first column, and the fold change is in the second. I will ignore all other columns that might exist. thelst <- tapply(1:nrow(dat), dat[,1], function(x) dat[x,2]) The tapply function takes three main arguments. The thing you want to iterate over, a factor that you want to control the iteration, and a function. You could use by() for this as well, but I don't like the structure that is returned, so I tend to use tapply(). Basically what we are doing is creating a vector of numbers that correspond to the rows of your data.frame, then using the second argument to separate these row numbers into groups according to the Entrez Gene ID, and then using the (anonymous) function to select out the fold changes that correspond to each gene. This will give us a list that has the Entrez Gene IDs as the names of each list item, and the fold changes in each list. We can now use sapply() to iterate over each list item and return a named vector: thefcs <- sapply(thelst, mean) We could also be trickier and do in one fell swoop: thefcs <- tapply(1:nrow(dat), dat[,1], function(x) mean(dat[x,2])) Best, Jim > > I hope this is clearer, > Many thanks, > Helen > > -----Original Message----- > From: James W. MacDonald [mailto:jmacdon at uw.edu] > Sent: 12 August 2013 20:15 > To: Helen Smith > Cc: bioconductor at r-project.org > Subject: Re: [BioC] take the average log fc for each gene > > Hi Helen, > > On 8/12/2013 2:55 PM, Helen Smith wrote: >> Hi All. >> >> Apologies as I'm just getting to grips with R. >> >> I have a set of genes and log fold changes. >> As the genes have been converted from Affymetrix probes, and up to 11 probes represent one gene, I have a range of different log fold changes for each gene. I would like to take the average log fc for each gene when duplicated in the list. > This part isn't quite clear to me. I can't tell if you are just summarizing the data using RMA or something similar and are then trying to collapse the data over duplicated genes, or if you are doing something at the probe level. > > I think you are trying to do the latter, but will you please let us know before we talk about your code below? > > Best, > > Jim > > > >> I used the script below but as I'm a bit of a novice it isn't working too well and I get the error message stated at the bottom: >> >> dat<-read.table("test.txt") >> dim(dat) >> dat[1:664,1:2] >> >> Gene<-dat[,1] >> fc<-dat[,2] >> >> LogFC<-matrix(NA,664,1) >> for(i in 1:664){ >> for(j in 1:1){ >> LogFC[i]<-fc[i] >> } >> } >> fc[1:664,1] >> >> ####Take Average logfc of multiples of the same gene#### >> gid<-unique(Gene) >> length(gid) >> mGene<-matrix(NA,640,1) >> mGene >> for(i in 1:640){ >> rid<-which(Gene==gid[i]) >> for(j in 1:1){ >> mGene[i]<-mean(fc[rid,j]) >> } >> } >> ####I get the error message "Error in fc[rid, j] : incorrect number of dimensions" >> >> Any help would be much appriciated, >> >> Many thanks everyone! >> >> [[alternative HTML version deleted]] >> >> _______________________________________________ >> Bioconductor mailing list >> Bioconductor at r-project.org >> https://stat.ethz.ch/mailman/listinfo/bioconductor >> Search the archives: >> http://news.gmane.org/gmane.science.biology.informatics.conductor > -- > James W. MacDonald, M.S. > Biostatistician > University of Washington > Environmental and Occupational Health Sciences > 4225 Roosevelt Way NE, # 100 > Seattle WA 98105-6099 > -- James W. MacDonald, M.S. Biostatistician University of Washington Environmental and Occupational Health Sciences 4225 Roosevelt Way NE, # 100 Seattle WA 98105-6099

ADD COMMENT • link 10.7 years ago James W. MacDonald 65k

Login before adding your answer.