Reading Affy CEL files

0

Entering edit mode

Ranjani R ▴ 10

@ranjani-r-5967

Last seen 10.6 years ago

Ranjani R [guest] <guest at="" ...=""> writes: > > > I am a newbie to Affy. Thanks for your help. > > I am processing CEL files through R (Affy package) and am having some basic issues that I am not finding > satisfactory answers to (have googled). > The chip used is hugene11stv1. I also am using the hugene11stprobeset.db to try to do probeset ???> > Symbol translation. > Essentially, I want to create a file with gene expression data, with genes * samples as my final matrix. > > Code: > setwd(wDir); > Data <- ReadAffy(); > eset <- rma(Data); > write.exprs(eset,file="geneExpData.txt", sep="\t", quote = F); > > When I analyze the file written, I see that the number of columns is as I expect(number samples) but there are > 33,297 genes. > Please help me understand a few fundamental aspects here: > > 1. I tried translating these Affy IDs to gene symbols to see if that would make my analysis easier. > Here are some things I tried > > Try 1: > symbols <- getSYMBOL(as.character(expr.matrix[,1]), "hugene11stprobeset"); ???> Not quite > working. Only ~175 of the probeset IDs are getting translated. > Try 2: > symbs <- mget(featureNames(eset), hugene11stprobesetSYMBOL, ifnotfound =NA); > symbs <- unlist(symbs) > mat <- eset; # make a copy > featureNames(mat) <- ifelse(!is.na(symbs), symbs, featureNames(mat)) > > Many NAs. > > Can you please help me understand what is happening here. > > -- output of sessionInfo(): > > R version 2.15.3 (2013-03-01) > Platform: x86_64-unknown-linux-gnu (64-bit) > > locale: > [1] LC_CTYPE=en_US.UTF-8 LC_NUMERIC=C > [3] LC_TIME=en_US.UTF-8 LC_COLLATE=en_US.UTF-8 > [5] LC_MONETARY=en_US.UTF-8 LC_MESSAGES=en_US.UTF-8 > [7] LC_PAPER=C LC_NAME=C > [9] LC_ADDRESS=C LC_TELEPHONE=C > [11] LC_MEASUREMENT=en_US.UTF-8 LC_IDENTIFICATION=C > > attached base packages: > [1] stats graphics grDevices utils datasets methods base > > other attached packages: > [1] hugene11stv1cdf_2.3.0 affy_1.36.1 Biobase_2.18.0 > [4] BiocGenerics_0.4.0 > > loaded via a namespace (and not attached): > [1] affyio_1.26.0 BiocInstaller_1.8.3 preprocessCore_1.20.0 > [4] tools_2.15.3 zlibbioc_1.4.0 > > -- > Sent via the guest posting facility at bioconductor.org. > > _______________________________________________ > Bioconductor mailing list > Bioconductor at ... > https://stat.ethz.ch/mailman/listinfo/bioconductor > Search the archives: http://news.gmane.org/gmane.science.biology.informatics.conductor > > Hi, Thanks for the response. I tried using oligo + hugene11sttranscriptcluster.db to annotate my data. Here is what I've tried to do (after some googling). With the code I executed, 33,197 genes got collapsed to about ~21,000+ genes. I had some questions: I found no vignette for how to do this 'summing up of probesets'. Is this approach taken right (code attached)? If anybody has a better approach/code snippet they can share - that would be great too. Essentially, I need to take the data, run RMA on it - so that I can then do my differential expression analysis (amongst other approaches) Can you please guide me here. I have attached my code snippets below. Thanks much! Ranjani _________________________________________ Code executed: -------------- 1. Convert the probeset to gene Names gMap <- hugene11sttranscriptclusterSYMBOL mProbes <- mappedkeys(gMap) probNames <- rownames(expr.matrix) gMap1 <- as.list(gMap[mProbes]) 2. Get an expression matrix of mapped probes mappedProbes <- sort(which( (probNames %in% names(gMap1)) == T)) any( mappedProbes != sort(mappedProbes)) # FALSE notMappedProbes <- which( (probNames %in% names(gMap1)) == F) mappedNames <- as.character(gMap1[ which(names(gMap1) %in% rownames(expr.matrix)[mappedProbes]) ]) rownames(expr.matrix)[mappedProbes] <- mappedNames expr.matrix <- expr.matrix[-notMappedProbes,,drop=F] 3. Get unique genes. Here, if a gene occurs more than once, I take the mean expression value (is this right)? # get unique genes matrix by taking mean expression of gene names that appear more than one time gn.nms <- rownames(expr.matrix) <- toupper(rownames(expr.matrix)) expr.matrix.unique <- matrix(0, nrow = length(unique(gn.nms)), ncol = ncol(expr.matrix)) seen.gns <- character(); cnt <- 0; for (i in 1:dim(expr.matrix)[1]){ cr.gn <- gn.nms[i] if cr.gn %in% seen.gns){ next } else{ seen.gns <- c(seen.gns,cr.gn) ix <- which(gn.nms %in% cr.gn) if(length(ix)==0) { stop("length(ix) = 0 at i=",ix[1]," gene name is:", cr.gn) } if(length(ix)>1){ cnt <- cnt + 1; print("Taking column means for gene"); printcr.gn); expr.matrix.unique[cnt,] <- colMeans(d[ix,]) } else { cnt <- cnt + 1 expr.matrix.unique[cnt,] <- d[ix,] } } } rownames(expr.matrix.unique) <- seen.gns 4. Now expr.matrix.uniqe should have what I want. Genes * Samples.

annotate affy convert oligo annotate affy convert oligo • 1.3k views

ADD COMMENT • link updated 11.9 years ago by Sidders, Benjamin ▴ 40 • written 11.9 years ago by Ranjani R ▴ 10

0

Entering edit mode

James W. MacDonald 68k

@james-w-macdonald-5106

Last seen 1 day ago

United States

Hi Ranjani, On 6/3/2013 10:52 AM, Ranjani R wrote: > Ranjani R [guest]<guest at="" ...=""> writes: > >> >> I am a newbie to Affy. Thanks for your help. >> >> I am processing CEL files through R (Affy package) and am having some > basic issues that I am not finding >> satisfactory answers to (have googled). >> The chip used is hugene11stv1. I also am using the hugene11stprobeset.db > to try to do probeset ???> >> Symbol translation. >> Essentially, I want to create a file with gene expression data, with > genes * samples as my final matrix. >> Code: >> setwd(wDir); >> Data<- ReadAffy(); >> eset<- rma(Data); >> write.exprs(eset,file="geneExpData.txt", sep="\t", quote = F); >> >> When I analyze the file written, I see that the number of columns is as I > expect(number samples) but there are >> 33,297 genes. >> Please help me understand a few fundamental aspects here: >> >> 1. I tried translating these Affy IDs to gene symbols to see if that would > make my analysis easier. >> Here are some things I tried >> >> Try 1: >> symbols<- getSYMBOL(as.character(expr.matrix[,1]), > "hugene11stprobeset"); ???> Not quite >> working. Only ~175 of the probeset IDs are getting translated. >> Try 2: >> symbs<- mget(featureNames(eset), hugene11stprobesetSYMBOL, ifnotfound > =NA); >> symbs<- unlist(symbs) >> mat<- eset; # make a copy >> featureNames(mat)<- ifelse(!is.na(symbs), symbs, featureNames(mat)) >> >> Many NAs. >> >> Can you please help me understand what is happening here. >> >> -- output of sessionInfo(): >> >> R version 2.15.3 (2013-03-01) >> Platform: x86_64-unknown-linux-gnu (64-bit) >> >> locale: >> [1] LC_CTYPE=en_US.UTF-8 LC_NUMERIC=C >> [3] LC_TIME=en_US.UTF-8 LC_COLLATE=en_US.UTF-8 >> [5] LC_MONETARY=en_US.UTF-8 LC_MESSAGES=en_US.UTF-8 >> [7] LC_PAPER=C LC_NAME=C >> [9] LC_ADDRESS=C LC_TELEPHONE=C >> [11] LC_MEASUREMENT=en_US.UTF-8 LC_IDENTIFICATION=C >> >> attached base packages: >> [1] stats graphics grDevices utils datasets methods base >> >> other attached packages: >> [1] hugene11stv1cdf_2.3.0 affy_1.36.1 Biobase_2.18.0 >> [4] BiocGenerics_0.4.0 >> >> loaded via a namespace (and not attached): >> [1] affyio_1.26.0 BiocInstaller_1.8.3 preprocessCore_1.20.0 >> [4] tools_2.15.3 zlibbioc_1.4.0 >> >> -- >> Sent via the guest posting facility at bioconductor.org. >> >> _______________________________________________ >> Bioconductor mailing list >> Bioconductor at ... >> https://stat.ethz.ch/mailman/listinfo/bioconductor >> Search the archives: > http://news.gmane.org/gmane.science.biology.informatics.conductor >> > > Hi, > Thanks for the response. I tried using oligo + > hugene11sttranscriptcluster.db to annotate my data. Here is what I've tried > to do (after some googling). > > With the code I executed, 33,197 genes got collapsed to about ~21,000+ > genes. > > I had some questions: > I found no vignette for how to do this 'summing up of probesets'. Is this > approach taken right (code attached)? > If anybody has a better approach/code snippet they can share - that would be > great too. I don't have a better approach, as 'better' is dependent on the goals at hand, and what assumptions you are willing to make. I personally tend not to do what you have done, for the simple reason that the array is attempting to measure transcripts (which are different from genes), and two different probesets may well measure differentially spliced transcripts. If you naively collapse to genes, you may well be averaging two very different things together. Or maybe not. I just don't know, and am not willing to make the assumption that it is OK. Anyway, if you really want to collapse to genes (and genes with gene symbols at that), there is already some infrastructure in place that will do something similar. The only difference is that the existing infrastructure will take duplicate Entrez Gene IDs and then select that one with the largest statistic (where you have to decide what that statistic is, but usually one would use a t-stat if you have two groups, or an F-stat if you have more than that). library(genefilter) eset <- nsFilter(eset)$eset ind <- findLargest(eset, <some statistic="" here="">, "hugene11sttranscriptcluster") eset <- eset[ind,] see ?nsFilter and ?findLargest for more information. Best, Jim > > Essentially, I need to take the data, run RMA on it - so that I can then do > my differential expression analysis (amongst other approaches) > > Can you please guide me here. I have attached my code snippets below. > > Thanks much! > > Ranjani > _________________________________________ > > Code executed: > -------------- > > 1. Convert the probeset to gene Names > > gMap<- hugene11sttranscriptclusterSYMBOL > mProbes<- mappedkeys(gMap) > probNames<- rownames(expr.matrix) > gMap1<- as.list(gMap[mProbes]) > > 2. Get an expression matrix of mapped probes > > mappedProbes<- sort(which( (probNames %in% names(gMap1)) == T)) > any( mappedProbes != sort(mappedProbes)) # FALSE > notMappedProbes<- which( (probNames %in% names(gMap1)) == F) > mappedNames<- as.character(gMap1[ which(names(gMap1) %in% > rownames(expr.matrix)[mappedProbes]) ]) > rownames(expr.matrix)[mappedProbes]<- mappedNames > expr.matrix<- expr.matrix[-notMappedProbes,,drop=F] > > 3. Get unique genes. Here, if a gene occurs more than once, I take the mean > expression value (is this right)? > > # get unique genes matrix by taking mean expression of gene names that > appear more than one time > gn.nms<- rownames(expr.matrix)<- toupper(rownames(expr.matrix)) > expr.matrix.unique<- matrix(0, nrow = length(unique(gn.nms)), ncol = > ncol(expr.matrix)) > seen.gns<- character(); > cnt<- 0; > for (i in 1:dim(expr.matrix)[1]){ > cr.gn<- gn.nms[i] > if cr.gn %in% seen.gns){ > next > } else{ > seen.gns<- c(seen.gns,cr.gn) > ix<- which(gn.nms %in% cr.gn) > if(length(ix)==0) { > stop("length(ix) = 0 at i=",ix[1]," gene name is:", > cr.gn) > } > if(length(ix)>1){ > cnt<- cnt + 1; > print("Taking column means for gene"); > printcr.gn); > expr.matrix.unique[cnt,]<- colMeans(d[ix,]) > } else { > cnt<- cnt + 1 > expr.matrix.unique[cnt,]<- d[ix,] > } > } > } > rownames(expr.matrix.unique)<- seen.gns > > > 4. Now expr.matrix.uniqe should have what I want. Genes * Samples. > > _______________________________________________ > Bioconductor mailing list > Bioconductor at r-project.org > https://stat.ethz.ch/mailman/listinfo/bioconductor > Search the archives: http://news.gmane.org/gmane.science.biology.informatics.conductor -- James W. MacDonald, M.S. Biostatistician University of Washington Environmental and Occupational Health Sciences 4225 Roosevelt Way NE, # 100 Seattle WA 98105-6099

ADD COMMENT • link 11.9 years ago James W. MacDonald 68k

0

Entering edit mode

Sidders, Benjamin ▴ 40

@sidders-benjamin-5968

Last seen 9.6 years ago

Cambridge, UK

The oligo package has it's own implementation of RMA (?rma) that takes an argument ("target") which allows you to change the level at which it summarizes probesets on an exon array by either "probeset" (exon level) or "core" (transcript level), as long as you have annotation to support it. Ben On 03/06/2013 15:52, "Ranjani R" <ranjanir at="" uw.edu=""> wrote: > >Ranjani R [guest] <guest at="" ...=""> writes: > >> >> >> I am a newbie to Affy. Thanks for your help. >> >> I am processing CEL files through R (Affy package) and am having some >basic issues that I am not finding >> satisfactory answers to (have googled). >> The chip used is hugene11stv1. I also am using the >>hugene11stprobeset.db >to try to do probeset ???> >> Symbol translation. >> Essentially, I want to create a file with gene expression data, with >genes * samples as my final matrix. >> >> Code: >> setwd(wDir); >> Data <- ReadAffy(); >> eset <- rma(Data); >> write.exprs(eset,file="geneExpData.txt", sep="\t", quote = F); >> >> When I analyze the file written, I see that the number of columns is as >>I >expect(number samples) but there are >> 33,297 genes. >> Please help me understand a few fundamental aspects here: >> >> 1. I tried translating these Affy IDs to gene symbols to see if that >>would >make my analysis easier. >> Here are some things I tried >> >> Try 1: >> symbols <- getSYMBOL(as.character(expr.matrix[,1]), >"hugene11stprobeset"); ???> Not quite >> working. Only ~175 of the probeset IDs are getting translated. >> Try 2: >> symbs <- mget(featureNames(eset), hugene11stprobesetSYMBOL, >>ifnotfound >=NA); >> symbs <- unlist(symbs) >> mat <- eset; # make a copy >> featureNames(mat) <- ifelse(!is.na(symbs), symbs, featureNames(mat)) >> >> Many NAs. >> >> Can you please help me understand what is happening here. >> >> -- output of sessionInfo(): >> >> R version 2.15.3 (2013-03-01) >> Platform: x86_64-unknown-linux-gnu (64-bit) >> >> locale: >> [1] LC_CTYPE=en_US.UTF-8 LC_NUMERIC=C >> [3] LC_TIME=en_US.UTF-8 LC_COLLATE=en_US.UTF-8 >> [5] LC_MONETARY=en_US.UTF-8 LC_MESSAGES=en_US.UTF-8 >> [7] LC_PAPER=C LC_NAME=C >> [9] LC_ADDRESS=C LC_TELEPHONE=C >> [11] LC_MEASUREMENT=en_US.UTF-8 LC_IDENTIFICATION=C >> >> attached base packages: >> [1] stats graphics grDevices utils datasets methods base >> >> other attached packages: >> [1] hugene11stv1cdf_2.3.0 affy_1.36.1 Biobase_2.18.0 >> [4] BiocGenerics_0.4.0 >> >> loaded via a namespace (and not attached): >> [1] affyio_1.26.0 BiocInstaller_1.8.3 preprocessCore_1.20.0 >> [4] tools_2.15.3 zlibbioc_1.4.0 >> >> -- >> Sent via the guest posting facility at bioconductor.org. >> >> _______________________________________________ >> Bioconductor mailing list >> Bioconductor at ... >> https://stat.ethz.ch/mailman/listinfo/bioconductor >> Search the archives: >http://news.gmane.org/gmane.science.biology.informatics.conductor >> >> > > >Hi, >Thanks for the response. I tried using oligo + >hugene11sttranscriptcluster.db to annotate my data. Here is what I've >tried >to do (after some googling). > >With the code I executed, 33,197 genes got collapsed to about ~21,000+ >genes. > >I had some questions: >I found no vignette for how to do this 'summing up of probesets'. Is this >approach taken right (code attached)? >If anybody has a better approach/code snippet they can share - that would >be >great too. > >Essentially, I need to take the data, run RMA on it - so that I can then >do >my differential expression analysis (amongst other approaches) > >Can you please guide me here. I have attached my code snippets below. > >Thanks much! > >Ranjani >_________________________________________ > >Code executed: >-------------- > >1. Convert the probeset to gene Names > >gMap <- hugene11sttranscriptclusterSYMBOL >mProbes <- mappedkeys(gMap) >probNames <- rownames(expr.matrix) >gMap1 <- as.list(gMap[mProbes]) > >2. Get an expression matrix of mapped probes > >mappedProbes <- sort(which( (probNames %in% names(gMap1)) == T)) >any( mappedProbes != sort(mappedProbes)) # FALSE >notMappedProbes <- which( (probNames %in% names(gMap1)) == F) >mappedNames <- as.character(gMap1[ which(names(gMap1) %in% >rownames(expr.matrix)[mappedProbes]) ]) >rownames(expr.matrix)[mappedProbes] <- mappedNames >expr.matrix <- expr.matrix[-notMappedProbes,,drop=F] > >3. Get unique genes. Here, if a gene occurs more than once, I take the >mean >expression value (is this right)? > ># get unique genes matrix by taking mean expression of gene names that >appear more than one time >gn.nms <- rownames(expr.matrix) <- toupper(rownames(expr.matrix)) >expr.matrix.unique <- matrix(0, nrow = length(unique(gn.nms)), ncol = >ncol(expr.matrix)) >seen.gns <- character(); >cnt <- 0; >for (i in 1:dim(expr.matrix)[1]){ > cr.gn <- gn.nms[i] > if cr.gn %in% seen.gns){ > next > } else{ > seen.gns <- c(seen.gns,cr.gn) > ix <- which(gn.nms %in% cr.gn) > if(length(ix)==0) { > stop("length(ix) = 0 at i=",ix[1]," gene name is:", >cr.gn) > } > if(length(ix)>1){ > cnt <- cnt + 1; > print("Taking column means for gene"); > printcr.gn); > expr.matrix.unique[cnt,] <- colMeans(d[ix,]) > } else { > cnt <- cnt + 1 > expr.matrix.unique[cnt,] <- d[ix,] > } > } >} >rownames(expr.matrix.unique) <- seen.gns > > >4. Now expr.matrix.uniqe should have what I want. Genes * Samples. > >_______________________________________________ >Bioconductor mailing list >Bioconductor at r-project.org >https://stat.ethz.ch/mailman/listinfo/bioconductor >Search the archives: >http://news.gmane.org/gmane.science.biology.informatics.conductor

ADD COMMENT • link 11.9 years ago Sidders, Benjamin ▴ 40

Login before adding your answer.