Duplicate gene names after summarization with RMA (hugene.1.0.st.v1)


Guest User ★ 13k


Hello,

I am new to analyzing array files. I am attempting to generate a CSV file that contains a gene symbol and RMA-processed expression data for a set of arrays for input into an online pathway ID tool (TNBCtype, http://cbc.mc.vanderbilt.edu/tnbc/).

My problem/question (not sure if it is either, or I don't understand the process correctly): when I export the CSV file, there are duplicate entries for some gene names (e.g. ESR1). I am under the impression that RMA and the process I am using (target = 'core') summarizes at the gene level, so I am not sure why I am getting duplicate entries for certain (not all) genes after writing the expression file. I have gone through this process with some mouse array data (mouse gene 10 st arrays) and have not run into this problem of duplicate gene names.

Any insights on what I might be doing incorrectly, or in understanding the output I should expect, would be greatly appreciated.

Is averaging the values of these instances of duplicate gene names a valid thing to do?

Thank you!

-Ed O'Donnell
postdoctoral scholar
Oregon State University

My commands (Analysis.R), run as source("Analysis.R"):

---------------------

#install packages for analysis of the mouse array
source("http://bioconductor.org/biocLite.R")
biocLite("hugene10sttranscriptcluster.db")
biocLite("oligo")
biocLite("annotate")

#load required packages
library(oligo)
library(hugene10sttranscriptcluster.db)
library(annotate)

#set wd to myworkingdirectory
setwd("myworkingdirectory")

#read in the raw data from the files and the pData
rawData <- read.celfiles(list.celfiles())

#rma normalization
rmaCore <- rma(rawData, target = 'core')

#annotation
ID <- featureNames(rmaCore)
Symbol <- getSYMBOL(ID, "hugene10sttranscriptcluster.db")
Name <- as.character(lookUp(ID, "hugene10sttranscriptcluster.db", "GENENAME"))

#make a temporary data frame with all the identifiers...
tmpframe <- data.frame(ID=ID, Symbol=Symbol, Name=Name, stringsAsFactors=FALSE)
tmpframe[tmpframe=="NA"] <- NA

#assign data frame to rma-results
fData(rmaCore) <- tmpframe

#expression table with gene name and annotation info, processed with sed after export to get the quotations in the right spot and remove NA lines
write.table(cbind(pData(featureData(rmaCore))[,"Symbol"], exprs(rmaCore)), file="better_annotation.csv", quote = FALSE, sep = ",")

----------

-- output of sessionInfo():

R version 3.0.3 (2014-03-06)
Platform: x86_64-apple-darwin10.8.0 (64-bit)

locale:
[1] en_US.UTF-8/en_US.UTF-8/en_US.UTF-8/C/en_US.UTF-8/en_US.UTF-8

attached base packages:
[1] parallel  stats     graphics  grDevices utils     datasets  methods
[8] base

other attached packages:
 [1] pd.hugene.1.0.st.v1_3.8.0            gplots_2.12.1
 [3] annotate_1.40.1                      hugene10sttranscriptcluster.db_8.0.1
 [5] org.Hs.eg.db_2.10.1                  RSQLite_0.11.4
 [7] DBI_0.2-7                            AnnotationDbi_1.24.0
 [9] limma_3.18.13                        oligo_1.26.6
[11] Biostrings_2.30.1                    XVector_0.2.0
[13] IRanges_1.20.7                       Biobase_2.22.0
[15] oligoClasses_1.24.0                  BiocGenerics_0.8.0
[17] BiocInstaller_1.12.0

loaded via a namespace (and not attached):
 [1] affxparser_1.34.2     affyio_1.30.0         bit_1.1-11
 [4] bitops_1.0-6          caTools_1.16          codetools_0.2-8
 [7] ff_2.2-12             foreach_1.4.1         gdata_2.13.2
[10] GenomicRanges_1.14.4  gtools_3.3.1          iterators_1.0.6
[13] KernSmooth_2.23-12    preprocessCore_1.24.0 splines_3.0.3
[16] stats4_3.0.3          tcltk_3.0.3           tools_3.0.3
[19] XML_3.95-0.2          xtable_1.7-3          zlibbioc_1.8.0

--
Sent via the guest posting facility at bioconductor.org.


written 10.0 years ago by Guest User • updated 10.0 years ago by James W. MacDonald


James W. MacDonald 65k


Hi Ed,

On 4/6/2014 4:24 PM, Ed O'Donnell [guest] wrote:
> My problem/question (not sure if it is either, or I don't understand the process correctly):

It's the latter. You are not summarizing at the gene level, but at the transcript level (hence the hugene10sttranscriptcluster.db, not hugene10sttgene.db). In other words, there may be multiple probesets on the array that are intended to measure different transcript variants for the same gene. As an example, for ESR1 there are apparently two probesets that interrogate two different transcripts (per the ENSEMBL transcript IDs):

8122840  ENST00000338799
8122843  ENST00000440973

And if you go to the respective web pages for these two ENSEMBL IDs, you can see that these are very different transcripts.

As far as I can tell, the vast majority of people take transcript-level data and flatten it to gene level (as you are doing), and then look for differences in quantity without regard to the form of the transcript. In that case you will have 'duplicate' genes.

If it is important to have only one entry per gene, then you can use the findLargest function in genefilter, or you could use the MBNI re-mapped CDFs based on Entrez Gene, which map the probesets to the gene level. But note that you will need to use the affy package for analysis with the MBNI CDF packages.

Best,

Jim

--
James W. MacDonald, M.S.
Biostatistician
University of Washington
Environmental and Occupational Health Sciences
4225 Roosevelt Way NE, # 100
Seattle WA 98105-6099
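
For readers who want to see what collapsing to one row per gene symbol can look like, here is a minimal base-R sketch. It is not the code from this thread: the matrix is a made-up stand-in for exprs(rmaCore), the expression values are invented, and only the two ESR1 probeset IDs come from the answer above (the other IDs and the symbols GENEA and GENEB are hypothetical). Option A keeps the probeset with the highest mean expression, which is an assumed selection rule chosen only for illustration; findLargest in genefilter instead selects by a statistic you supply. Option B averages the duplicates, as asked about in the question.

```r
# Toy stand-in for exprs(rmaCore): 5 probesets x 3 arrays (values are made up).
expr <- matrix(
  c(7.1, 7.3, 7.0,
    9.2, 9.0, 9.4,
    5.5, 5.7, 5.6,
    8.0, 8.1, 7.9,
    6.2, 6.0, 6.1),
  nrow = 5, byrow = TRUE,
  dimnames = list(
    c("8122840", "8122843", "7900001", "7900002", "7900003"),  # last three IDs are hypothetical
    c("array1", "array2", "array3")
  )
)

# Stand-in for the Symbol column; GENEA/GENEB are placeholders, NA = unannotated probeset.
symbol <- c("ESR1", "ESR1", "GENEA", NA, "GENEB")

# Drop probesets without a gene symbol.
keep   <- !is.na(symbol)
expr   <- expr[keep, , drop = FALSE]
symbol <- symbol[keep]

# Which symbols occur on more than one probeset?
dup_symbols <- unique(symbol[duplicated(symbol)])

# Option A: keep one probeset per symbol, here the one with the highest mean expression
# (an assumed criterion for this sketch).
row_mean <- rowMeans(expr)
best <- vapply(
  split(seq_along(symbol), symbol),
  function(i) i[which.max(row_mean[i])],
  integer(1)
)
by_largest <- expr[best, , drop = FALSE]
rownames(by_largest) <- names(best)

# Option B: average all probesets that share a symbol.
sums    <- rowsum(expr, group = symbol)            # per-symbol column sums
counts  <- as.vector(table(symbol)[rownames(sums)]) # probesets per symbol, same row order
by_mean <- sums / counts
```

Either result can be passed to write.table in place of the cbind(...) call in the original script, since the row names are now unique gene symbols. Which option is appropriate depends on what the downstream tool expects; the sketch only shows the mechanics.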

written 10.0 years ago by James W. MacDonald
