annotation of a probe in hgu133plus2

0

Entering edit mode

Mayte Suarez-Farinas ▴ 510

@mayte-suarez-farinas-2068

Last seen 9.0 years ago

United States

Dear annotation builders I notice that one of my favorite genes DEFB4A is missing from the hgu133plus2.db the NetAffy database says that probeset 207356_at correspond to gene DEFB4A yet getSYMBOL('207356_at','hgu133plus2.db') returns NA. This happen in newer versions of this package, since it has been tehre before.. This gene is extremely important in lots skin diseases, inflammation etc, so it is a concern for my colleagues Any help is appreciated, Mayte Suarez-Farinas Research Assistant Professor Laboratory of Investigative Dermatology Biostatistician, Center for Clinical and Translational Science The Rockefeller University 1230 York Ave, Box 178 New York, NY 10065 Phone: +1(212) 327-8213 Fax: +1(212) 327-8232

Annotation Annotation • 2.2k views

ADD COMMENT • link updated 10.3 years ago by James W. MacDonald 65k • written 10.3 years ago by Mayte Suarez-Farinas ▴ 510

0

Entering edit mode

James W. MacDonald 65k

@james-w-macdonald-5106

Last seen 42 minutes ago

United States

Hi Mayte, The gene isn't missing, instead this probeset interrogates two different genes. Plus you are using really old methods to get the data you want. These days you should use select(). > select(hgu133plus2.db, "207356_at", c("SYMBOL","ENTREZID")) PROBEID SYMBOL ENTREZID 1 207356_at DEFB4A 1673 2 207356_at DEFB4B 100289462 Warning message: In .generateExtraRows(tab, keys, jointype) : 'select' resulted in 1:many mapping between keys and return rows And you get a warning saying that this probeset maps to more than one gene and symbol. Best, Jim On 1/9/2014 12:10 PM, Mayte Suarez-Farinas wrote: > Dear annotation builders > > I notice that one of my favorite genes DEFB4A is missing from the hgu133plus2.db > the NetAffy database says that probeset 207356_at correspond to gene DEFB4A > yet getSYMBOL('207356_at','hgu133plus2.db') returns NA. This happen in newer versions of > this package, since it has been tehre before.. This gene is extremely important in > lots skin diseases, inflammation etc, so it is a concern for my colleagues > Any help is appreciated, > > Mayte Suarez-Farinas > Research Assistant Professor > Laboratory of Investigative Dermatology > Biostatistician, Center for Clinical and Translational Science > The Rockefeller University > 1230 York Ave, Box 178 > New York, NY 10065 > Phone: +1(212) 327-8213 > Fax: +1(212) 327-8232 > > _______________________________________________ > Bioconductor mailing list > Bioconductor at r-project.org > https://stat.ethz.ch/mailman/listinfo/bioconductor > Search the archives: http://news.gmane.org/gmane.science.biology.informatics.conductor -- James W. MacDonald, M.S. Biostatistician University of Washington Environmental and Occupational Health Sciences 4225 Roosevelt Way NE, # 100 Seattle WA 98105-6099

ADD COMMENT • link 10.3 years ago James W. MacDonald 65k

0

Entering edit mode

James W. MacDonald 65k

@james-w-macdonald-5106

Last seen 42 minutes ago

United States

Hi Mayte, Please don't take conversations off-list. On Thursday, January 09, 2014 2:09:07 PM, Mayte Suarez-Farinas wrote: > Hi Jim > > Thanks for your answer. i am using the annaffy functions to crete the > the annotation tables. can you suggest a newer and easy package/set of > functions > to do this, avoiding getting a NA in probes with 2 entrezid?? Well, I can make some recommendations, but I don't have any easy suggestions. The reason these are coming up as NA in the first place is because it isn't clear what you are measuring with this probeset, or any other probeset that interrogates more than one gene. In fact you could have a scenario where one sample type is only expressing DEFB4A and the other is only expressing DEFB4B, but you would think there isn't a difference! For this gene it may not matter - I don't know anything about this gene - but whatever you do, you will have to make some assumptions about ALL of the multiply-mapped probesets, and what might be reasonable for one gene may not be reasonable at all for another gene. That is one of the reasons that MBNI came up with the re-mapped cdfs, so you can have some assurance that you are measuring the transcript from a single gene without confounding. So that is one thing you could consider, switching to the MBNI re-mapped cdf for this array. There are caveats to using those cdfs as well (mainly that the number of probes in a probeset varies widely, so the accuracy of your expression estimate varies as well. However, when you make comparisons you don't then take this variance into account, unless you use something like the puma package). These days I am using ReportingTools rather than annaffy. But just switching still won't help, as ReportingTools uses similar methods to annotate. Instead you will have to do some work yourself first. Let's assume you are using limma to do the analysis, and you are creating an MArrayLM object called 'fit2'. fit <- lmFit(eset, design) fit2 <- contrasts.fit(fit, contrast) fit2 <- eBayes(fit2) Now we can add annotations to the fit2 object so topTable() will append the annotations. gns <- select(hgu133plus2.db, featureNames(eset), c("ENTREZID","SYMBOL","GENENAME")) You can add other things as well. The problem now is that the gns data.frame is not the right dimension because of the multiply-mapped probesets. This is where the tough decisions come in. As I see it you can: 1.) Subset gns to have just one row per probeset. a.) by doing something really naive like taking just the first instance of a probeset b.) or by randomly selecting one of the multiple mappings 2.) Collapse multiple probesets so e.g. your favorite gene would now be 207356_at DEFB4A|DEFB4B 1673|100289462 and you wouldn't lose any information. However, it will be trickier to make links to e.g., Entrez Gene for these cells if you care about those things. If you do 1a or 1b (let's use 1a as an example), you can do: library(affycoretools) ## shameless plug, I know gns <- gns[!duplicated(gns[,1]),] fit2$genes <- gns tab <- topTable(fit2, coef = 1, <other arguments="" go="" here="">) htab <- HTMLReport("coef1", "Data for contrast 1", reportDirectory = "reports/") publish(tab, htab, .modifyDF = list(entrezLinks, affyLinks)) finish(htab) indx <- HTMLReport("index", "Main page") publish(Link(htab), indx) finish(indx) And then you will have a file called index.html that you can double click, and on that page will be a link you can use to see your data. The link will bring up an HTML table and there will be links to netaffx and Entrez Gene for the probeset IDs and Gene IDs (that's why you load affycoretools; to get the entrezLinks and affyLinks functions). If you want to go through door #2, you won't be able to make links to Entrez Gene without creating a function of your own. Instead you could forgo those links and just do gnlst <- tapply(1:nrow(gns), gns$PROBEID, function(x) gns[x,]) gnlst <- lapply(gnlst, function(x) apply(x, 2, function(y) paste(unique(y), collapse = " | "))) gns <- do.call("rbind", gnlst) fit2$genes <- gns and everything else the same as above except no entrezLinks for .modifyDF. Best, Jim > > thanks!!! > > Mayte Suarez-Farinas > Research Assistant Professor > Laboratory of Investigative Dermatology > Biostatistician, Center for Clinical and Translational Science > The Rockefeller University > 1230 York Ave, Box 178 > New York, NY 10065 > Phone: +1(212) 327-8213 > Fax: +1(212) 327-8232 > > > > On Jan 9, 2014, at 1:11 PM, James W. MacDonald wrote: > >> Hi Mayte, >> >> The gene isn't missing, instead this probeset interrogates two >> different genes. Plus you are using really old methods to get the >> data you want. These days you should use select(). >> >> > select(hgu133plus2.db, "207356_at", c("SYMBOL","ENTREZID")) >> PROBEID SYMBOL ENTREZID >> 1 207356_at DEFB4A 1673 >> 2 207356_at DEFB4B 100289462 >> Warning message: >> In .generateExtraRows(tab, keys, jointype) : >> 'select' resulted in 1:many mapping between keys and return rows >> >> And you get a warning saying that this probeset maps to more than one >> gene and symbol. >> >> Best, >> >> Jim >> >> On 1/9/2014 12:10 PM, Mayte Suarez-Farinas wrote: >>> Dear annotation builders >>> >>> I notice that one of my favorite genes DEFB4A is missing from the >>> hgu133plus2.db >>> the NetAffy database says that probeset 207356_at correspond to gene >>> DEFB4A >>> yet getSYMBOL('207356_at','hgu133plus2.db') returns NA. This happen >>> in newer versions of >>> this package, since it has been tehre before.. This gene is >>> extremely important in >>> lots skin diseases, inflammation etc, so it is a concern for my >>> colleagues >>> Any help is appreciated, >>> >>> Mayte Suarez-Farinas >>> Research Assistant Professor >>> Laboratory of Investigative Dermatology >>> Biostatistician, Center for Clinical and Translational Science >>> The Rockefeller University >>> 1230 York Ave, Box 178 >>> New York, NY 10065 >>> Phone: +1(212) 327-8213 >>> Fax: +1(212) 327-8232 >>> >>> _______________________________________________ >>> Bioconductor mailing list >>> Bioconductor at r-project.org <mailto:bioconductor at="" r-project.org=""> >>> https://stat.ethz.ch/mailman/listinfo/bioconductor >>> Search the archives: >>> http://news.gmane.org/gmane.science.biology.informatics.conductor >> >> -- >> James W. MacDonald, M.S. >> Biostatistician >> University of Washington >> Environmental and Occupational Health Sciences >> 4225 Roosevelt Way NE, # 100 >> Seattle WA 98105-6099 >> > -- James W. MacDonald, M.S. Biostatistician University of Washington Environmental and Occupational Health Sciences 4225 Roosevelt Way NE, # 100 Seattle WA 98105-6099

ADD COMMENT • link 10.3 years ago James W. MacDonald 65k

0

Entering edit mode

Hi James Thanks for guiding me in the right direction. I am updating my pipeline to accommodate this. My only problem was the line publish(tab, htab, .modifyDF = list(entrezLinks, affyLinks)) gave me a msg error Error in .self$prepare(value, .toHTML = .toHTML, .toDF = .toDF, .modifyDF = .modifyDF, : object 'affyLinks' not found or equivalently, publish(tt, htab, .modifyDF = list(entrezLinks ,affyLinks )) Error in .self$prepare(value, .toHTML = .toHTML, .toDF = .toDF, .modifyDF = .modifyDF, : object 'entrezLinks' not found I am sure this is an oversight of my part but after spending some Sunday hours trying to figure it out, I decided to ask you. Best, mayte > > On Thursday, January 09, 2014 2:09:07 PM, Mayte Suarez-Farinas wrote: >> Hi Jim >> >> Thanks for your answer. i am using the annaffy functions to crete the >> the annotation tables. can you suggest a newer and easy package/set of >> functions >> to do this, avoiding getting a NA in probes with 2 entrezid?? > > Well, I can make some recommendations, but I don't have any easy suggestions. The reason these are coming up as NA in the first place is because it isn't clear what you are measuring with this probeset, or any other probeset that interrogates more than one gene. In fact you could have a scenario where one sample type is only expressing DEFB4A and the other is only expressing DEFB4B, but you would think there isn't a difference! > > For this gene it may not matter - I don't know anything about this gene - but whatever you do, you will have to make some assumptions about ALL of the multiply-mapped probesets, and what might be reasonable for one gene may not be reasonable at all for another gene. > > That is one of the reasons that MBNI came up with the re-mapped cdfs, so you can have some assurance that you are measuring the transcript from a single gene without confounding. So that is one thing you could consider, switching to the MBNI re-mapped cdf for this array. There are caveats to using those cdfs as well (mainly that the number of probes in a probeset varies widely, so the accuracy of your expression estimate varies as well. However, when you make comparisons you don't then take this variance into account, unless you use something like the puma package). > > These days I am using ReportingTools rather than annaffy. But just switching still won't help, as ReportingTools uses similar methods to annotate. Instead you will have to do some work yourself first. Let's assume you are using limma to do the analysis, and you are creating an MArrayLM object called 'fit2'. > > fit <- lmFit(eset, design) > fit2 <- contrasts.fit(fit, contrast) > fit2 <- eBayes(fit2) > > Now we can add annotations to the fit2 object so topTable() will append the annotations. > > gns <- select(hgu133plus2.db, featureNames(eset), c("ENTREZID","SYMBOL","GENENAME")) > > You can add other things as well. The problem now is that the gns data.frame is not the right dimension because of the multiply-mapped probesets. This is where the tough decisions come in. As I see it you can: > > 1.) Subset gns to have just one row per probeset. > a.) by doing something really naive like taking just the first instance of a probeset > b.) or by randomly selecting one of the multiple mappings > 2.) Collapse multiple probesets so e.g. your favorite gene would now be > > 207356_at DEFB4A|DEFB4B 1673|100289462 > > and you wouldn't lose any information. However, it will be trickier to make links to e.g., Entrez Gene for these cells if you care about those things. If you do 1a or 1b (let's use 1a as an example), you can do: > > library(affycoretools) ## shameless plug, I know > gns <- gns[!duplicated(gns[,1]),] > fit2$genes <- gns > > tab <- topTable(fit2, coef = 1, <other arguments="" go="" here="">) > > htab <- HTMLReport("coef1", "Data for contrast 1", reportDirectory = "reports/") > publish(tab, htab, .modifyDF = list(entrezLinks, affyLinks)) > finish(htab) > indx <- HTMLReport("index", "Main page") > publish(Link(htab), indx) > finish(indx) > > > And then you will have a file called index.html that you can double click, and on that page will be a link you can use to see your data. The link will bring up an HTML table and there will be links to netaffx and Entrez Gene for the probeset IDs and Gene IDs (that's why you load affycoretools; to get the entrezLinks and affyLinks functions). > > If you want to go through door #2, you won't be able to make links to Entrez Gene without creating a function of your own. Instead you could forgo those links and just do > > gnlst <- tapply(1:nrow(gns), gns$PROBEID, function(x) gns[x,]) > gnlst <- lapply(gnlst, function(x) apply(x, 2, function(y) paste(unique(y), collapse = " | "))) > gns <- do.call("rbind", gnlst) > > fit2$genes <- gns > > and everything else the same as above except no entrezLinks for .modifyDF. > > Best, > > Jim > >> >> thanks!!! >> >> Mayte Suarez-Farinas >> Research Assistant Professor >> Laboratory of Investigative Dermatology >> Biostatistician, Center for Clinical and Translational Science >> The Rockefeller University >> 1230 York Ave, Box 178 >> New York, NY 10065 >> Phone: +1(212) 327-8213 >> Fax: +1(212) 327-8232 >> >> >> >> On Jan 9, 2014, at 1:11 PM, James W. MacDonald wrote: >> >>> Hi Mayte, >>> >>> The gene isn't missing, instead this probeset interrogates two >>> different genes. Plus you are using really old methods to get the >>> data you want. These days you should use select(). >>> >>> > select(hgu133plus2.db, "207356_at", c("SYMBOL","ENTREZID")) >>> PROBEID SYMBOL ENTREZID >>> 1 207356_at DEFB4A 1673 >>> 2 207356_at DEFB4B 100289462 >>> Warning message: >>> In .generateExtraRows(tab, keys, jointype) : >>> 'select' resulted in 1:many mapping between keys and return rows >>> >>> And you get a warning saying that this probeset maps to more than one >>> gene and symbol. >>> >>> Best, >>> >>> Jim >>> >>> On 1/9/2014 12:10 PM, Mayte Suarez-Farinas wrote: >>>> Dear annotation builders >>>> >>>> I notice that one of my favorite genes DEFB4A is missing from the >>>> hgu133plus2.db >>>> the NetAffy database says that probeset 207356_at correspond to gene >>>> DEFB4A >>>> yet getSYMBOL('207356_at','hgu133plus2.db') returns NA. This happen >>>> in newer versions of >>>> this package, since it has been tehre before.. This gene is >>>> extremely important in >>>> lots skin diseases, inflammation etc, so it is a concern for my >>>> colleagues >>>> Any help is appreciated, >>>> >>>> Mayte Suarez-Farinas >>>> Research Assistant Professor >>>> Laboratory of Investigative Dermatology >>>> Biostatistician, Center for Clinical and Translational Science >>>> The Rockefeller University >>>> 1230 York Ave, Box 178 >>>> New York, NY 10065 >>>> Phone: +1(212) 327-8213 >>>> Fax: +1(212) 327-8232 >>>> >>>> _______________________________________________ >>>> Bioconductor mailing list >>>> Bioconductor at r-project.org <mailto:bioconductor at="" r-project.org=""> >>>> https://stat.ethz.ch/mailman/listinfo/bioconductor >>>> Search the archives: >>>> http://news.gmane.org/gmane.science.biology.informatics.conductor >>> >>> -- >>> James W. MacDonald, M.S. >>> Biostatistician >>> University of Washington >>> Environmental and Occupational Health Sciences >>> 4225 Roosevelt Way NE, # 100 >>> Seattle WA 98105-6099 >>> >> > > -- > James W. MacDonald, M.S. > Biostatistician > University of Washington > Environmental and Occupational Health Sciences > 4225 Roosevelt Way NE, # 100 > Seattle WA 98105-6099

ADD REPLY • link 10.3 years ago Mayte Suarez-Farinas ▴ 510

Login before adding your answer.