Question: observations on affyprobeminer
11.1 years ago by
Mark W Kimpel830
Mark W Kimpel830 wrote:
I have recently explored the use of alternative CDFs from affyprobeminer (APM) on a 36-array dataset derived using the Affy rat2302 chipset. I used both the Affy CDF and the transcript-level affyprobeminer CDF. I preprocessed using RMA, filtered using an A/P filter, and statistically analyzed using an appropriate lme model followed by qvalue FDR correction. I set my FDR threshold at 5%. I eliminated duplicate genes by picking the one with the lowest p-value.

Using the Affy CDF, I got ~2000 sig. genes; with APM, ~1000. If I choose only those EntrezGene identifiers present on both CDFs, my number sig. with the APM CDF was ~1000 and there was a 90% overlap with the Affy sig. list. My conclusion from the latter observation is that I am measuring largely the same transcripts/genes with both CDFs.

I was interested in the ~1000 genes which are annotated with the Affy CDF but not the APM CDF. Following the logic behind APM, I would assume that these would be largely incorrectly annotated probesets or probesets that are not really measuring any "real" transcript. This list should, then, consist largely of random genes. To test this hypothesis, I used the Category package to test for over-representation of GO and KEGG categories in my various lists. What I found was a huge degree of overlap between: 1. the Affy genes also annotated with APM, 2. the Affy genes not annotated with APM, 3. the genes derived solely from APM.

My conclusion from this latest observation is that APM is not annotating a large number of genes/transcripts that are in fact real. Assuming that APM is correctly throwing out some "junk" probesets, is it throwing out the baby with the bathwater?

I'd be interested to hear the thoughts and experiences of others.
I've certainly run into occasions where Affy-annotated probesets turn out to represent introns or something other than they purport to be, and I was hoping that APM would solve this problem, but I don't want to use it if it means a massive loss of truly significant data.

Mark

--
Mark W. Kimpel MD ** Neuroinformatics ** Dept. of Psychiatry
Indiana University School of Medicine
15032 Hunter Court, Westfield, IN 46074
(317) 490-5129 Work, & Mobile & VoiceMail
(317) 663-0513 Home (no voice mail please)
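The comparison described in the post (one probeset per Entrez Gene ID chosen by lowest p-value, a 5% FDR cut-off, then the overlap between two significant lists) can be sketched in base R. This is only an illustration under stated assumptions: the table `res`, the lists `annotated_b` and `sig_b`, and the helper `dedup_lowest_p` are made up, Benjamini-Hochberg adjustment via `p.adjust` stands in for the qvalue package, and doing the deduplication before the FDR step is a guess, since the post does not state the order.

```r
# Toy results table; every value here is invented for illustration.
res <- data.frame(
  probeset = c("ps1", "ps2", "ps3", "ps4", "ps5", "ps6"),
  entrez   = c("101", "101", "202", "303", "303", "404"),
  p        = c(0.004, 0.030, 0.200, 0.001, 0.0005, 0.010),
  stringsAsFactors = FALSE
)

# Keep, for each Entrez Gene ID, the probeset with the lowest p-value.
dedup_lowest_p <- function(df) {
  df <- df[order(df$p), ]
  df[!duplicated(df$entrez), ]
}
best <- dedup_lowest_p(res)

# Benjamini-Hochberg adjustment (a stand-in for qvalue), 5% threshold.
best$fdr <- p.adjust(best$p, method = "BH")
sig_a <- best$entrez[best$fdr <= 0.05]

# Hypothetical second annotation: IDs it covers and IDs it calls significant.
annotated_b <- c("101", "303", "404", "505")
sig_b       <- c("101", "303", "505")

# Restrict the first list to IDs present in both annotations, then overlap.
sig_a_common     <- intersect(sig_a, annotated_b)
shared           <- intersect(sig_a_common, sig_b)
overlap_fraction <- length(shared) / length(sig_b)
```

Picking the lowest p-value per gene before multiple-testing correction favours genes with many probesets, so the two CDFs are not treated symmetrically when one of them has more probesets per gene.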
go rat2302 cdf affy qvalue • 579 views
modified 11.0 years ago by Hongfang Liu10 • written 11.1 years ago by Mark W Kimpel830
11.1 years ago by
lgautier@altern.org950 wrote:
> My conclusion from this latest observation is that APM is not annotating a large number of genes/transcripts that are in fact real. Assuming that APM is correctly throwing out some "junk" probesets, is it throwing out the baby with the bathwater?

Not necessarily.

With Affymetrix mappings, there are a large number of cases in which there are multiple probesets for a "gene" (in the example below with hgu133a, that represents 20% of the probesets), and those probesets can be collapsed into one when remapping.
Here is an example with a few probesets (the example is mostly a copy-paste from one of the examples in the vignette "altcdfenvs"):

    geneSymbols <- c("IGKC", "IL8", "NENF", "TRIO")

    # Count the probesets associated with our geneSymbols
    library(hgu133a.db)
    sapply(geneSymbols,
           function(x) length(mappedkeys(subset(hgu133aSYMBOL, Rkeys=x))))
    # This returns:
    # IGKC  IL8 NENF TRIO
    #   15   12    6    9
    # Which means that there are 9 probesets for TRIO, 6 for NENF, etc...

    # Now let's check what comes out of remapping
    library(altcdfenvs)
    library(biomaRt)
    library(hgu133aprobe)  # probe sequences used by matchAffyProbes() below
    mart <- useMart("ensembl", dataset="hsapiens_gene_ensembl")

    # Fetch the cDNA sequences for one gene symbol, named "<symbol>-<index>"
    getSeq <- function(name) {
      seq <- getSequence(id=name, type="hgnc_symbol",
                         seqType="cdna", mart=mart)
      targets <- seq$cdna
      if (is.null(targets))
        return(character(0))
      names(targets) <- paste(seq$hgnc_symbol, seq_len(nrow(seq)), sep="-")
      return(targets)
    }

    targets <- unlist(lapply(geneSymbols, getSeq))
    m <- matchAffyProbes(hgu133aprobe, targets, "HG-U133A")
    hg <- toHypergraph(m)
    gn <- toGraphNEL(hg)

    # One colour per gene symbol, plus a default colour for all other nodes
    library(RColorBrewer)
    col <- brewer.pal(length(geneSymbols)+1, "Set1")
    tColors <- rep(col[length(col)], length=numNodes(gn))
    names(tColors) <- nodes(gn)
    for (col_i in 1:(length(col)-1)) {
      node_i <- grep(paste("^", geneSymbols[col_i], "-", sep=""),
                     names(tColors))
      tColors[node_i] <- col[col_i]
    }
    nAttrs <- list(fillcolor = tColors)

    library(Rgraphviz)
    plot(gn, "twopi", nodeAttrs=nAttrs)
    # the plot will show that the situation is not as simple for TRIO
    # as it is with the other gene symbols.

> I'd be interested to hear the thoughts and experiences of others. I've certainly run into occasions where Affy annotated probesets turn out to represent introns or something other than they purport to be, and I was hoping that APM would solve this problem, but I don't want to use it if it means a massive loss of truly significant data.

The situation is indeed not always clear...
at the moment, I would not advise you to follow any particular mapping blindly, but rather to have alternative mappings as part of your routine analysis: depending on the cost of follow-up experiments or downstream analysis, time should be spent looking at probesets in detail.

Hoping this helps,

L.
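The point about several probesets collapsing into one per gene can be illustrated without any annotation package. The sketch below uses an invented probeset-to-symbol table (`mapping`, with placeholder symbols such as `GENE_A`), so the numbers mean nothing beyond the toy data; it also assumes that the quoted percentage refers to the share of probesets whose gene has more than one probeset, which is only one possible reading.

```r
# Hypothetical probeset-to-symbol mapping (not real annotation data).
mapping <- data.frame(
  probeset = paste0("ps", 1:8),
  symbol   = c("GENE_A", "GENE_A", "GENE_A", "GENE_B",
               "GENE_C", "GENE_C", "GENE_D", NA),
  stringsAsFactors = FALSE
)

# Number of probesets annotated to each symbol (table() drops NA by default).
per_gene <- table(mapping$symbol)

# Symbols represented by more than one probeset.
multi <- names(per_gene)[per_gene > 1]

# Fraction of all probesets that share their symbol with another probeset.
frac_redundant <- mean(mapping$symbol %in% multi)

# Number of probesets that disappear when collapsing to one entry per symbol.
n_collapsed_away <- sum(per_gene - 1)
```

With a real annotation package the same counts could be taken from its probeset-to-symbol table; a drop in the number of significant "genes" after remapping can then be compared against `n_collapsed_away` before concluding that real signal was lost.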
I was not clear in my original post: I filtered out duplicate Entrez Gene IDs, taking only the one with the lowest p-value.

I do appreciate your advice. We still don't have a perfect solution, and it is good to use multiple tools to look at each dataset, always taking into consideration a priori the questions that we really want answered, the cost/benefit ratio of increased sensitivity/decreased specificity, etc. with various techniques.

I do thank you for your efforts at developing APM; it is a great tool and one that I am sure to use, and publish with, in the future.

Mark

On Sun, May 4, 2008 at 5:46 AM, <lgautier@altern.org> wrote:
> The situation is indeed not always clear... at the moment, I would not advise you to follow any particular mapping blindly, but rather to have alternative mappings as part of your routine analysis [...]
> I was not clear in my original post, I filtered out duplicate entrez gene IDs taking only the one with the lowest p-value.

I see... Then there is really no simple answer to what you are observing; some of the probesets in the Affymetrix mapping are the result of combining information from several sources, and are possibly measuring a genuine RNA signal that is not (yet) included in RefSeq, for example. Hongfang has described in more detail what could be happening with splice variants. The MBNI also provides alternative mappings derived from several sources of target sequences (RefSeq, Ensembl, ...); they are definitely worth a look.

> I do appreciate your advice, we still don't have a perfect solution and it is good to use multiple tools to look at each dataset, always taking into consideration a priori the questions that we really want answered, the cost/benefit ratio of increased sensitivity/decreased specificity, etc. with various techniques.

When it comes to building hypotheses around biology, the mappings help tell whether a probeset is definitely measuring the level of a particular transcript or might be measuring a less defined "something".

> I do thank you for your efforts at developing APM, it is a great tool and one that I am sure to use, and publish with, in the future,

All credit for affyprobeminer should go to its authors (after looking it up, the list seems to be here: http://gauss.dbb.georgetown.edu/liblab/affyprobeminer/credits.html). My effort on the issue was made available in the package "altcdfenvs".

L.
11.0 years ago by
Hongfang Liu10
Hongfang Liu10 wrote:
Dear Mark,

Thanks for your input on AffyProbeMiner (APM). The following are answers to some of the questions (after consulting with Dr. Barry Zeeberg).

First, the difference between Affy CDFs and APM CDFs in the number of significant probe sets can be caused by the different number of probe sets.

Secondly, one of the motivations of our study is the inconsistency among different microarray platforms. We hoped to use remapping to improve the consistency (remapping did improve consistency between different generations of Affymetrix chips). APM, like several other remapping tools and resources, tries to make sure that probes measure the signal of the intended transcripts, i.e., that the probe sequence can be mapped to the intended transcript. Our knowledge about splice variants in a specific tissue is still limited, so APM generates gene-consistent or transcript-consistent probe sets relative to the global set of all known transcripts, where those transcripts were derived from RefSeq and GenBank and passed our QA (i.e., they must align well with the genome: 95% aligned with 99% identity). Therefore, APM will not be able to address probes measuring unknown splice variants. You can use my colleague Mike Ryan's Splice Center, http://www.tigerteamconsulting.com/SpliceCenter/SpliceOverview.jsp, to analyze known splice variants for each gene.

Also, the probes that get discarded in APM are ones that do not fall into a sufficiently large consistent probe set, or that do not map to any gene at all. If the user adjusts the threshold to accept small probe sets, then APM will throw away only probes that map to no genes at all. If the user wants to use probe sets that are not consistent, then APM should not be used at all, and the original mappings can be used.

We are not saying that all of the discarded probe sets are random. What some of them measure may potentially contain contributions from an inconsistent set of genes.
This inconsistency may degrade the reliability of the measurement, to a greater or lesser degree, depending on the relative expression levels of the "good" probes in that set and the "inconsistent" probes in that set. In any given tissue, some of the splice variants may not be expressed, so some probe sets that would be inconsistent in a global sense are not inconsistent relative to that specific tissue. We cannot provide all possible tissue-specific CDFs, but we provide the software for users to be able to produce such custom CDFs as desired.

I agree with Dr. Gautier that you need to try the original Affy CDFs as well as several custom CDFs, together with a mixture of microarray data analysis methods. After a dozen years of research, a lot of work is still needed. For example, in the Affymetrix probe set definition, the number of probes per probe set is pretty consistent, while in remapped probe sets the number of probes per probe set can vary dramatically. So existing algorithms that work well with Affymetrix CDFs probably will not work well with custom CDFs. Secondly, if we randomly group probes into probe sets (11 probes per probe set) and follow the usual microarray data analysis, there are probably still a lot of significantly expressed probe sets.

Also, the biochemistry behind microarrays is beyond my comprehension. I have no trouble understanding that a microarray can be used to compare two groups (control and treatment) and pick probe sets that are significantly different between the groups. But I still have trouble understanding what exactly the absolute measure of each probe set means, and how people can identify a set of significantly expressed probe sets.

Best regards, and I welcome more discussion on this topic.

--
Hongfang Liu, Ph.D.
Department of Biostatistics, Bioinformatics, and Biomathematics
Georgetown University Medical Center
Phone: 202-687-7933
Fax: 202-687-2581
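The discard rule described in this reply (drop probes that map to no gene, and drop probes whose consistent probe set is too small) can be sketched in base R. This is not APM's code: the table `probes`, the helper `filter_probe_sets`, and the `min_size` values are hypothetical and chosen only to show how the threshold changes what survives.

```r
# Hypothetical probe-to-gene assignments; NA means the probe maps to no gene.
probes <- data.frame(
  probe = paste0("p", 1:10),
  gene  = c("g1", "g1", "g1", "g1", "g1", "g2", "g2", "g3", NA, NA),
  stringsAsFactors = FALSE
)

# Keep probes whose gene-consistent set has at least `min_size` members.
filter_probe_sets <- function(df, min_size) {
  mapped   <- df[!is.na(df$gene), ]          # unmapped probes are always dropped
  set_size <- table(mapped$gene)             # probes per candidate probe set
  keep     <- names(set_size)[set_size >= min_size]
  mapped[mapped$gene %in% keep, ]
}

strict  <- filter_probe_sets(probes, min_size = 5)  # only large sets survive
lenient <- filter_probe_sets(probes, min_size = 1)  # only unmapped probes go
```

In this toy table the strict setting keeps only the five probes of `g1`, while the lenient setting keeps all eight mapped probes, which mirrors the statement that lowering the threshold leaves only the probes mapping to no gene to be discarded.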