Question: Agi4x44PreProcess 1.4.0 question: use of genes.rpt.agi() and Gene Sets
10.1 years ago, Massimo Pinto wrote:
Greetings all,

I realised that, in my analysis, I was carrying forward multiple measurements for the same gene that had been made with independent probes. This is a feature of Agilent arrays, as I understand it. While it is clear to me that Agi4x44PreProcess offers a function to summarize replicated probes, called summarize.probe(), I cannot see a readily available function that performs a similar treatment for replicated genes, i.e. Gene Sets, as these are called in the Agi4x44PreProcess package.

The result of calling

    > genes.rpt.agi(dd, "hgug4112a.db", raw.data = TRUE, WRITE.html = TRUE, REPORT = TRUE)

is an HTML list of Gene Sets, but these are not summarized to a 'virtual' measurement, as summarize.probe() does for replicated probes.

Is there a reason why one would want to carry multiple probes for a given gene through the subsequent analysis, including linear modeling and gene ontology? If not, is there a function that takes the median of such repeats?

Thank you in advance,

Yours
Massimo Pinto

    > sessionInfo()
    R version 2.9.1 (2009-06-26)
    i386-apple-darwin8.11.1

    locale:
    C

    attached base packages:
    [1] grid stats graphics grDevices utils datasets methods base

    other attached packages:
     [1] affy_1.22.0 gplots_2.7.0 caTools_1.9 bitops_1.0-4.1 gdata_2.4.2 gtools_2.5.0-1
     [7] hgug4112a.db_2.2.11 RSQLite_0.7-1 DBI_0.2-4 Agi4x44PreProcess_1.4.0 genefilter_1.24.0 annotate_1.22.0
    [13] AnnotationDbi_1.6.0 limma_2.18.0 Biobase_2.4.1

    loaded via a namespace (and not attached):
    [1] affyio_1.11.3 preprocessCore_1.5.3 splines_2.9.1 survival_2.35-4 xtable_1.5-5

Massimo Pinto
Post Doctoral Research Fellow
Enrico Fermi Centre and Italian Public Health Research Institute (ISS), Rome
http://claimid.com/massimopinto
hgug4112a • 770 views
modified 10.1 years ago by Francois Pepin • written 10.1 years ago by Massimo Pinto
Answer: Agi4x44PreProcess 1.4.0 question: use of genes.rpt.agi() and Gene Sets
10.1 years ago, Francois Pepin wrote:
Hi Massimo,

I don't know about Agi4x44PreProcess, but limma can do it with avereps().

In the case of Agilent arrays, I would not recommend doing that from the start. The probes mapping to the same gene often do not measure the same thing: they can target different splice variants, and some can be pretty far from the 3' end.

So for differential analysis, I would suggest keeping them separate. For other analyses that assume one probe per gene, such as gene ontology analysis, I would recommend an unbiased method to choose a representative probe per gene, for example the highest-variance probe or the one closest to the 3' end.

If you search the archives, you can find more advice, as this is a common topic.

Francois

_______________________________________________
Bioconductor mailing list
Bioconductor at stat.math.ethz.ch
https://stat.ethz.ch/mailman/listinfo/bioconductor
Search the archives: http://news.gmane.org/gmane.science.biology.informatics.conductor
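As a rough illustration of the averaging mentioned in this answer, the sketch below collapses the rows of an expression matrix that share a gene identifier. The matrix `expr`, the vector `gene` and the helper `average_by_id()` are made up for this example. The base R part computes the per-array mean over rows with the same ID, which is what limma's avereps() does for a matrix; the limma call itself only runs if the package is installed.

```r
# Toy log-expression matrix: 5 probes x 3 arrays (hypothetical values).
expr <- matrix(c(8.0, 8.2, 7.9,
                 6.0, 6.4, 6.2,
                 9.1, 9.0, 9.3,
                 5.0, 5.5, 5.2,
                 7.0, 7.1, 7.3),
               nrow = 5, byrow = TRUE,
               dimnames = list(paste0("probe", 1:5), paste0("array", 1:3)))

# Hypothetical gene assignment: GENE_A and GENE_C have two probes each.
gene <- c("GENE_A", "GENE_A", "GENE_B", "GENE_C", "GENE_C")

# Hypothetical helper: per-array mean of all rows sharing an ID.
average_by_id <- function(x, id) {
  sums   <- rowsum(x, group = id, reorder = FALSE)            # column sums per ID
  counts <- as.vector(table(factor(id, levels = unique(id)))) # rows per ID
  sums / counts                                               # divides row i by counts[i]
}

gene_means <- average_by_id(expr, gene)
print(gene_means)

# Optional cross-check against limma, only if it is installed.
if (requireNamespace("limma", quietly = TRUE)) {
  print(all.equal(unname(gene_means), unname(limma::avereps(expr, ID = gene))))
}
```

Note that this averages blindly over every probe of a gene, which is exactly the step the answer advises against doing before differential analysis on Agilent arrays.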
Hi,

The key question regarding your problem is the confidence in the measurement of a single Agilent feature. In Affy 3' expression arrays, a robust measurement is obtained by summarizing several features. For the modern Affy Gene ST arrays, the gene-based expression measurement is also obtained by feature summarization across exons (at least this is what the Affy Expression Console forces you to do).

Hence, the most intuitive and biologically relevant procedure would be to apply feature summarization to Agilent arrays as well, before doing the statistics. How this summarization should be done cannot easily be answered without analysis of reference samples. My personal experience: there is not a big difference between taking the median signal and just taking the feature with the highest variance. If you are particularly interested in categorizing responders, the variance method is probably more sensitive.

Best,
Tobias

----------------------------------------------------------------------
Tobias Straub ++4989218075439 Adolf-Butenandt-Institute, München D
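One of the two options compared in this reply, taking the median signal per gene, can be sketched in base R as follows. This is a minimal example on invented numbers; `expr`, `gene` and the helper `median_by_gene()` are hypothetical names and not part of Agi4x44PreProcess or limma.

```r
# Toy log-expression matrix: 6 probes x 4 arrays (hypothetical values).
expr <- matrix(c(8.0, 8.4, 7.8, 8.1,
                 6.0, 6.2, 6.1, 6.3,
                 7.0, 9.0, 5.0, 7.2,
                 5.5, 5.6, 5.4, 5.7,
                 9.0, 9.2, 9.1, 9.3,
                 4.0, 4.4, 4.2, 4.0),
               nrow = 6, byrow = TRUE,
               dimnames = list(paste0("probe", 1:6), paste0("array", 1:4)))
gene <- c("GENE_A", "GENE_A", "GENE_A", "GENE_B", "GENE_C", "GENE_C")

# Hypothetical helper: per-array median over all probes of each gene.
median_by_gene <- function(x, gene) {
  rows_by_gene <- split(seq_len(nrow(x)), gene)
  per_gene <- lapply(rows_by_gene, function(rows) {
    apply(x[rows, , drop = FALSE], 2, median)
  })
  do.call(rbind, per_gene)  # genes in rows, arrays in columns
}

gene_medians <- median_by_gene(expr, gene)
print(gene_medians)
```

With only two probes for a gene the median equals the mean, so the choice between the two matters mainly for genes with three or more probes.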
Hi all,

I was trying to run affyQCReport on a set of Affy human U133A chips, but I got the following error. Can anyone help, please?

Thanks.
Chris

    Error in setQCEnvironment(cdfn) :
      Could not find array definition file ' hthgu133acdf.qcdef '. Simpleaffy does not know the QC parameters for this array type.
    See the package vignette for details about how to specify QC parameters manually.
    Error in plot(qc(object)) :
      error in evaluating the argument 'x' in selecting a method for function 'plot'
Hi Chris,

Man, Chris T. wrote:
> I was trying to do affyQCReport on a set of affy human U133a chips, but I got the following error.
>
> Error in setQCEnvironment(cdfn) :
> Could not find array definition file ' hthgu133acdf.qcdef '. Simpleaffy does not know the QC parameters for this array type.
> See the package vignette for details about how to specify QC parameters manually.

You aren't using hgu133a chips. You are using hthgu133a chips, which aren't the same. However, the QC parameters for the hgu133a chips will probably be the same, so you should follow the hint in the error message: look at the package vignette to see how to specify the QC parameters manually, and then do so using the parameters for the hgu133a chip (which are already specified).

Best,
Jim

--
James W. MacDonald, M.S.
Biostatistician
Douglas Lab
University of Michigan
Department of Human Genetics
5912 Buhl
1241 E. Catherine St.
Ann Arbor MI 48109-5618
734-615-7826
> Are there any functions written to address each of these topics?
> 1) choosing the probe with the largest experimental variation (or with
> the maximum average)

?genefilter::findLargest should point you in the right direction.

I've found that interquartile range (?IQR) works well in many cases, in addition to variance. I'm not convinced maximum average is ideal, but I've never tried it.

> 2) choosing the probe that maps closest to the 3' end

I have yet to find a good solution for this. We've BLASTed all probes against the genome, but the poly-A isn't always easy to get in an automated way. Maybe someone else has found a nicer way to get it done.

My approach is to use strategy 1) for most of the analysis and then check the distance to the 3' end manually for the genes of interest. We are doing 2 rounds of amplification and tend to have short fragments, so checking the 3' end is necessary.

Francois
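Strategy 1) can be sketched in base R on a plain expression matrix, with the ranking statistic passed in as a function so that IQR and variance can be swapped. This is only an illustration on invented numbers: `pick_probe_per_gene()`, `expr` and `gene` are hypothetical, and the code is not how genefilter::findLargest is implemented (that function works from probe IDs and an annotation package).

```r
# Toy log-expression matrix: 5 probes x 4 arrays (hypothetical values).
expr <- matrix(c(7.0, 7.1, 7.0, 7.1,
                 5.0, 8.0, 5.5, 8.5,
                 6.0, 6.5, 6.2, 6.4,
                 9.0, 9.1, 9.0, 9.2,
                 4.0, 6.0, 4.5, 6.5),
               nrow = 5, byrow = TRUE,
               dimnames = list(paste0("probe", 1:5), paste0("array", 1:4)))
gene <- c("GENE_A", "GENE_A", "GENE_B", "GENE_C", "GENE_C")

# Hypothetical helper: keep, for each gene, the probe with the largest
# value of `stat` computed across arrays (IQR by default).
pick_probe_per_gene <- function(x, gene, stat = IQR) {
  score <- apply(x, 1, stat)
  best <- tapply(seq_len(nrow(x)), gene, function(rows) {
    rows[which.max(score[rows])]
  })
  x[sort(as.vector(best)), , drop = FALSE]
}

by_iqr <- pick_probe_per_gene(expr, gene)              # rank by IQR
by_var <- pick_probe_per_gene(expr, gene, stat = var)  # rank by variance
print(rownames(by_iqr))
print(rownames(by_var))
```

Because the statistic is computed without reference to the experimental groups, the selection does not favour probes that happen to separate the groups, which is one way to read the "unbiased" requirement stated earlier in the thread.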
Hello,

Please let me add one more possibility. An alternative to filtering would be a "gene-specific" (not probe-specific) model that includes the interaction of experimental effects (TRT) with probes, testing for such an interaction gene by gene. When you see a significant TRT-by-probe interaction, you can manually align the sequences of the probes and see whether they hint at alternative splicing effects, for example. If you do not see a significant interaction, just look at the main effect of TRT.

I am not aware of a BioC package that can do this, because it would imply different incidence matrices per gene (depending on the number of replicated probes in each gene). But it is doable with "lmer" (lme4 package) applied using the "by" function (both from R), for example (the moderated t-tests would be a challenge, though).

Cheers,
JP

--
=============================
Juan Pedro Steibel
Assistant Professor
Statistical Genetics and Genomics
Department of Animal Science & Department of Fisheries and Wildlife
Michigan State University
1205-I Anthony Hall
East Lansing, MI 48824 USA
Phone: 1-517-353-5102
E-mail: steibelj at msu.edu
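The gene-by-gene interaction test described in this reply can be sketched on simulated data. To keep the example free of extra packages, it swaps in an ordinary fixed-effects lm() with an F-test for the TRT:probe term instead of the lmer model JP suggests, and it uses split() plus sapply() in the role of by(). The data frame `dat`, its column names and the helper `test_gene()` are all hypothetical.

```r
set.seed(1)

# Simulated long-format data: 2 genes x 2 probes x 2 treatments x 4 replicates.
dat <- expand.grid(rep = 1:4,
                   TRT = c("ctrl", "treated"),
                   probe = c("p1", "p2"),
                   gene = c("GENE_A", "GENE_B"))

# GENE_A: both probes go up by 1 under treatment (no interaction).
# GENE_B: probe p1 goes up by 2, probe p2 goes down by 2 (interaction).
effect <- with(dat, ifelse(TRT == "treated",
                           ifelse(gene == "GENE_A", 1,
                                  ifelse(probe == "p1", 2, -2)),
                           0))
dat$y <- 8 + effect + rnorm(nrow(dat), sd = 0.2)

# Hypothetical helper: fit y ~ TRT * probe for one gene and return the
# p-values of the interaction and of the TRT main effect.
test_gene <- function(d) {
  tab <- anova(lm(y ~ TRT * probe, data = d))
  c(interaction = tab["TRT:probe", "Pr(>F)"],
    TRT         = tab["TRT", "Pr(>F)"])
}

# One model per gene; genes end up in columns.
p_values <- sapply(split(dat, dat$gene), test_gene)
print(signif(p_values, 3))
```

Genes with a small interaction p-value would be the ones to inspect probe by probe; for the others, the TRT row gives the main-effect test. Nothing here is moderated across genes, which is the limitation JP points out.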
Hi

On Oct 27, 2009, at 4:13 PM, Massimo Pinto wrote:

> Dear Francois, Tobias, and all users
>
> Thanks to this discussion, and those that I have found on the
> Archives, as Francois suggested, I am more aware now of the importance
> of avoiding averages of different probes that map the same gene
> transcript at different locations. I should now perhaps test - on my
> data - the two options that were discussed here in this thread, i.e.
>
> 1) choosing the probe with the largest experimental variation (or with
> the maximum average)

If you work with RGLists, you could try this function (no guarantee it works; I did not test it thoroughly):

    # collapseHighestIQR(RG = RGList, x = matrix of signals to operate on)
    # To select the probe with the highest IQR in the red channel,
    # call collapseHighestIQR(RG, RG$R)
    collapseHighestIQR <- function(RG, x) {
      # IQR of each feature across arrays, labelled by probe name
      iqr <- apply(x, 1, IQR)
      names(iqr) <- RG$genes$ProbeName
      # group the IQRs by gene and take the top probe name per gene
      iqrs <- split.default(iqr, RG$genes$GeneName)
      maxes <- sapply(iqrs, function(x) names(which.max(x)))
      # keep only the rows belonging to the selected probes
      return(RG[RG$genes$ProbeName %in% maxes, ])
    }

You should be able to recode it easily for MALists.

> 2) choosing the probe that maps closest to the 3' end
>
> Are there any functions written to address each of these topics?

I think it is important to know the layout of your array before discussing the best strategy. Using the standard Drosophila 4x44k expression array, I am very careful in selecting the proper probe, as the array comprises many probes that map to more than one gene. Those are probably not the ideal ones to keep, even if they are closest to the 3' end of one or multiple genes. So what I did was:

a) BLAST all probe sequences against the current genome
b) keep only probes that map to only one gene, based on the latest genome annotation
c) select, per gene, the probe closest to the 3' end

You do this once and save a list of probe names; in the future you just subset your expression object to contain only the probes that are in your 3' probe list. This works very well in Drosophila, but might be more difficult in organisms with a less comprehensive genome annotation.

Best,
Tobias

> I have been searching on GMANE but, other than discussions on the
> topic, I am particularly interested in functions that have
> addressed/solved these issues.
> For selection as at (2) above, I suppose one has to interrogate the
> position of probes on chromosomes.
>
> Thank you, again, in advance
> Massimo
>
> On Wed, Oct 21, 2009 at 3:48 PM, Francois Pepin
> <fpepin at cs.mcgill.ca> wrote:
>>
>> The fact that you have to summarize with Affy doesn't mean that it
>> applies to other technologies. The Affy chips need this because
>> they have shorter oligos (25bp), but the Agilent ones are longer
>> (60bp) and more reliable than individual Affy probes.
>>
>> I have to disagree with that being the most biologically relevant.
>> As I said, a lot of the probes for the same gene will not be
>> measuring the same thing: some will be at differential splice sites,
>> or preferentially tracking pseudo-genes, etc. From talking to
>> Agilent scientists, one of the criteria for keeping different
>> probes for the same gene is that they give different readings on some
>> of their test samples. Otherwise, they just take the one closest to
>> the 3' end.
>>
>> I have cases where both probes for a given gene show differential
>> expression in opposite directions. There's one I believe; the other
>> one is probably a fluke, but combining them would have been a
>> bad idea.
>>
>> Francois

----------------------------------------------------------------------
Dr. Tobias Straub ++4989218075439 Adolf-Butenandt-Institute, München D
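Steps b) and c) of the 3' end strategy, and the final subsetting, can be sketched in base R once the BLAST results from step a) have been summarized in a table. Everything below is hypothetical: the table `hits`, its columns `probe`, `gene` and `dist_to_3p` (distance from the probe to the 3' end of the matched gene), and the toy expression matrix stand in for whatever the real BLAST and annotation pipeline produces.

```r
# Hypothetical summary of BLAST hits: one row per probe-gene match.
hits <- data.frame(
  probe      = c("p1", "p2", "p2", "p3", "p4", "p5"),
  gene       = c("g1", "g1", "g2", "g1", "g2", "g3"),
  dist_to_3p = c(900, 50, 40, 300, 120, 700),
  stringsAsFactors = FALSE
)

# b) keep only probes that map to exactly one gene (p2 hits g1 and g2)
genes_per_probe <- tapply(hits$gene, hits$probe,
                          function(g) length(unique(g)))
unique_probes <- names(genes_per_probe)[genes_per_probe == 1]
hits_unique <- hits[hits$probe %in% unique_probes, ]

# c) per gene, keep the probe closest to the 3' end
probes_3p <- vapply(split(hits_unique, hits_unique$gene),
                    function(d) d$probe[which.min(d$dist_to_3p)],
                    character(1))
print(probes_3p)

# Later: subset any expression object by the saved probe list.
expr <- matrix(1:15, nrow = 5,
               dimnames = list(paste0("p", 1:5), paste0("array", 1:3)))
expr_3p <- expr[rownames(expr) %in% probes_3p, , drop = FALSE]
print(expr_3p)
```

In this toy table, p2 is the probe closest to a 3' end but is discarded because it matches two genes, which is the situation described for the Drosophila array.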