Statistics questions regarding the use of Ambion ExFold ERCC standards with Affymetrix ST arrays.

Matthew Thornton ▴ 390

@matthew-thornton-5564

USA, Los Angeles, USC

Hello!

I am processing some data collected with GeneChip Mouse Gene 2.0 ST arrays. I am using the Ambion ExFold ERCC controls (Life Technologies 4456739). These are "spike-in" controls consisting of two 'mixes' with the same set of 92 RNA sequences that span a 10^6-fold range in concentration; furthermore, the difference in concentration between the two 'mixes' is well defined.

I have processed the data using the Bioconductor package vsn, following the protocol for normalization with "spike-in" controls. I have pulled out the normalized intensities for the ERCC probes from both groups across my samples (3 treatments and 1 wild-type). When I graph log2 concentration versus log2 intensity, I get a sigmoid curve, with a linear region between log2 intensities of 6.5 and 10.5. Is it correct to assume that this is the 'dynamic range' of the GeneChip for my experiment? If I have data that is within this range, what would be the most statistically (and scientifically) satisfying statistics that I should obtain (and relate) from the dispersion of the controls to make inference about my data?

Additionally, there is an expected fold change between 'mixes', which can be compared to the fold change obtained from data processing using the average intensity across all samples. In my case, an expected 2-fold change is seen as a 1.1-fold change. What would be the best way to use this information to make inference?

Is there a forum like Stack Exchange Biology or Biostars that Bioconductor list patrons prefer? I am asking because I have graphs that are easier to post in a page than in list format.

Any feedback or commentary is greatly appreciated.

Thank you!

Sincerely,

Matt

Matthew E. Thornton
Research Lab Specialist
Saban Research Institute
USC/Children's Hospital Los Angeles
513X, Mail Stop 35
4661 W. Sunset Blvd.
Los Angeles, CA 90027-6020
matthew.thornton at med.usc.edu
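The fold-change comparison in the question can be put on the log2 scale with a few lines of arithmetic. The following is only a minimal sketch in plain Python, not part of vsn or any Bioconductor package; the function name `compression_ratio` is hypothetical, and summarizing compression as a single ratio of log2 fold changes is an assumption made for illustration.

```python
import math


def compression_ratio(expected_fc, observed_fc):
    """Ratio of observed to expected log2 fold change.

    A value of 1.0 would mean no compression; values below 1.0 mean
    the observed fold change is smaller than expected.
    (Hypothetical helper, assumed for illustration only.)
    """
    return math.log2(observed_fc) / math.log2(expected_fc)


# Values quoted in the question: expected 2-fold, observed 1.1-fold.
expected_fc = 2.0
observed_fc = 1.1

expected_lfc = math.log2(expected_fc)   # log2 fold change expected from the mixes
observed_lfc = math.log2(observed_fc)   # log2 fold change seen after processing
ratio = compression_ratio(expected_fc, observed_fc)

print(f"expected log2 FC: {expected_lfc:.3f}")
print(f"observed log2 FC: {observed_lfc:.3f}")
print(f"observed/expected: {ratio:.3f}")
```

Under these numbers the observed log2 fold change is roughly 0.14 against an expected 1.0, which gives a rough sense of scale for any log fold-change threshold applied later.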

Normalization vsn graph • 1.7k views

updated 12.0 years ago by James W. MacDonald 68k • written 12.0 years ago by Matthew Thornton ▴ 390

James W. MacDonald 68k

@james-w-macdonald-5106

United States

Hi Matt,

On 1/13/2014 9:47 PM, Thornton, Matthew wrote:

> Hello!
>
> I am processing some data collected with GeneChip Mouse Gene 2.0 ST arrays. I am using the Ambion ExFold ERCC controls (Life Technologies 4456739). These are "spike-in" controls consisting of two 'mixes' with the same set of 92 RNA sequences that span a 10^6-fold range in concentration; furthermore, the difference in concentration between the two 'mixes' is well defined.
>
> I have processed the data using the Bioconductor package vsn, following the protocol for normalization with "spike-in" controls. I have pulled out the normalized intensities for the ERCC probes from both groups across my samples (3 treatments and 1 wild-type). When I graph log2 concentration versus log2 intensity, I get a sigmoid curve, with a linear region between log2 intensities of 6.5 and 10.5. Is it correct to assume that this is the 'dynamic range' of the GeneChip for my experiment? If I have data that is within this range, what would be the most statistically (and scientifically) satisfying statistics that I should obtain (and relate) from the dispersion of the controls to make inference about my data?

I'm not sure what you are asking here. Are you asking if you should just restrict to the data that are in the linear range? Or are you asking if there is some statistical method that you can use to infer something about your data based on the controls?

I will assume the former. Basically what you are seeing is that there is a good linear relationship between starting mRNA concentrations and expression levels between 2^6 and 2^11 or so. You could then argue that data beyond those values are less reliable, and I don't think it would be completely crazy to restrict your analysis based on that observation. You could do so using something like the kOverA function from genefilter, but modified somewhat. There are two issues here.

First, you don't want to use the sample types when filtering (i.e., when you filter the data you want to ignore everything you know about the samples except for the expression values), because incorporating any phenotypic information will bias your results.

Second, there are certain patterns of expression that you clearly don't want to exclude. For instance, if you have a gene where half of the samples have expression values < 2^6, and the other half are > 2^11, you don't necessarily want to exclude that gene. You may well have all treated samples > 2^11, and the wild type < 2^6, in which case you have a clear difference in expression. So really you want to exclude only those genes for which most or all of the samples are < 2^6, or most or all of the samples are > 2^11.

> Additionally, there is an expected fold change between 'mixes', which can be compared to the fold change obtained from data processing using the average intensity across all samples. In my case, an expected 2-fold change is seen as a 1.1-fold change. What would be the best way to use this information to make inference?

This is a well-known phenomenon with microarrays, where the observed fold changes are compressed downwards. I don't think there is anything to be done with this information, except to acknowledge that this phenomenon exists. Certainly if you are using limma to make comparisons you could incorporate the fold change into the test, using the treat function instead of eBayes, but selecting an lfc value suitably small, given the fold change compression.

Best,

Jim

--
James W. MacDonald, M.S.
Biostatistician
University of Washington
Environmental and Occupational Health Sciences
4225 Roosevelt Way NE, # 100
Seattle WA 98105-6099
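The filtering idea in this answer can be sketched as follows. This is a minimal illustration in plain Python rather than R, and it is not the genefilter API: the names `keep_gene` and `filter_genes`, the default bounds of 6 and 11 on the log2 scale, and the parameter `k` are all assumptions chosen to mirror the "kOverA, but modified somewhat" suggestion. A gene is kept when at least `k` samples are at or above the lower bound and at least `k` samples are at or below the upper bound, so only genes that sit almost entirely below or almost entirely above the linear range are dropped. No sample labels are used.

```python
def keep_gene(log2_values, low=6.0, high=11.0, k=1):
    """Decide whether to keep one gene, using expression values only.

    The gene is kept if at least k samples are >= low AND at least
    k samples are <= high. Genes with (nearly) all samples below the
    range, or (nearly) all above it, are excluded.
    """
    n_not_below = sum(1 for v in log2_values if v >= low)
    n_not_above = sum(1 for v in log2_values if v <= high)
    return n_not_below >= k and n_not_above >= k


def filter_genes(expression, low=6.0, high=11.0, k=1):
    """Apply keep_gene to a dict mapping gene id -> list of log2 values."""
    return {
        gene: values
        for gene, values in expression.items()
        if keep_gene(values, low, high, k)
    }


# Made-up log2 intensities for four samples per gene (illustrative only).
example = {
    "all_low": [4.1, 4.5, 5.0, 5.2],       # every sample below 2^6
    "all_high": [12.0, 12.3, 11.8, 12.5],  # every sample above 2^11
    "in_range": [7.0, 8.2, 9.1, 8.8],      # every sample in the linear range
    "switch": [12.1, 12.4, 11.9, 5.1],     # three samples high, one low
}

kept = filter_genes(example)
print(sorted(kept))
```

With `k=1` this drops only genes where every sample is on the same side outside the range, and it keeps the "switch" pattern described above; raising `k` moves the rule toward "most of the samples".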

written 12.0 years ago by James W. MacDonald 68k
