On 1/13/2014 9:47 PM, Thornton, Matthew wrote:
> I am processing some data collected with GeneChip Mouse Gene 2.0 ST
> arrays. I am using the Ambion ExFold ERCC controls (Life
> Technologies 4456739). These are "spike in" controls consisting of two
> 'mixes' with the same set of RNA sequences, 92 total, that span 10^6
> fold in concentration; furthermore, the difference in concentration
> between the two 'mixes' is well defined.
> I have processed the data using the bioconductor package vsn, using
> the protocol for normalization with "spike-in" controls. I have pulled
> out the normalized intensities for the ERCC probes from both groups
> across my samples (3 treatments and 1 wild-type). When I graph log2
> concentration versus log2 intensity, I get a sigmoid curve, with a
> linear region between a log2 intensity of 6.5 and 10.5. Is it correct
> to assume that this is the 'dynamic range' of the GeneChip for my
> experiment? If I have data that is within this range, what would be
> the most statistically (and scientifically) satisfying statistics that
> I should obtain (and relate) from the dispersion of the controls to
> make inference about my data?
I'm not sure what you are asking here. Are you asking if you should
restrict to the data that are in the linear range? Or are you asking
whether there is some statistical method that you can use to infer
something about your data based on the controls?
I will assume the former. Basically what you are seeing is that there
is a good linear relationship between starting mRNA concentrations and
expression levels between 2^6 and 2^11 or so. You could then argue that
data beyond those values are less reliable, and I don't think it would
be completely crazy to restrict your analysis based on that.
You could do so using something like the kOverA function from
genefilter, but modified somewhat.
There are two issues here. First, you don't want to use the sample
information when filtering (i.e., when you filter the data you want to
ignore everything you know about the samples except for the expression
values), because incorporating any phenotypic information will bias
your results. Second, there are certain patterns of expression that you
clearly don't want to exclude. For instance, if you have a gene where
half of the samples have expression values < 2^6 and the other half
have values > 2^11, you don't necessarily want to exclude that gene.
You may well have all treated samples > 2^11, and the wild type < 2^6,
in which case you have a clear difference in expression. So really you
want to exclude only those genes for which most or all of the samples
are < 2^6, or most or all are > 2^11.
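
A rough sketch of that kind of filter, in base R only so it does not
depend on genefilter internals. The function name outsideLinearRange,
the 0.75 cutoff for "most", and the toy matrix are all made up for
illustration; the bounds 6 and 11 are the log2 limits mentioned above
and should be replaced with whatever your own spike-in curve suggests.

```r
# Hypothetical helper (not part of genefilter): flag genes whose samples
# are mostly on ONE side of the linear range. No phenotype data is used.
outsideLinearRange <- function(x, lower = 6, upper = 11, prop = 0.75) {
  # x: numeric matrix of log2 intensities, genes in rows, samples in columns
  mostlyBelow <- rowMeans(x < lower) >= prop  # fraction of samples below range
  mostlyAbove <- rowMeans(x > upper) >= prop  # fraction of samples above range
  mostlyBelow | mostlyAbove
}

# Toy data, invented for this example (log2 scale)
x <- rbind(
  allLow  = c(4.0, 5.0, 5.5, 4.5),      # all below range: excluded
  allHigh = c(12.0, 11.5, 12.5, 13.0),  # all above range: excluded
  split   = c(5.0, 5.5, 12.0, 11.5),    # half low, half high: kept
  inRange = c(7.0, 8.0, 9.0, 8.5)       # inside range: kept
)

exclude  <- outsideLinearRange(x)
filtered <- x[!exclude, , drop = FALSE]
rownames(filtered)
```

Because the low and high fractions are checked separately, a gene that
is split between the two extremes is retained, which is the pattern you
don't want to throw away.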
> Additionally from the data there is an expected fold-change between
> 'mixes' which can be compared to the fold change obtained from data
> processing using the average intensity across all samples. In my case
> what I see is that an expected 2 fold change is seen as a 1.1 fold
> change. What would be the best way to use this information to make
This is a well known phenomenon with microarrays, where the observed
fold changes are compressed downwards. I don't think there is anything
to be done with this information, except to acknowledge that this
phenomenon exists. Certainly if you are using limma to make comparisons,
you could incorporate the fold change into the test, using the treat
function instead of eBayes, but selecting an lfc value suitably small,
given the fold change compression.
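
Something along these lines, as a minimal sketch on simulated data. The
4 vs 4 design, the simulated matrix, and the choice of lfc = log2(1.1)
are assumptions for illustration only; log2(1.1) is simply the observed
value you quoted for a nominal 2 fold spike-in difference, not a
recommended threshold.

```r
library(limma)

set.seed(1)
# Simulated log2 expression, invented for this example:
# 100 genes x 8 samples (4 wild type, 4 treated)
expr <- matrix(rnorm(100 * 8, mean = 8, sd = 0.3), nrow = 100,
               dimnames = list(paste0("gene", 1:100), paste0("s", 1:8)))
group <- factor(rep(c("wt", "treated"), each = 4),
                levels = c("wt", "treated"))
# Give the first five genes a true shift of 1 on the log2 scale
expr[1:5, group == "treated"] <- expr[1:5, group == "treated"] + 1

design <- model.matrix(~ group)
fit <- lmFit(expr, design)

# treat() tests |log2 fold change| > lfc rather than log2 fold change != 0,
# so it takes the place of eBayes() here
fit2 <- treat(fit, lfc = log2(1.1))

# coef = 2 is the treated vs wild type coefficient in this design
tab <- topTreat(fit2, coef = 2, number = 10)
tab
```

With a real experiment you would build the design matrix from your own
sample annotation and pick lfc after looking at how strongly the ERCC
fold changes are compressed on your arrays.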
> Is there a forum like Stack Exchange biology or biostars that
> bioconductor list patrons prefer? The reason I am asking is
> because I have graphs which are easier to post in page rather than in
> Any feedback or commentary is greatly appreciated.
> Thank you!
> Matthew E. Thornton
> Research Lab Specialist
> Saban Research Institute
> USC/Children's Hospital Los Angeles
> 513X, Mail Stop 35
> 4661 W. Sunset Blvd.
> Los Angeles, CA 90027-6020
> matthew.thornton at med.usc.edu
James W. MacDonald, M.S.
University of Washington
Environmental and Occupational Health Sciences
4225 Roosevelt Way NE, # 100
Seattle WA 98105-6099