>On Fri, May 30, 2003 at 05:28:45PM +0100, Crispin Miller wrote:
> > Hi,
> > Just a quick question about low expression levels on Affy systems - I
> > hope it's not too off-topic; it is about normalisation and data analysis...
> > I've heard a lot of people advocating that it's a good idea to perform
> > an initial filtering on either Present, Marginal or Absent calls, or on
> > gene-expression levels (so that only genes with an expression > 40, say,
> > after scaling to a TGT of 100 using the MAS5.0 algorithm, are part of the
> > further analysis). Firstly, am I right in thinking that this is to
> > eliminate data that are too close to the background noise level of the
> > system?
> >
> > I wanted to canvass opinion as to whether people feel we need to do this
> > if we have replicates and are using statistical tests - rather than just
> > fold-changes - to identify 'interesting' genes. Does the statistical
> > testing do this job for us?
>
> Hi,
> In my opinion you should always do some sort of non-specific
> filtering. What you have described is one form of it; others include
> removing genes that show little or no variability across samples.
> I think of non-specific filtering as filtering without reference to
> phenotype (of any sort).
>
> There are a number of reasons for doing this, some motivated by the
> biology and some by the statistics.
>
> First off, especially for Affy, the chip is designed for all tissue
> types, but a commonly held belief is that only about 40% of the genome
> is expressed in any specific tissue type. So, for any experiment you
> will have a pretty large number of probes for genes that are not
> expressed in the tissue you are looking at.
>
> From a statistical perspective you need to be a little bit cautious
> if you are going to standardize genes across samples (this is pretty
> common). If you do not remove those genes that show little
> variability before standardization then you have just elevated the
> noise to the same status as the signal (and if the 40% estimate is
> right then you actually have more noise than signal - not too
> pleasant).
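
The standardization point above can be illustrated with a small simulation. This is only a sketch on made-up data: the two genes, their means and their noise levels are assumptions chosen for illustration, not values from any real chip.

```python
import numpy as np

rng = np.random.default_rng(0)

# Hypothetical log2-scale data: 10 samples, two groups of 5.
# gene_a carries real signal (group means differ by 4 units).
# gene_b is unexpressed: small measurement noise around a constant.
gene_a = np.concatenate([rng.normal(6.0, 0.5, 5), rng.normal(10.0, 0.5, 5)])
gene_b = rng.normal(5.0, 0.05, 10)


def standardize(x):
    """Centre to mean 0 and scale to sample standard deviation 1."""
    return (x - x.mean()) / x.std(ddof=1)


raw_sd = {"signal": gene_a.std(ddof=1), "noise": gene_b.std(ddof=1)}
std_sd = {
    "signal": standardize(gene_a).std(ddof=1),
    "noise": standardize(gene_b).std(ddof=1),
}

print("SD before standardization:", raw_sd)
print("SD after standardization: ", std_sd)
```

Before standardization the noise-only gene has a far smaller spread than the signal gene; afterwards both have standard deviation 1, so any downstream method that treats genes equally gives the noise the same weight as the signal.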
>
> Using a test statistic (such as a t-test) does not help, since that
> measures the between-group differences relative to the variation. So
> if there is very little variation and a small difference in mean,
> you get an enormous t-statistic and a small p-value. Of course,
> in this case looking at the "fold-change" or the size of the effect
> will indicate a problem, but not many people check all the things
> that need checking (and what to check depends on the test that
> you have just carried out). It seems to me to be much easier to just
> filter those genes with no expression or little variation out at the
> very start.

All good points. One thing that does help though is to use a t-statistic
(or F or posterior odds or whatever) in which some form of shrinkage to a
common value has been applied to the standard deviations. This has the
effect of raising the smaller sample variances so that they are not less
than a certain size. We have found that empirical Bayes t-statistics do a
good job of eliminating the low-signal, low-variability genes without
needing an explicit filtering step.
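
As a rough illustration of the shrinkage idea, the sketch below compares an ordinary pooled-variance t-statistic with one whose variance is pulled toward a common prior value. The data are made up, and `PRIOR_VAR` and `PRIOR_DF` are fixed by hand here as assumptions; an empirical Bayes method would estimate them from all genes on the array, so this is not the implementation of any particular package.

```python
import numpy as np


def ordinary_t(x, y):
    """Two-sample pooled-variance t-statistic; returns (t, pooled variance, df)."""
    nx, ny = len(x), len(y)
    df = nx + ny - 2
    s2 = ((nx - 1) * np.var(x, ddof=1) + (ny - 1) * np.var(y, ddof=1)) / df
    t = (np.mean(x) - np.mean(y)) / np.sqrt(s2 * (1 / nx + 1 / ny))
    return t, s2, df


def shrunken_t(x, y, prior_var, prior_df):
    """t-statistic whose variance is a weighted average of the gene's own
    pooled variance and a common prior variance."""
    nx, ny = len(x), len(y)
    _, s2, df = ordinary_t(x, y)
    s2_shrunk = (prior_df * prior_var + df * s2) / (prior_df + df)
    return (np.mean(x) - np.mean(y)) / np.sqrt(s2_shrunk * (1 / nx + 1 / ny))


# Hypothetical log2 intensities, three replicates per group.
# "flat" gene: tiny difference in means, almost no variation.
flat_a = np.array([5.00, 5.01, 4.99])
flat_b = np.array([5.05, 5.06, 5.04])
# "real" gene: large difference in means, ordinary variation.
real_a = np.array([7.0, 7.4, 6.8])
real_b = np.array([9.1, 8.7, 9.3])

PRIOR_VAR = 0.05  # assumed common variance (hand-picked for illustration)
PRIOR_DF = 4      # assumed weight given to the prior

t_flat_ord = ordinary_t(flat_a, flat_b)[0]
t_flat_shrunk = shrunken_t(flat_a, flat_b, PRIOR_VAR, PRIOR_DF)
t_real_ord = ordinary_t(real_a, real_b)[0]
t_real_shrunk = shrunken_t(real_a, real_b, PRIOR_VAR, PRIOR_DF)

print(f"flat gene: ordinary t = {t_flat_ord:.2f}, shrunken t = {t_flat_shrunk:.2f}")
print(f"real gene: ordinary t = {t_real_ord:.2f}, shrunken t = {t_real_shrunk:.2f}")
```

With these numbers the flat gene has a large ordinary t-statistic despite a 0.05 difference in means, and the shrunken version drops below 1 in absolute value, while the real gene stays large under both.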

I have also wondered about the biological argument that many genes might
not be represented in a particular sample, and whether this means that
non-specific filtering should be applied. I guess the reason that I don't
do it at the moment is that I'm somewhat uneasy about possible selection
bias in the filtered intensities and standard deviations. Another factor
which allows us to avoid non-specific filtering is the use of background
correction methods which ensure that the lower intensities are not
especially variable.

Just some other thoughts.

Cheers
Gordon

> If they don't show any variation across samples they can't help to
> classify or to cluster (there is no information about any phenotype
> contained in them).
>
> Robert
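
For readers who want to try the non-specific filtering discussed in this thread, here is a minimal sketch. The level cutoff of 40 is the example value from the original question; the function name `nonspecific_filter`, the `min_samples` rule and the IQR cutoff are hypothetical choices for illustration. No phenotype labels are used anywhere, which is what makes the filter non-specific.

```python
import numpy as np


def nonspecific_filter(expr, min_level=40.0, min_samples=2, min_iqr=0.5):
    """Return a boolean mask of genes to keep.

    expr: genes x samples array on the scaled (non-log) intensity scale.
    A gene is kept if it exceeds min_level in at least min_samples samples
    AND its log2 interquartile range across samples is at least min_iqr.
    """
    expr = np.asarray(expr, dtype=float)
    expressed = (expr > min_level).sum(axis=1) >= min_samples
    # Clip at 1 before taking logs so very low values do not blow up.
    log_expr = np.log2(np.clip(expr, 1.0, None))
    q75, q25 = np.percentile(log_expr, [75, 25], axis=1)
    variable = (q75 - q25) >= min_iqr
    return expressed & variable


# Hypothetical intensities: 4 genes x 6 samples.
expr = np.array([
    [12, 20, 15, 30, 18, 25],        # never above the level cutoff
    [200, 210, 205, 195, 202, 208],  # expressed but flat
    [100, 120, 90, 800, 950, 700],   # expressed and variable
    [5, 35, 10, 38, 8, 30],          # variable but below the level cutoff
])

mask = nonspecific_filter(expr)
print(mask)
```

Only the third gene passes both criteria. The cutoffs would need tuning for any real experiment, and the concern raised above about selection bias in the filtered standard deviations still applies.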
>
>
> >
> > Crispin
> >
>--
>+---------------------------------------------------------------------------+
>| Robert Gentleman                 phone:  (617) 632-5250                   |
>| Associate Professor              fax:    (617) 632-2444                   |
>| Department of Biostatistics      office: M1B20                            |
>| Harvard School of Public Health  email:  rgentlem at jimmy.harvard.edu    |
>+---------------------------------------------------------------------------+