Tan, MinHan
Good afternoon,
I have a question about the best strategy for filtering Affymetrix
data (human tumor tissue) for the purpose of class discovery.
(This does not seem to have been directly addressed in the archive.)
Since we are not correlating with any clinical outcomes or markers, I
would not filter on correlation with any of these indices.
A recent paper in PNAS on class discovery of tumor tissue subtypes
(spotted cDNA arrays) used the following strategy for filtering: "Full
sample set using genes well measured in > 75% of samples and variably
expressed > 3-fold from the mean in at least two samples (5,153
genes)."
Considering this strategy for Affy data: there are no NAs, so the
first criterion, "well measured in > 75% of samples", does not seem
necessary.
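To make the second criterion concrete, here is a minimal sketch of a "more than 3-fold from the mean in at least two samples" filter. It assumes the values are already on the log2 scale (as RMA output is), so that "3-fold from the mean" becomes a distance of log2(3) from the per-gene mean of the log values. The function name, the gene names and the numbers are all hypothetical and only serve as illustration.

```python
import math
from statistics import mean


def passes_fold_filter(log2_values, fold=3.0, min_samples=2):
    """Return True if at least `min_samples` values lie more than
    `fold`-fold away from the gene's mean (all on the log2 scale)."""
    center = mean(log2_values)      # mean of log2 values for this gene
    threshold = math.log2(fold)     # 3-fold is about 1.585 log2 units
    n_hits = sum(1 for v in log2_values if abs(v - center) > threshold)
    return n_hits >= min_samples


# Hypothetical log2 expression values, one list per gene.
expr = {
    "gene_a": [7.0, 7.1, 6.9, 7.0, 10.5, 10.6],  # two clearly elevated samples
    "gene_b": [8.0, 8.1, 7.9, 8.0, 8.2, 8.1],    # essentially flat
}

kept = [gene for gene, values in expr.items() if passes_fold_filter(values)]
print(kept)
```

Note that averaging log2 values corresponds to a geometric mean on the raw intensity scale, which may or may not match what the paper did with its ratios.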
Would it make sense to use the second filter, "variably expressed > 3
fold from the mean in at least 2 samples", for RMA-normalized data, or
would it be too noisy? (I suspect it would be too noisy for otherwise
unfiltered MAS5.0 data at low intensities.) I have been filtering
Affy data on the coefficient of variation (CV = sd/mean), combined
with a minimum of 2 samples with an RMA expression value of 8
(2^8 = 256), but I am not sure how best to validate such an approach.
I am particularly concerned that the CV is a single value per gene,
computed across the whole sample set, so I may not be able to capture
small subclusters, especially with a large number of samples.
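The concern about small subclusters can be illustrated with a rough sketch of this CV-plus-minimum-expression filter. It assumes the CV is computed directly on the log2 (RMA) values and uses an arbitrary CV cutoff of 0.1; both are assumptions made for the example, and the data are invented.

```python
from statistics import mean, stdev


def coefficient_of_variation(values):
    """CV = sample standard deviation divided by the mean."""
    return stdev(values) / mean(values)


def passes_cv_filter(values, min_cv=0.1, min_expr=8.0, min_samples=2):
    """Keep a gene if at least `min_samples` samples reach `min_expr`
    and the CV across all samples is at least `min_cv`.
    The cutoff values are hypothetical defaults."""
    n_expressed = sum(1 for v in values if v >= min_expr)
    return n_expressed >= min_samples and coefficient_of_variation(values) >= min_cv


# Invented data for 100 samples, log2 scale.
small_subcluster = [8.0] * 97 + [11.0] * 3   # 3 samples are 8-fold higher
even_split = [8.0] * 50 + [11.0] * 50        # same shift in half the samples

for name, values in [("small_subcluster", small_subcluster),
                     ("even_split", even_split)]:
    cv = coefficient_of_variation(values)
    print(name, round(cv, 3), passes_cv_filter(values))
```

In this toy case the gene with an 8-fold shift in only 3 of 100 samples has a much lower CV than the evenly split gene and falls below the cutoff, which is the behaviour I am worried about.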
I wonder if this makes sense: based on the assumption that Affymetrix
CEL intensities below 150 are unreliable and indicate merely a low
value (an assumption derived from a couple of sources), I would aim
to keep genes with at least 2 samples with an intensity of 200-300
(depending on the number of samples), in order to pick up at least a
distinct downregulation without running into the 'reliability below
150' issue. I guess the next problem is how to capture small changes
in intensity (if that is even possible). If I were to use a
fold-change filter, I would miss genes that were expressed at, say,
10000 (log2 = 13.3) in 50 samples and 15000 (log2 = 13.9) in another
50 samples. If I were to use CV, I may miss a subcluster.
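The arithmetic behind the 10000 versus 15000 example can be checked with a short sketch. It assumes a 3-fold-from-the-mean cutoff applied on the log2 scale, with the two groups being equally sized so that the gene's mean is the midpoint of the two log2 values.

```python
import math

low, high = 10000.0, 15000.0

# log2 values of the two expression levels (about 13.3 and 13.9)
log_low, log_high = math.log2(low), math.log2(high)

# Fold change between the two groups of 50 samples
fold_change = high / low

# With 50 samples at each level, the gene's mean log2 value is the midpoint
center = (log_low + log_high) / 2

# Largest distance of any sample from the mean, in log2 units
max_deviation = max(abs(log_low - center), abs(log_high - center))

# A 3-fold filter needs a deviation of more than log2(3), about 1.585
passes_three_fold = max_deviation > math.log2(3.0)

print(round(log_low, 1), round(log_high, 1), fold_change, passes_three_fold)
```

The two groups differ by only 1.5-fold, so no sample comes anywhere near 3-fold from the mean and the gene would be dropped, even though the shift is shared by half of the samples.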
Your advice would be greatly appreciated!
Min-Han Tan