Question

Filtering Affymetrix data towards class discovery

Tan, MinHan ▴ 180

@tan-minhan-431

Last seen 11.4 years ago

Good afternoon,

I have a question about the best strategy for filtering Affymetrix data (human tumor tissue) when the goal is class discovery. (This does not seem to have been addressed directly in the archive.) Since we are not correlating expression with any clinical outcomes or markers, I would not filter on correlation with any of these indices.

A recent paper in PNAS on class discovery of tumor tissue subtypes (spotted cDNA arrays) used the following filtering strategy: "Full sample set using genes well measured in > 75% of samples and variably expressed > 3-fold from the mean in at least two samples (5,153 genes)."

Considering this strategy for Affy data: there are no NAs, so the first criterion ("well measured in > 75% of samples") does not seem necessary. Would it make sense to apply the second filter ("variably expressed > 3-fold from the mean in at least two samples") to RMA-normalized data, or would it be too noisy? (I suspect it would be too noisy for otherwise unfiltered MAS 5.0 data at low intensities.)

I have been filtering Affy data on the coefficient of variation (CV = sd/mean), combined with a requirement that at least 2 samples have an RMA expression value of 8 (2^8 = 256), but I am not sure how best to validate this approach. I am particularly concerned that CV is a single value per gene, computed across the whole sample set, so I may fail to capture small subclusters, especially when the number of samples is large.

I wonder if the following makes sense. Based on the assumption (derived from a couple of sources) that Affymetrix CEL intensities below 150 are unreliable and indicate merely a low value, I would aim to keep genes for which at least 2 samples have an intensity of 200-300 (depending on the number of samples). That should pick up at least a distinct downregulation without running into the "reliability below 150" issue.
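To make the two filters discussed above concrete, here is a minimal sketch in plain Python. It rests on several assumptions that are not stated in the post: values are log2-scale (RMA-like) expression values for one gene, "3-fold from the mean" is taken as a difference of more than log2(3) from the gene's mean on the log scale, CV is computed on the log2 values, and the CV cutoff of 0.1 is purely hypothetical. The function names are illustrative and do not come from any package.

```python
import math
import statistics


def fold_from_mean_filter(log2_values, fold=3.0, min_samples=2):
    """Keep a gene if more than `fold`-fold from its mean in >= `min_samples` samples.

    Assumes log2-scale input, so an n-fold change is a difference of log2(n).
    """
    threshold = math.log2(fold)
    mean = statistics.fmean(log2_values)
    n_variable = sum(1 for v in log2_values if abs(v - mean) > threshold)
    return n_variable >= min_samples


def cv_and_expression_filter(log2_values, min_cv=0.1, min_expr=8.0, min_samples=2):
    """Keep a gene if its CV (sd/mean) is at least `min_cv` and at least
    `min_samples` samples reach `min_expr`. `min_cv` is a hypothetical cutoff."""
    mean = statistics.fmean(log2_values)
    cv = statistics.stdev(log2_values) / mean
    n_expressed = sum(1 for v in log2_values if v >= min_expr)
    return cv >= min_cv and n_expressed >= min_samples


# Toy genes: a subcluster of 2 samples at log2 = 10 against a background at log2 = 7.
small_set = [7.0] * 8 + [10.0] * 2    # 10 samples
large_set = [7.0] * 98 + [10.0] * 2   # 100 samples, same 2-sample subcluster

fold_small = fold_from_mean_filter(small_set)
fold_large = fold_from_mean_filter(large_set)
cv_small = cv_and_expression_filter(small_set)
cv_large = cv_and_expression_filter(large_set)
```

With these toy numbers the fold-from-mean filter keeps the gene in both sets, while the CV-based filter keeps it in the 10-sample set but drops it in the 100-sample set. That is the dilution of a small subcluster by a large sample number that I am worried about, at least under this particular cutoff.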
I guess the next problem is how to capture small changes in intensity, if that is even possible. If I were to use a fold-change filter, I would miss genes expressed at, say, 10000 (log2 ≈ 13.3) in 50 samples and 15000 (log2 ≈ 13.9) in another 50 samples. If I were to use CV, I might miss a subcluster.

Your advice would be greatly appreciated!

Min-Han Tan
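The 10000 versus 15000 example can be checked with a few lines of arithmetic. This is only a sketch under the same assumptions as before: noise-free values, log2 scale, and fold change measured from the gene's mean on that scale. The function name is hypothetical.

```python
import math
import statistics


def summarize_two_group_gene(low, high, n_low, n_high):
    """Return (largest fold change from the log-scale mean, CV of the log2 values)
    for a gene expressed at `low` in n_low samples and `high` in n_high samples."""
    values = [math.log2(low)] * n_low + [math.log2(high)] * n_high
    mean = statistics.fmean(values)
    # Largest deviation from the mean, converted back from log2 to a fold change.
    max_fold = 2 ** max(abs(v - mean) for v in values)
    cv = statistics.stdev(values) / mean
    return max_fold, cv


max_fold, cv = summarize_two_group_gene(10000.0, 15000.0, 50, 50)
print(f"largest fold change from mean: {max_fold:.2f}")
print(f"CV on the log2 scale: {cv:.3f}")
```

With equal group sizes the mean sits halfway between the two groups on the log scale, so no sample is more than sqrt(1.5), roughly 1.22-fold, from the mean. That is far below a 3-fold threshold, even though the two groups are cleanly separated.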

affy

21.9 years ago • Tan, MinHan ▴ 180