Filtering Affymetrix data towards class discovery
Tan, MinHan ▴ 180
Good afternoon,

I have a question about the best strategy for filtering Affymetrix data (human tumor tissue) when the goal is class discovery. (This does not seem to have been directly addressed in the archive.) Since we are not correlating with any clinical outcomes or markers, I would not filter on any of these indices.

A recent paper in PNAS on class discovery of tumor tissue subtypes (spotted cDNA arrays) used the following filtering strategy: "Full sample set using genes well measured in > 75% of samples and variably expressed > 3-fold from the mean in at least two samples (5,153 genes)."

Considering this strategy for Affy data: there are no NAs, so the first criterion, "well measured in > 75% of samples", does not seem necessary. Would it make sense to use the second filter, "variably expressed > 3-fold from the mean in at least two samples", on RMA-normalized data, or would it be too noisy? (I suspect it would be too noisy for otherwise unfiltered MAS 5.0 data at low intensities.)

I have been filtering Affy data on the coefficient of variation (CV = sd/mean), combined with a requirement that at least 2 samples have an RMA expression value of 8 (2^8 = 256), but I am not sure how best to validate such an approach. I am particularly concerned that CV is a single value per gene computed across the whole sample set, so I may not be able to capture small subclusters, especially with a large number of samples.

I wonder if the following makes sense. Based on the assumption that Affymetrix CEL intensities below 150 are unreliable and indicate merely a low value (an assumption derived from a couple of sources), I would keep genes with at least 2 samples at an intensity of 200-300 (depending on the number of samples), in order to pick up at least a distinct downregulation without running into the "reliability below 150" issue.
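To make the two filters above concrete, here is a minimal sketch in plain Python. It is only an illustration under stated assumptions, not the method from the PNAS paper or from any Bioconductor package: the function names, the CV cutoff of 0.1, and the choice to compute CV on the log2 (RMA) values are all hypothetical. The toy gene has a subcluster of 2 samples out of 100.

```python
import math
import statistics


def fold_from_mean_filter(log2_values, fold=3.0, min_samples=2):
    """Keep a gene if at least `min_samples` samples lie more than `fold`
    away from the gene's mean, in either direction.

    Values are assumed to be on the log2 scale (e.g. RMA output), so a
    3-fold difference corresponds to log2(3) ~ 1.585 log2 units.
    """
    threshold = math.log2(fold)
    mean = statistics.fmean(log2_values)
    n_far = sum(1 for v in log2_values if abs(v - mean) > threshold)
    return n_far >= min_samples


def cv_filter(log2_values, min_cv=0.1, min_expr=8.0, min_samples=2):
    """Keep a gene if its CV (sd/mean) exceeds `min_cv` AND at least
    `min_samples` samples reach `min_expr` (log2 8 = intensity 256).

    Assumption: CV is computed directly on the log2 values; `min_cv`
    is an arbitrary illustrative cutoff.
    """
    mean = statistics.fmean(log2_values)
    if mean <= 0:
        return False
    cv = statistics.stdev(log2_values) / mean
    n_expressed = sum(1 for v in log2_values if v >= min_expr)
    return cv > min_cv and n_expressed >= min_samples


# Toy gene: 98 samples at log2 = 7, a subcluster of 2 samples at log2 = 10.
subcluster_gene = [7.0] * 98 + [10.0] * 2

print(fold_from_mean_filter(subcluster_gene))  # fold-from-mean filter
print(cv_filter(subcluster_gene))              # CV + minimum-expression filter
```

In this toy case the two high samples sit about 2.9 log2 units above the mean, so the fold-from-mean filter keeps the gene, while the gene-wide CV is diluted by the 98 unchanged samples and falls below the assumed cutoff. That is the small-subcluster concern described above.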
I guess the next problem is how to capture small changes in intensity (if that is even possible). If I were to use a fold-change filter, I would miss genes that were expressed at, say, 10000 (log2 ≈ 13.3) in 50 samples and 15000 (log2 ≈ 13.9) in another 50 samples. If I were to use CV, I may miss a subcluster.

Your advice would be greatly appreciated!

Min-Han Tan
affy affy • 568 views