I am experimenting preprocessed Affymetrix microarrays expression data matrix (Affymetrix probe-sets in rows (32830 probesets), and RNA samples in columns (735 samples)) for my downstream analysis. Here is how the data looks like:
> dim(eset_HTA20)  32830 735 ## load assayData HTA20_rma <- load("HTA20_RMA.RData") > eset_HTA20[1:10,1:3] Tarca_001_P1A01 Tarca_003_P1A03 Tarca_004_P1A04 1_at 6.062215 6.125023 5.875502 10_at 3.796484 3.805305 3.450245 100_at 5.849338 6.191562 6.550525 1000_at 3.567779 3.452524 3.316134 10000_at 6.166815 5.678373 6.185059 100009613_at 4.443027 4.773199 4.393488 100009676_at 5.836522 6.143398 5.898364 10001_at 6.330018 5.601745 6.137984 10002_at 4.922339 4.711765 4.628124 10003_at 2.689344 2.771010 2.556756
However, I am interested in to reduce the dimension of the dataset by tossing off low expressed genes in the experiment. To do so, I need to quantify the variation of the gene that expressed. I checked the possible approach and using the coefficient of variation (
cv) could be one of the approaches I could try. It is not intuitive for me how to quantify the variation of genes that expressed.
My attempt with
Here is my attempt to find out highly variable genes by using
eset_exprs <- ExpressionSet(eset_HTA20) f1 <- kOverA(5, 200) ffun <- filterfun(f1) wh1 <- genefilter(exprs(eset_exprs), ffun) sum(wh1)
but results are not very meaningful to me because I am expecting at least 5k genes after filtration, and if I set up expression measure as 200, then no gene is left after all. Why? How can I tune the parameter to get a correct number of filtered genes which shows high variation in expression? Any idea? I am expecting the output as data.frame for filtered genes with significant expression variation.
My attempt by using
require(DESeq) lib.size <- estimateSizeFactorsForMatrix(eset_HTA20) ed <- t(t(eset_HTA20)/lib.size) means <- rowMeans(ed) vars <- apply(ed,1,var) cv2 <- vars/means^2 par(mar=c(3.5,3.5,1,1),mgp=c(2,0.65,0),cex=0.9) smoothScatter(log(means),log(cv2))
but I have a hard time to interpret the information on the scatter plot. Can anyone point me out how to refine the above approach? Can anyone point me out how to lay out this procedure on the above expression data? Is there any Bioconductor package to do this kind of task? any idea to make this happen? thanks