Filtering by variance, IQR, etc.

0

Entering edit mode

Mark W Kimpel ▴ 830

@mark-w-kimpel-2027

Last seen 9.6 years ago

I have been using what I consider to be non-biased filtering of low-variance genes using the method described in "Bioinformatics and Computational Biology Solutions using R and Bioconductor", R. Gentleman, et al., page 233 for some time and have recently run into some resistance from a colleague who claims that this type of filtering distorts FDR calculations because it introduces bias. His reasoning is that, since this method tends to filter out genes with higher p values and/or lower fold changes, that it is sort of a sneaky way of accomplishing just that. Of course, filtering by phenotype does introduce bias, but in this case I believe that by filtering based on the a priori assumption that we just aren't that interested in low variance genes for biologic reasons (even if statistically significant they will have very low fold changes and thus be of questionable meaning) that we aren't violating the statistical underpinnings of the analysis. I need some help in justifying this filtering step. Does anyone know of a peer-reviewed reference that gives a theoretical justification for its use of of any empiric experiments that show that it is legit? Thanks, Mark -- Mark W. Kimpel MD Neuroinformatics Department of Psychiatry Indiana University School of Medicine

• 2.9k views

ADD COMMENT • link updated 17.1 years ago by Jenny Drnevich ★ 2.2k • written 17.1 years ago by Mark W Kimpel ▴ 830

0

Entering edit mode

Jenny Drnevich ★ 2.2k

@jenny-drnevich-382

Last seen 9.6 years ago

An embedded and charset-unspecified text was scrubbed... Name: not available Url: https://stat.ethz.ch/pipermail/bioconductor/attachments/20070403/ 2ccb8be1/attachment.pl

ADD COMMENT • link 17.1 years ago Jenny Drnevich ★ 2.2k

0

Entering edit mode

A little over a week ago I posed the question of what effect, if any, the method of filtering out genes based on low levels of variance (as measured by IQR) has on the calculation of the FDR. I appreciated Jenny's comments (below) and read over the posts by Robert Gentleman and others of Feb, 2006 (see Jenny's note far below). Not being a theoritician, I decided to pursue some simulations of my own to better understand the process and have discovered, to my chagrin, that filtering by IQR does seem to cause a disturbance to the p value distribution and an under-reporting of the true FDR. Immediately below is are the results I obtained with my simulations and below that is a script, with functions, which can be used to duplicate my results in R. I would ask that interested parties review my script and results carefully and comment on my methods and my conclusions that it is perhaps better not to do the IQR filtering, but to accept the fact that if we do not do the filtering, we just have to live with a higher FDR. I am actually hoping that someone will prove me wrong, but I do believe that I am correct. I look forward to more healthy debate on this topic. Mark This output is tab-delimited and will look fine if pasted into Excel: parameter mean std. dev mean std. dev mean std. dev mean std. dev number of simulation runs 20.00 0.00 20.00 0.00 20.00 0.00 20.00 0.00 initial number of genes 30000.00 0.00 30000.00 0.00 30000.00 0.00 30000.00 0.00 initial percentage of significant genes 0.00 0.00 0.00 0.00 3.00 0.00 3.00 0.00 true number of significant genes 0.00 0.00 0.00 0.00 900.00 0.00 900.00 0.00 initial seeded fold change min. 1.00 0.00 1.00 0.00 1.10 0.00 1.10 0.00 initial seeded fold change max 1.00 0.00 1.00 0.00 2.25 0.00 2.25 0.00 number of arrays 12.00 0.00 12.00 0.00 12.00 0.00 12.00 0.00 initial mean expression 1000.04 0.20 1000.04 0.28 1010.06 0.19 1010.15 0.27 initial c.v. 0.13 0.00 0.13 0.00 0.13 0.00 0.13 0.00 initial actual fold change min. 1.00 0.00 1.00 0.00 1.00 0.00 1.00 0.00 initial actual fold change max. 1.39 0.03 1.40 0.03 2.68 0.09 2.66 0.08 IQR percentile of filter 0.00 0.00 25.00 0.00 0.00 0.00 25.00 0.00 post-fitler number of genes 30000.00 0.00 22501.00 0.00 30000.00 0.00 22501.00 0.00 post-filter mean expression 1000.04 0.20 998.47 0.31 1010.06 0.19 1011.91 0.27 post-filter c.v. 0.13 0.00 0.13 0.00 0.13 0.00 0.14 0.00 post-filter fold change min. 1.00 0.00 1.00 0.00 1.00 0.00 1.00 0.00 post-filter fold change max. 1.39 0.03 1.40 0.03 2.68 0.09 2.66 0.08 Storey q value FDR 0.20 0.00 0.20 0.00 0.20 0.00 0.20 0.00 sig. genes number 0.30 0.47 0.40 0.75 923.05 16.95 980.85 23.10 sig. genes mean expression NA NA NA NA 1309.78 5.56 1292.99 6.02 sig. genes c.v. NA NA NA NA 0.28 0.00 0.27 0.00 sig. genes fold change min. NA NA NA NA 1.10 0.01 1.12 0.01 sig. genes fold change max. NA NA NA NA 2.68 0.09 2.66 0.08 sig. genes true FDR NA NA NA NA 0.19 0.01 0.23 0.02 sig. genes true FNR NA NA NA NA 0.17 0.01 0.14 0.01 #Script to simulate the effects of IQR filtering on the FDR and positive gene characteristics # of a simulated dataset ###################################################################### ############################ ###################################################################### ########################### # describe perform t-testing with FDR correction, make histogram of p values, # determine true FDR, look at Fold Change (FC) characteristics of significant genes under # 4 different conditions: 1. no induced FC, no filtering 2. no induced FC, IQR filtering, # 3. induced FC, no IQR filtering, 4. induced FC, yes IQR filtering ###################################################################### ########################### #Input parameters #num.sim.genes = 30000 #number of simulated genes on a array #num.arrays = 12 #number of simulated arrays, must be even with n1=n2 for this 2 group simulation #mean.exprs = 1000 #mean simulated gene expression, will be normally distributed #coeff.var= 0.13, #coefficient of variation within and between genes #FC.range = c(1,1) #uniform range of fold change between groups #percent.genes.real.change = 0 #percentage of genes that have a simulated fold change #IQR.filt.per = 0 #percentage of genes to apply the IQR filter to, if any (zero means no genes filtered) #FDR.cutoff = 0.20 #BH FDR cutoff level for significance #num.runs = 20 #number of times the simulation is run to determine the std. dev. for each output value ###################################################################### ########################### #Output: a .pdf file of the p value distribution for the last simulation run and a tab-delim text file with .cvs file #extension (works with OpenOffice, you might want to change) that contains the following rows #Initial State #"number of simulation runs" #number of total simulations run for a given set of input parameters #"initial number of genes" #how many simulated genes on each array #"initial percentage of significant genes" #percentage of genes with simulated spike-in fold changes #"true number of significant genes" #number of genes with simulated spike-in fold changes #"initial seeded fold change min." #the minimum spiked-in fold change #"initial seeded fold change max" #the maximum spiked-in fold change #"number of arrays" #number of simulated arrays #"initial mean expression" #initial mean expression level #"initial c.v." #initial mean within-gene coeffecient of variation #"initial actual fold change min." #initial actual minimum fold change #"initial actual fold change max." #initial actual maximum fold change #Post-filtered state #"IQR percentile of filter" #percentage of the IQR distribution below which genes will be filtered out #"post-fitler number of genes" #number of remaining genes after IQR filtration #"post-filter mean expression" #mean expression of genes remaining after filtration #"post-filter c.v." #mean c.v. of genes remaining after filtration #"post-filter fold change min." #post-filtration actual minimum fold change #"post-filter fold change max." #post-filtration actual maximum fold change #Significant Genes #"BH value FDR" #set FDR threshold #"sig. genes number" #total number of genes found significant #"sig. genes mean expression" #mean expression of significant genes #"sig. genes c.v." #mean with-gene c.v. of signficant genes #"sig. genes fold change min." #minimum fold change of significant genes #"sig. genes fold change max." #maximum fold change of significant genes #"sig. genes true FDR" # the true FDR #"sig. genes true FNR" #the true FNR ############################################################## #LOAD THE FUNCTIONS BELOW BEFORE RUNNING THE FOLLOWING SIMULATIONS IQR.sim.func(num.sim.genes = 30000, num.arrays = 12, mean.exprs = 1000, coeff.var= 0.13, FC.range = c(1,1), IQR.filt.per = 0, FDR.cutoff = 0.20, percent.genes.real.change = 0, num.runs = 20) IQR.sim.func(num.sim.genes = 30000, num.arrays = 12, mean.exprs = 1000, coeff.var= 0.13, FC.range = c(1,1), IQR.filt.per = 25, FDR.cutoff = 0.20, percent.genes.real.change = 0, num.runs = 20) IQR.sim.func(num.sim.genes = 30000, num.arrays = 12, mean.exprs = 1000, coeff.var= 0.13, FC.range = c(1.1, 2.25), IQR.filt.per = 0, FDR.cutoff = 0.20, percent.genes.real.change = 3, num.runs = 20) IQR.sim.func(num.sim.genes = 30000, num.arrays = 12, mean.exprs = 1000, coeff.var= 0.13, FC.range = c(1.1, 2.25), IQR.filt.per = 25, FDR.cutoff = 0.20, percent.genes.real.change = 3, num.runs = 20) ############################################################## ############################################################## ############################################################## ### ### END OF SIMULATION ### ############################################################## ############################################################## ############################################################## ###################################################################### ########################### # SIMULATION FUNCTIONS ###################################################################### ########################### IQR.sim.func <- function(num.sim.genes, num.arrays, mean.exprs, coeff.var, FC.range, IQR.filt.per, FDR.cutoff, percent.genes.real.change, num.runs) { #a couple of utility functions to use later on #fold change calculator abs.fc.calc.func <- function(m1,m2){ fc <- mean(m2)/mean(m1); if (fc < 1) {fc <- 1/fc} ; round(fc, digits = 2); fc} #coefficient of variation calculator cv.func <- function(vec){cv <- sd(vec)/mean(vec); cv} for (run.num in 1:num.runs) { ###################################################################### ############################ #construct matrix of random normal numbers cv <- coeff.var mat <-abs(matrix(rnorm((num.sim.genes * num.arrays), mean = mean.exprs, sd = (mean.exprs * cv)), nrow = num.sim.genes, ncol = num.arrays)) # "Seed" the matrix with genes with uniform dist. of fold changes of range min.change-max change per.seed <- percent.genes.real.change #percentage of genes seeded genes.seed <- (per.seed/100) * num.sim.genes #number of genes seeded #marker for which genes are "true positives"(TRUE) and which are "true negatives"(FALSE) genes.seed.logical <- c(rep(TRUE, genes.seed), rep(FALSE,(num.sim.genes - genes.seed))) min.change <- FC.range[1] max.change <- FC.range[2] sim.FC <- seq(from = min.change, to = max.change, length.out = genes.seed) if (genes.seed != 0) { for (i in 1:genes.seed) # Do seeding { mat[i,1:(num.arrays/2)] <- sim.FC[i] * mat[i, 1:(num.arrays/2)]} } # Characteristics of all genes mean.exprs.all <- mean(mat) mean.cv.exprs.all <-mean(apply(mat, 1, cv.func)) fc.vec.all <- rep(NA, nrow(mat)) for (i in 1:nrow(mat)){fc.vec.all[i] <- abs.fc.calc.func(mat[i,1:(num.arrays/2)], mat[i,((num.arrays/2) + 1):num.arrays])} range.fc.all <- quantile(fc.vec.all, p = seq(0,1)) #IQR filtering IQR.cutoff <- IQR.filt.per #the percentile of IQR below which will be filtered out IQR.vec <- apply(log2(mat), 1, IQR) #compute IQR for each row of mat ("gene"), needs to be log2 IQR.per <- 100*rank(IQR.vec)/length(IQR.vec) #compute IQR percentile for each "gene" IQR.filt <- IQR.per >= IQR.cutoff #T/F filter mat <- mat[IQR.filt,] genes.seed.logical <- genes.seed.logical[IQR.filt] # Characteristics of post-filtered genes mean.exprs.filtered <- mean(mat) mean.cv.exprs.filtered <- mean(apply(mat, 1, cv.func)) fc.vec.filtered <- rep(NA, nrow(mat)) for (i in 1:nrow(mat)){fc.vec.filtered[i] <- abs.fc.calc.func(mat[i,1:(num.arrays/2)], mat[i,((num.arrays/2) + 1):num.arrays])} range.fc.filtered <- quantile(fc.vec.filtered, p = seq(0,1)) # Do t-testing p.vec <- rep(NA, nrow(mat)) for (i in 1:length(p.vec)) { p.vec[i] <- t.test(x = mat[i,1:(num.arrays/2)], y = mat[i,((num.arrays/2) + 1):num.arrays], alternative = "two.sided", var.equal = TRUE)$p.value } hist(p.vec, breaks = 100, main=paste("Histogram of p values. IQR.filt.per = ", IQR.filt.per, ". FC.range is ", FC.range[1], " to ", FC.range[2], sep="") ,xlab="p value") require(geneplotter, quiet=T) savepdf(fn = paste("hist.p.values.IQR.filt.per", IQR.filt.per, "FC.range.", FC.range[1], FC.range[2], sep="_"), dir = getwd()) # FDR correction using BH method p.adj.vec <- p.adjust(p.vec, method = "BH") sig.filt <- p.adj.vec <= FDR.cutoff tot.sig <- sum(sig.filt) #total of sig. genes tot.true.neg.sig <- sum(sig.filt & !genes.seed.logical) #true negatives in genes called sig. true.FDR <- tot.true.neg.sig/tot.sig tot.true.pos.not.sig <- sum(!sig.filt & genes.seed.logical) #true positives in genes called not sig. true.FNR <- tot.true.pos.not.sig/sum(genes.seed.logical) #characteristics of total positive genes if (sum(sig.filt) > 1) { mat <- mat[sig.filt,] mean.exprs.sig <- mean(mat) mean.cv.exprs.sig <- mean(apply(mat, 1, cv.func)) fc.vec.sig <- rep(NA, nrow(mat)) for (i in 1:nrow(mat)){fc.vec.sig[i] <- abs.fc.calc.func(mat[i,1:(num.arrays/2)], mat[i,((num.arrays/2) + 1):num.arrays])} range.fc.sig <- quantile(fc.vec.sig, p = seq(0,1)) } else { mean.exprs.sig <- NA; mean.cv.exprs.sig <- NA; range.fc.sig <-rep(NA,2)} ############################################################# ############################################################ # Create output datastructure if (run.num == 1) { parameter <- c( #Initial State "number of simulation runs", "initial number of genes", "initial percentage of significant genes", "true number of significant genes", "initial seeded fold change min.", "initial seeded fold change max", "number of arrays", "initial mean expression", "initial c.v.", "initial actual fold change min.", "initial actual fold change max.", #Post-filtered state "IQR percentile of filter", "post-fitler number of genes", "post-filter mean expression", "post-filter c.v.", "post-filter fold change min.", "post-filter fold change max.", #Significant Genes "BH FDR", "sig. genes number", "sig. genes mean expression", "sig. genes c.v.", "sig. genes fold change min.", "sig. genes fold change max.", "sig. genes true FDR", "sig. genes true FNR") out.mat <- matrix(ncol = 1, nrow = length(parameter)) } #Create output of i and bind to out.mat output.vec <- c(num.runs, num.sim.genes, per.seed, (num.sim.genes*per.seed/100), FC.range[1], FC.range[2], num.arrays, mean.exprs.all, mean.cv.exprs.all, range.fc.all[1], range.fc.all[2], #Post-filtered state IQR.filt.per, sum(IQR.filt), mean.exprs.filtered, mean.cv.exprs.filtered, range.fc.filtered[1], range.fc.filtered[2], #Significant Genes FDR.cutoff, tot.sig, mean.exprs.sig, mean.cv.exprs.sig, range.fc.sig[1], range.fc.sig[2], true.FDR, true.FNR) if (run.num ==1){out.mat[,1] <- output.vec} else {out.mat <- cbind(out.mat, output.vec)} print(paste("finished run ", run.num, sep = "")) } final.out.mat <- matrix(nrow = length(parameter), ncol = 3) final.out.mat[,1] <- parameter colnames(final.out.mat) <- c("parameter", "mean", "std. dev") for (i in 1:nrow(final.out.mat)) { err.out <- try(out.tmp <- sd(out.mat[i,]), TRUE) if(inherits(err.out, "try-error")) {final.out.mat[i,2:3]<-rep(NA,2)} else {final.out.mat[i,2:3]<-c(mean(out.mat[i,]), sd(out.mat[i,]))} } write.table(final.out.mat, "simulation.output.csv", sep="\t", col.names=T, row.names=F, append=T) } ############################################################## ############################################################## ############################################################## ### ### ### ### END OF R Functions ### ### ### ############################################################## ############################################################## ############################################################## Jenny Drnevich wrote: > Hi Mark, > > I also have been suspicious of filtering out low-variance genes. There > are 1-2 issues, depending on how you calculate the test statistic. The > first is the possible distortion of FDR calculations. On page 234 of the > BioC book you mention, Scholtens & Heydebreck state: "If the truly > differentially expressed genes are overrepresented among those selected > in the filtering step, the FDR associated with a certain threshold of > the test statistic will be lowered due to filtering." I believe this is > what your colleague was complaining about, as I can't see how filtering > based on variance could do anything except overrepresent DEG. But - is > this a good or bad thing? > > The second issue may not matter depending on how you calculate the test > statistic. If you use limma, which uses a Bayesian shrinkage based on > the variance of all genes, filtering out low-variance genes will lead to > a higher average gene variance, and hence less shrinkage. Scholtens & > Heydebreck also point this out on pg. 234: "Also concerning the > /variability across samples/, a higher overall variance of the > differentially expressed genes may be expected, because their > between-class variance adds to their within-class variance." I've > looked at this in some of my data sets, and the t values calculated by > eBayes() after filtering are generally lower than they were before > filtering. Now, the boost to FDR correction for a fewer number of genes > tends to cancel out the lower p-values, but not always. > > I also would be interested in any theoretical papers on this that > confirm or rebuke my intuitions. I myself have no problem doing a > conservative filtering to remove genes that likely aren't expressed in > any of the samples, but I don't filter based on variance because I use > the Bayesian correction in limma. There was a lengthy exchange on this > top in Feb 2006 (subject: Invalid fold-filter). Robert Gentleman > https://stat.ethz.ch/pipermail/bioconductor/2006-February/011988.html > said you can do simulations to convince yourself that filtering on > variance doesn't really bias the results, at least in regards to the > first issue I mention. The effects on the Bayesian shrinkage I don't > think have been addressed... > > Cheers, > Jenny > > At 11:36 PM 4/2/2007, Mark W Kimpel wrote: >> I have been using what I consider to be non-biased filtering of >> low-variance genes using the method described in "Bioinformatics and >> Computational Biology Solutions using R and Bioconductor", R. Gentleman, >> et al., page 233 for some time and have recently run into some >> resistance from a colleague who claims that this type of filtering >> distorts FDR calculations because it introduces bias. His reasoning is >> that, since this method tends to filter out genes with higher p values >> and/or lower fold changes, that it is sort of a sneaky way of >> accomplishing just that. Of course, filtering by phenotype does >> introduce bias, but in this case I believe that by filtering based on >> the a priori assumption that we just aren't that interested in low >> variance genes for biologic reasons (even if statistically significant >> they will have very low fold changes and thus be of questionable >> meaning) that we aren't violating the statistical underpinnings of the >> analysis. >> >> I need some help in justifying this filtering step. Does anyone know of >> a peer-reviewed reference that gives a theoretical justification for its >> use of of any empiric experiments that show that it is legit? >> >> Thanks, >> Mark >> -- >> Mark W. Kimpel MD >> Neuroinformatics >> Department of Psychiatry >> Indiana University School of Medicine >> >> _______________________________________________ >> Bioconductor mailing list >> Bioconductor at stat.math.ethz.ch >> https://stat.ethz.ch/mailman/listinfo/bioconductor >> Search the archives: >> http://news.gmane.org/gmane.science.biology.informatics.conductor > > Jenny Drnevich, Ph.D. > > Functional Genomics Bioinformatics Specialist > W.M. Keck Center for Comparative and Functional Genomics > Roy J. Carver Biotechnology Center > University of Illinois, Urbana-Champaign > > 330 ERML > 1201 W. Gregory Dr. > Urbana, IL 61801 > USA > > ph: 217-244-7355 > fax: 217-265-5066 > e-mail: drnevich at uiuc.edu > -- Mark W. Kimpel MD Neuroinformatics Department of Psychiatry Indiana University School of Medicine

ADD REPLY • link 17.1 years ago Mark W Kimpel ▴ 830

Login before adding your answer.