Almost inexisting overlap of diff. expr. genes found when comparing mas5 / rma

0

Entering edit mode

Emmanuel Levy ▴ 270

@emmanuel-levy-1240

Last seen 6.4 years ago

An embedded and charset-unspecified text was scrubbed... Name: not available Url: https://stat.ethz.ch/pipermail/bioconductor/attachments/20050708/ 33ed5be4/attachment.pl

• 1.1k views

ADD COMMENT • link updated 18.8 years ago by Adaikalavan Ramasamy ★ 1.8k • written 18.8 years ago by Emmanuel Levy ▴ 270

0

Entering edit mode

James W. MacDonald 65k

@james-w-macdonald-5106

Last seen 57 minutes ago

United States

Emmanuel Levy wrote: > Dear Bioconductor community, > > I've been looking for differentially expressed genes in C. elegans after a > drug treatment. > There are 3 replicates of each condition and 2 conditions in total (WT and > Drug) > I used limma combined with either rma or mas5. I find a very very poor > overlap in the results: > > - example (i) only 24 of the 100 most differentially expressed genes > obtained using rma are found in > the 1000 most differentially expressed genes obtained using mas5 > - example (ii) only 183 genes are common to the lists of the 1000 most > differentially expressed genes > found using both methods. > (see piece of code at the end) Unfortunately, this is a very common result. We recently did a study of 7 different methods for Affy data, and found very poor overlap in the set of significant genes. http://www.ncbi.nlm.nih.gov/entrez/query.fcgi?cmd=Retrieve&db=pubmed&d opt=Abstract&list_uids=15705192&query_hl=1 One problem with microarray data is the lack of 'true' measurements that can be used to objectively assess the results of any given method. Instead we are forced to judge the results based on ideas that may not be easily defended. For instance, in the above paper, we compared two different sample types using either t-tests or a Wilcoxon rank sum, and chose the method that gave the most 'differentially expressed' genes at the lowest false discovery rate. I don't think you would have to argue very strenuously that this doesn't really prove one method is better than another. We did this analysis because my colleagues argue against using the Affymetrix spike-in data to assess a method because you can always 'tune' a method to work best with the spike-in data, without having any proof that it works well at all with 'real' data. The only way I know to objectively test the different methods would be to take some samples, randomly select many (where many == thousands) genes to test using an agreed upon 'gold standard' (qRT-PCR, most likely), then analyze the samples using Affy chips and see which method correlates best with the gold standard result. Probably only take US$10,000 or so to do. In the interim the only recourse as I see it is to pick a favorite method (based on something suitably intangible) and stick with it ;-D. Best, Jim > > Either > 1/ I am missing something which I would'nt be surprised of, as my expertise > is very limited. > > In that case I am sorry for pointing out something irrelevant and thank you > in advance for telling > me what I'm missing, > > 2/ The differences in the normalization methods are really at the origin of > the observed differences. > In that case, how can I know which method is the best for my case study? > Does a helpful paper exists > which explains in simple words the strengths/weaknesses of each method? > > Thank you very much in advance for your help, > > Emmanuel > > -------------------------------------- CODE > -------------------------------------- > library(affy) > library(limma) > > # Load data into Affybatch > data = ReadAffy(widget=T) > > # Background correction / normalization > eset.rma = rma(data) > eset.mas = mas5(data) > > # Get Expression values > exp.rma = exprs(eset.rma) > exp.mas = exprs(eset.mas) > > # --- Look for differentially expressed genes using Limma package > strain = c("WT","WT","WT","Drug","Drug","Drug") > design = model.matrix(~factor(strain)) > colnames(design) = c("WT","Drug") > > fit.rma = lmFit(eset.rma,design) > fit.mas = lmFit(eset.mas,design) > > fit.rma.2 = eBayes(fit.rma) > fit.mas.2 = eBayes(fit.mas) > > top.rma = as.numeric(rownames(topTable(fit.rma.2,n=1000))) > top.mas = as.numeric(rownames(topTable(fit.mas.2,n=100))) > length(intersect(top.rma,top.mas)) > >>[1] 24 > > > top.rma = as.numeric(rownames(topTable(fit.rma.2,n=100))) > top.mas = as.numeric(rownames(topTable(fit.mas.2,n=1000))) > length(intersect(top.rma,top.mas)) > >>[1] 0 > > > [[alternative HTML version deleted]] > > _______________________________________________ > Bioconductor mailing list > Bioconductor at stat.math.ethz.ch > https://stat.ethz.ch/mailman/listinfo/bioconductor -- James W. MacDonald Affymetrix and cDNA Microarray Core University of Michigan Cancer Center 1500 E. Medical Center Drive 7410 CCGC Ann Arbor MI 48109 734-647-5623

ADD COMMENT • link 18.8 years ago James W. MacDonald 65k

0

Entering edit mode

Dear Emmanuel and Jim, have you found whether the common genes are those that have the lowest between-replicate variation in the arrays among your list, or maybe those that have a signal intensity range of say between 1000 and 10000 units? In my short experience with arrays and MAS5, I have been able to validate many array results in an biological model-dependent way, using qPCR, which tells me that MAS5 can provide some good data. I haven?t performed many analyses with RMA and haven?t validated them either, but, and correct me if I am wrong, if MAS5 tends to fail at low intensities and at high intensities there is great variation, and RMA performs better at low intensities and probably similarly to MAS5 at medium to hight intensities, might the common genes be in that medium to high intensity range? Regards, David > Emmanuel Levy wrote: > > Dear Bioconductor community, > > > > I've been looking for differentially expressed genes in C. elegans after a > > drug treatment. > > There are 3 replicates of each condition and 2 conditions in total (WT and > > Drug) > > I used limma combined with either rma or mas5. I find a very very poor > > overlap in the results: > > > > - example (i) only 24 of the 100 most differentially expressed genes > > obtained using rma are found in > > the 1000 most differentially expressed genes obtained using mas5 > > - example (ii) only 183 genes are common to the lists of the 1000 most > > differentially expressed genes > > found using both methods. > > (see piece of code at the end) > > Unfortunately, this is a very common result. We recently did a study of > 7 different methods for Affy data, and found very poor overlap in the > set of significant genes. > > http://www.ncbi.nlm.nih.gov/entrez/query.fcgi? cmd=Retrieve&db=pubmed&dopt=Abstract&list_uids=15705192&query_hl=1 > > One problem with microarray data is the lack of 'true' measurements that > can be used to objectively assess the results of any given method. > Instead we are forced to judge the results based on ideas that may not > be easily defended. > > For instance, in the above paper, we compared two different sample types > using either t-tests or a Wilcoxon rank sum, and chose the method that > gave the most 'differentially expressed' genes at the lowest false > discovery rate. I don't think you would have to argue very strenuously > that this doesn't really prove one method is better than another. > > We did this analysis because my colleagues argue against using the > Affymetrix spike-in data to assess a method because you can always > 'tune' a method to work best with the spike-in data, without having any > proof that it works well at all with 'real' data. > > The only way I know to objectively test the different methods would be > to take some samples, randomly select many (where many == thousands) > genes to test using an agreed upon 'gold standard' (qRT-PCR, most > likely), then analyze the samples using Affy chips and see which method > correlates best with the gold standard result. Probably only take > US$10,000 or so to do. > > In the interim the only recourse as I see it is to pick a favorite > method (based on something suitably intangible) and stick with it ;- D. > > Best, > > Jim > > > > > > > > Either > > 1/ I am missing something which I would'nt be surprised of, as my expertise > > is very limited. > > > > In that case I am sorry for pointing out something irrelevant and thank you > > in advance for telling > > me what I'm missing, > > > > 2/ The differences in the normalization methods are really at the origin of > > the observed differences. > > In that case, how can I know which method is the best for my case study? > > Does a helpful paper exists > > which explains in simple words the strengths/weaknesses of each method? > > > > Thank you very much in advance for your help, > > > > Emmanuel > > > > -------------------------------------- CODE > > -------------------------------------- > > library(affy) > > library(limma) > > > > # Load data into Affybatch > > data = ReadAffy(widget=T) > > > > # Background correction / normalization > > eset.rma = rma(data) > > eset.mas = mas5(data) > > > > # Get Expression values > > exp.rma = exprs(eset.rma) > > exp.mas = exprs(eset.mas) > > > > # --- Look for differentially expressed genes using Limma package > > strain = c("WT","WT","WT","Drug","Drug","Drug") > > design = model.matrix(~factor(strain)) > > colnames(design) = c("WT","Drug") > > > > fit.rma = lmFit(eset.rma,design) > > fit.mas = lmFit(eset.mas,design) > > > > fit.rma.2 = eBayes(fit.rma) > > fit.mas.2 = eBayes(fit.mas) > > > > top.rma = as.numeric(rownames(topTable(fit.rma.2,n=1000))) > > top.mas = as.numeric(rownames(topTable(fit.mas.2,n=100))) > > length(intersect(top.rma,top.mas)) > > > >>[1] 24 > > > > > > top.rma = as.numeric(rownames(topTable(fit.rma.2,n=100))) > > top.mas = as.numeric(rownames(topTable(fit.mas.2,n=1000))) > > length(intersect(top.rma,top.mas)) > > > >>[1] 0 > > > > > > [[alternative HTML version deleted]] > > > > _______________________________________________ > > Bioconductor mailing list > > Bioconductor at stat.math.ethz.ch > > https://stat.ethz.ch/mailman/listinfo/bioconductor > > > -- > James W. MacDonald > Affymetrix and cDNA Microarray Core > University of Michigan Cancer Center > 1500 E. Medical Center Drive > 7410 CCGC > Ann Arbor MI 48109 > 734-647-5623 > > _______________________________________________ > Bioconductor mailing list > Bioconductor at stat.math.ethz.ch > https://stat.ethz.ch/mailman/listinfo/bioconductor >

ADD REPLY • link 18.8 years ago kfbargad@lg.ehu.es ▴ 260

0

Entering edit mode

Dear David and BioC community, I'd like to be able to give ou an answer but unfortunatly I don't have enough expertise. Still I was wondering about something: there has been a paper out showing that the "no global change in expression" assumption widely used when normalising is not very accurate PMID 15837421 They basicaly use spike in data. Altough I agree with James on the following "" We did this analysis because my colleagues argue against using the Affymetrix spike-in data to assess a method because you can always 'tune' a method to work best with the spike-in data, without having any proof that it works well at all with 'real' data. """ In this study they spiked more that 1000 genes. I would tend to think that with such a high number the problem of 'tuning' James is pointed out before would be far less pronounced. Do you think that thousands of spiked genes will be part of the next generation of affymetrix arrays? Emmanuel

ADD REPLY • link 18.8 years ago Emmanuel Levy ▴ 270

0

Entering edit mode

Adaikalavan Ramasamy ★ 1.8k

@adaikalavan-ramasamy-675

Last seen 9.7 years ago

Yes we often see poor overlaps. A 40 - 50 % overlap is considered pretty good but rare unless you are considering the top 5 genes in both list or something silly like that. To make a fair comparison, try comparing the lists when they are both filtered by the same p-value cutoff or statistics rather than arbitrarily choosing a numbers. Further, two minor cosmetic points about your code 1) If you look at your design matrix from strain = c("WT","WT","WT","Drug","Drug","Drug") design = model.matrix(~factor(strain)) colnames(design) = c("WT","Drug") design WT Drug 1 1 1 2 1 1 3 1 1 4 1 0 5 1 0 6 1 0 the first column represents an intercept not WT. To get the correct interpretation, you need to change the second line to design = model.matrix(~ -1 + factor(strain) ) 2) You do not need the force the rownames to numeric using as.numeric() since intersect happily works with characters. x <- c("a", "b", "c") y <- c("b", "c", "d") intersect(x,y) [1] "b" "c" But I do not think either of these point change your results. On Fri, 2005-07-08 at 18:18 +0100, Emmanuel Levy wrote: > Dear Bioconductor community, > > I've been looking for differentially expressed genes in C. elegans after a > drug treatment. > There are 3 replicates of each condition and 2 conditions in total (WT and > Drug) > I used limma combined with either rma or mas5. I find a very very poor > overlap in the results: > > - example (i) only 24 of the 100 most differentially expressed genes > obtained using rma are found in > the 1000 most differentially expressed genes obtained using mas5 > - example (ii) only 183 genes are common to the lists of the 1000 most > differentially expressed genes > found using both methods. > (see piece of code at the end) > > Either > 1/ I am missing something which I would'nt be surprised of, as my expertise > is very limited. > > In that case I am sorry for pointing out something irrelevant and thank you > in advance for telling > me what I'm missing, > > 2/ The differences in the normalization methods are really at the origin of > the observed differences. > In that case, how can I know which method is the best for my case study? > Does a helpful paper exists > which explains in simple words the strengths/weaknesses of each method? > > Thank you very much in advance for your help, > > Emmanuel > > -------------------------------------- CODE > -------------------------------------- > library(affy) > library(limma) > > # Load data into Affybatch > data = ReadAffy(widget=T) > > # Background correction / normalization > eset.rma = rma(data) > eset.mas = mas5(data) > > # Get Expression values > exp.rma = exprs(eset.rma) > exp.mas = exprs(eset.mas) > > # --- Look for differentially expressed genes using Limma package > strain = c("WT","WT","WT","Drug","Drug","Drug") > design = model.matrix(~factor(strain)) > colnames(design) = c("WT","Drug") > > fit.rma = lmFit(eset.rma,design) > fit.mas = lmFit(eset.mas,design) > > fit.rma.2 = eBayes(fit.rma) > fit.mas.2 = eBayes(fit.mas) > > top.rma = as.numeric(rownames(topTable(fit.rma.2,n=1000))) > top.mas = as.numeric(rownames(topTable(fit.mas.2,n=100))) > length(intersect(top.rma,top.mas)) > > [1] 24 > > top.rma = as.numeric(rownames(topTable(fit.rma.2,n=100))) > top.mas = as.numeric(rownames(topTable(fit.mas.2,n=1000))) > length(intersect(top.rma,top.mas)) > > [1] 0 > > [[alternative HTML version deleted]] > > _______________________________________________ > Bioconductor mailing list > Bioconductor at stat.math.ethz.ch > https://stat.ethz.ch/mailman/listinfo/bioconductor >

ADD COMMENT • link 18.8 years ago Adaikalavan Ramasamy ★ 1.8k

Login before adding your answer.