Siggenes/SAM Error


D F ▴ 10


Hi,

I'm new to Bioconductor and R, and I apologize if this is absolutely obvious (as it must be); however, the issue escapes me. I wish to run SAM on a set of 5 replicates over two classes. I've loaded normalized, log-transformed expression data into a matrix (called data) and ran the commands below, which result in an error. Could you please advise on what the problem might be?

Thanks.
De

    > dim(data)
    [1] 22690     5
    > dim(gene_names)
    [1] 22690     1
    > rownames(data) <- as.matrix(gene_names)
    > sam.out <- sam(data, c(0,0,1,1,1))
    We're doing 10 complete permutations
    Error in d.perm[, i] <- sort(tmp$t.num/(tmp$t.denum + s0)) :
      number of items to replace is not a multiple of replacement length
    In addition: Warning message:
    There are 127 variables with zero variance. These variables are removed,
    and their d-values are set to NA.
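To make the warning in the session above concrete, here is a minimal base-R sketch of how one might list the rows that have zero variance across all samples. The matrix `expr` and its gene names are made up for illustration; whether such rows have anything to do with the error itself is not established in this thread.

```r
# Hypothetical toy data: 20 genes x 5 samples (a stand-in for the poster's `data`).
set.seed(42)
expr <- matrix(rnorm(20 * 5), nrow = 20,
               dimnames = list(paste0("gene", 1:20), NULL))

# Force two rows to be constant so that they have exactly zero variance.
expr[c(3, 7), ] <- 1.5

# Per-row variance, then the names of the rows whose variance is zero.
row_var  <- apply(expr, 1, var)
zero_var <- names(row_var)[row_var == 0]
print(zero_var)
```

On real data, `length(zero_var)` would be the number that the warning reports, assuming the package uses the same definition of zero variance.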


Written 15.8 years ago by D F • updated 15.8 years ago by Holger Schwender


Holger Schwender ▴ 900


Pretty strange bug. I will fix it in the next few days. The fix should then be available in the devel section of Bioconductor, in siggenes version 1.15.1 and later.

Best,
Holger
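A small sketch of how one could check whether an installed siggenes is at least the version named above. The helper `has_fix` is hypothetical (it is not part of siggenes); `package_version`, `packageVersion` and `requireNamespace` are base R.

```r
# Version named in the reply above as the first one expected to contain the fix.
fixed_in <- package_version("1.15.1")

# Hypothetical helper: TRUE if version string `v` is at least `fixed_in`.
has_fix <- function(v) package_version(v) >= fixed_in

# Only inspect the installed package if siggenes is actually available.
if (requireNamespace("siggenes", quietly = TRUE)) {
  installed <- as.character(utils::packageVersion("siggenes"))
  message("siggenes ", installed,
          if (has_fix(installed)) " is at or after " else " is before ",
          as.character(fixed_in))
}
```

If the installed version is older, updating through the usual Bioconductor installation route is probably the simplest way to pick up the fix.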

Holger Schwender • 15.8 years ago


How similar is siggenes to the Stanford SAM with the Excel front end? Thanks! Tom

Thomas Hampton • 15.8 years ago


Hi Tom,

not sure how often I have answered this question here and in other forums. But okay, once again: the defaults of siggenes and Excel SAM are a bit different.

- In siggenes, a moderated Welch's t-statistic is computed by default, whereas a moderated version of the ordinary t-statistic assuming equal group variances is used in Excel SAM. Set var.equal=TRUE in sam to use the ordinary t-statistic (only necessary if the sizes of the groups differ).

- In siggenes, the mean number of falsely called genes (warning: this is *not* the expected number of false positives) is computed, whereas the median number is used by Excel SAM. Set med=TRUE in sam to use the median number.

- In siggenes, an approach based on natural cubic splines is used to estimate pi0, whereas Excel SAM uses an ad hoc estimate. Set lambda=0.5 in sam to use this ad hoc estimate.

- Even though I have implemented the computation of the fudge factor s0 exactly as described in the Excel SAM manual, the value of s0 usually differs between siggenes and Excel. Not sure why.

- In siggenes, s0=0 is also a choice for the fudge factor. In the old version of Excel SAM it is not. Not sure about the new version.

- I have not found a description of how the q-values are estimated in Excel SAM. The q-values usually differ between siggenes and Excel. In siggenes, the computation of the q-values is implemented in virtually the same way as in John Storey's R package qvalue, so the q-values are typically the same. They only differ when there are tied p-values, since siggenes handles ties a bit differently (in my opinion more correctly) than John's function qvalue.

- The same seed for the random number generator will not lead to the same permutations of the response.

Best,
Holger
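To illustrate the first point in the list (and why var.equal only matters when the group sizes differ), here is a rough base-R sketch of a SAM-style statistic d = (difference in group means) / (s + s0), with either a Welch-type or a pooled-variance s. The function `moderated_d` and the simulated data are assumptions made for this example; this is not the siggenes code, and the sign convention in the package may differ.

```r
# Sketch only: SAM-style moderated statistic for a two-class design (classes 0 and 1).
moderated_d <- function(x, cl, s0 = 0, var.equal = FALSE) {
  g0 <- x[, cl == 0, drop = FALSE]
  g1 <- x[, cl == 1, drop = FALSE]
  n0 <- ncol(g0)
  n1 <- ncol(g1)
  v0 <- apply(g0, 1, var)   # per-gene variance in class 0
  v1 <- apply(g1, 1, var)   # per-gene variance in class 1
  s <- if (var.equal) {
    # Ordinary t-statistic: pooled variance, equal group variances assumed.
    pooled <- ((n0 - 1) * v0 + (n1 - 1) * v1) / (n0 + n1 - 2)
    sqrt(pooled * (1 / n0 + 1 / n1))
  } else {
    # Welch-type standard error: each group keeps its own variance.
    sqrt(v0 / n0 + v1 / n1)
  }
  (rowMeans(g1) - rowMeans(g0)) / (s + s0)
}

set.seed(1)
x <- matrix(rnorm(100 * 6), nrow = 100)   # simulated: 100 genes, 6 samples
balanced   <- c(0, 0, 0, 1, 1, 1)         # 3 vs 3
unbalanced <- c(0, 0, 1, 1, 1, 1)         # 2 vs 4

# With equal group sizes the two standard errors coincide algebraically.
same_when_balanced <- isTRUE(all.equal(
  moderated_d(x, balanced, s0 = 0.1),
  moderated_d(x, balanced, s0 = 0.1, var.equal = TRUE)))

# With unequal group sizes they generally do not.
differ_when_unbalanced <- !isTRUE(all.equal(
  moderated_d(x, unbalanced, s0 = 0.1),
  moderated_d(x, unbalanced, s0 = 0.1, var.equal = TRUE)))
```

In the original question the design was 2 vs 3 (`c(0,0,1,1,1)`), so the choice between the two statistics would matter there.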

Holger Schwender • 15.8 years ago


Thanks very much for your response. It helped a lot.

One of the things it points out to me (yet again) is that the essential algorithm has two parts. First, it makes sure that dinky variances don't create a bunch of "significant" genes out of small variance alone.

Second, SAM goes to substantial lengths trying to control the false positive rate without simply throwing all the significant genes out.

Regardless of tweaks in the defaults, either SAM (Excel) siggenes or samr from CRAN should do an excellent job of identifying differentially expressed genes.

On my data, SAM certainly does a great job compared to a t test. However, like Guo of MAQC fame, I get better concordance between replicates with a fold change and loose p value cutoff. This seems far too simple to be better than SAM, but there you have it. Bioc's RankProd also gives me good concordance. It does, however, order its gene lists by straight fold change...

Perhaps SAM lists ordered by fold change and cut off at a certain FDR would create optimal concordance. We do not really expect FDR to be more reproducible than fold change, do we?

Anyway, thanks for all your input into this list.

yours,

Tom
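The first part described above (small variances alone producing "significant" genes) can be seen with a little arithmetic. The numbers below are invented for illustration: one gene with a tiny difference and a tiny standard error, one with a large difference and a moderate standard error.

```r
# Invented numbers: mean difference and standard error for two genes.
mean_diff <- c(small = 0.05, large = 2.0)
std_err   <- c(small = 0.001, large = 0.5)

# Plain t-like statistic: the tiny-variance gene dominates (50 vs 4).
d_plain <- mean_diff / std_err

# Adding a fudge factor s0 to the denominator damps the tiny-variance gene.
s0    <- 0.2
d_mod <- mean_diff / (std_err + s0)

top_plain <- names(which.max(d_plain))
top_mod   <- names(which.max(d_mod))
```

Here `top_plain` is the gene with the negligible difference, while `top_mod` is the gene with the large difference. The value of s0 is arbitrary in this sketch; SAM chooses it from the data.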

Thomas Hampton • 15.8 years ago


First of all, siggenes has nothing to do with Excel SAM and samr. So "siggenes (Excel) SAM" is wrong. I implemented the original version of siggenes (a much worse version of the current siggenes) for my diploma thesis at the University of Dortmund, Germany, whereas both Excel SAM and samr were written by the guys from Stanford who proposed SAM.

If you would like to order your genes by the fold changes, you are free to do so. The output of sam, say sam.out, provides all the information that you need. For example, sam.out@fold contains the fold changes, sam.out@d the test statistics, sam.out@q.value the q-values, ... See ?SAM for all the slots sam.out has.

Holger
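A rough sketch of the ordering suggested above, written in base R so that it runs without siggenes. The vectors `fold` and `qval` are hypothetical stand-ins for what one would pull out of `sam.out@fold` and the q-value slot, and the example assumes the fold changes are ratios (so that `log2` makes up- and down-regulation comparable); check ?SAM for what the slots actually contain.

```r
# Hypothetical stand-ins for the fold changes and q-values of four genes.
fold <- c(geneA = 4.0,  geneB = 0.2,  geneC = 1.1,  geneD = 2.5)
qval <- c(geneA = 0.01, geneB = 0.03, geneC = 0.02, geneD = 0.20)

# Keep the genes that pass an (arbitrary) q-value cutoff ...
keep <- qval <= 0.05

# ... and rank the survivors by absolute log2 fold change, largest first.
ranked <- names(sort(abs(log2(fold[keep])), decreasing = TRUE))
print(ranked)
```

With these numbers, geneD is dropped by the cutoff and the remaining genes come out as geneB, geneA, geneC.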

Holger Schwender • 15.8 years ago


> On my data, SAM certainly does a great job compared to a t test. However, like Guo of MAQC fame, I get better concordance between replicates with a fold change and loose p value cutoff. This seems far too simple to be better than SAM, but there you have it. Bioc's RankProd also gives me good concordance. It does, however, order its gene lists by straight fold change...

I cringed a little when I read the MAQC's overall conclusion that (given replicates) FC or FC + loose p-values gives the best concordance between gene lists from replicate experiments, different pre-processing methods and/or different platforms. It seems to give permission to almost completely ignore statistical treatment of the data and go back to just using FC to identify differentially expressed genes! However, I don't think they adequately pointed out that "concordance between gene lists" doesn't really mean those genes are "all the truly DE genes" but rather "only those large FCers that can be detected no matter what measuring method or data pre-processing algorithm is used."

It's also worth pointing out that MAQC used TECHNICAL replicates only, and compared very disparate mRNA populations: total human RNA vs. human brain RNA, iirc. In real microarray experiments, one uses (hopefully!) biological replicates and compares mRNA populations that are much more similar. Of course, both of these just make it more difficult to detect differences, and the MAQC's recommendation to use FC to find "repeatable" lists would be even more applicable in this situation!

However, most researchers that I work with have way too simplistic a view of what "gene expression" actually is and how it is measured; they assume there is one "real" level of expression that indicates the "real" protein level, and that all methods should be able to measure this "real" level, else they are bad methods.

Case in point: I had a client who was testing a KO's effect on expression patterns. In the Affymetrix microarray data, the KO gene was actually UP-regulated in the KO samples! They of course got all upset over this because they had confirmed the KO with qPCR. It turns out they had only deleted two exons, and their PCR primers measured this region of the transcript. However, the Affy probes were for a different exon that was still being transcribed. The lack of functional protein likely caused the samples to up-regulate mRNA production.

The moral of the story: it is possible for different methods of measuring mRNA levels to give different answers, YET STILL BOTH BE CORRECT, because they are usually measuring different sections of the transcript with different methods. I tell this story every time a researcher tells me they can't "confirm" the microarray results via qPCR :)

So it's fine to use FC or FC + loose p-value if you only want to find genes with very large expression differences, and not other genes which may also have important expression differences. Genes with different expression patterns according to different measuring methods may be very interesting genes after all!

Jenny

Jenny Drnevich, Ph.D.
Functional Genomics Bioinformatics Specialist
W.M. Keck Center for Comparative and Functional Genomics
Roy J. Carver Biotechnology Center
University of Illinois, Urbana-Champaign

Jenny Drnevich • 15.8 years ago
