siggenes SAM analysis: log2 transformation and understanding the output
@david-westergaard-5119
Hello,

I am currently working on a data set about kiwi consumption for my bachelor's project. The data is available at http://www.ebi.ac.uk/arrayexpress/experiments/E-MEXP-2030

I'm a bit confused as to how to interpret the output parameters, specifically p0. I've run the following code:

dataset <- read.table("OAS_RMA.txt", header=TRUE)
controls <- cbind(dataset$CEL12.1, dataset$CEL13.1, dataset$CEL23.1, dataset$CEL25.1, dataset$CEL37.1,
                  dataset$CEL59.1, dataset$CEL61.1, dataset$CEL78.1, dataset$CEL9.1, dataset$CEL92.1)
experiments <- cbind(dataset$CEL18.1, dataset$CEL21.1, dataset$CEL3.1, dataset$CEL31.1, dataset$CEL46.1,
                     dataset$CEL50.1, dataset$CEL56.1, dataset$CEL57.1, dataset$CEL7.1)

library('siggenes')
datamatrix <- matrix(cbind(controls, experiments), ncol=19)
y <- rep(0, 19)
y[11:19] <- 1
gene_names <- as.character(dataset$Hybridization.REF)
sam.obj <- sam(datamatrix, y, gene.names=gene_names, rand=12345)

Output:

SAM Analysis for the Two-Class Unpaired Case Assuming Unequal Variances

s0 = 0

Number of permutations: 100

MEAN number of falsely called variables is computed.

   Delta    p0    False Called    FDR cutlow cutup   j2    j1
1    0.1 0.634 28335.89  37013 0.4851 -1.058 0.354 9709 27372
2    0.5 0.634 11200.82  21273 0.3336 -2.271 0.910 2447 35850
3    0.9 0.634   249.38   1522 0.1038 -3.374 3.088  541 53695
4    1.3 0.634     9.67    134 0.0457 -4.402 5.577  127 54669
5    1.7 0.634     0.69     20 0.0219 -5.596   Inf   20 54676
6    2.1 0.634        0      1      0 -9.072   Inf    1 54676
7    2.5 0.634        0      1      0 -9.072   Inf    1 54676
8    2.9 0.634        0      1      0 -9.072   Inf    1 54676
9    3.3 0.634        0      1      0 -9.072   Inf    1 54676
10   3.7 0.634        0      0      0   -Inf   Inf    0 54676

I'm using the rand parameter because the results seem to vary a bit. p0 is in this case 0.634, and I'm not sure how to interpret it. In the literature it is described as the "prior probability that a gene is not differentially expressed" - what does this mean exactly? Does it imply that there is a ~63% chance that the genes in question are actually NOT differentially expressed?

I've also found some sources saying that it is a good idea to log2 transform the data before inputting it into SAM. Does this still apply, and if so, why?

Best Regards,

David Westergaard
Undergraduate student
Technical University of Denmark
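For reference, a minimal sketch of how a log2 transformation (the subject of the last question) could be applied before calling sam(). Note that RMA-processed data, which the file name OAS_RMA.txt suggests, are normally already on the log2 scale, so the transformation should only be applied if the values are clearly linear-scale intensities; the size check and the +1 offset below are illustrative assumptions, not part of the original analysis.

# Only transform if the matrix still looks like linear-scale intensities
# (RMA output is typically already log2, roughly in the 0-16 range).
if (max(datamatrix, na.rm = TRUE) > 100) {
    datamatrix <- log2(datamatrix + 1)   # +1 guards against zeros
}
sam.obj <- sam(datamatrix, y, gene.names = gene_names, rand = 12345)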
@james-w-macdonald-5106
Hi David,

> I'm a bit confused as to how to interpret the output parameters, specifically p0. In the literature it is described as the "prior probability that a gene is not differentially expressed" - what does this mean exactly? Does it imply that there is a ~63% chance that the genes in question are actually NOT differentially expressed?

It means that about 63% of your genes appear to be not differentially expressed. So if you choose a gene at random, there is a 63% probability that you will choose one that isn't differentially expressed.

However, depending on the value of Delta that you choose, the expectation is that a far smaller percentage of the genes you select will be false positives. In other words, you are trying to grab genes with a higher probability of differential expression, and you are then estimating what percentage of those genes are still likely false positives (e.g., if you choose a Delta of 1.3, you will get 134 significant genes and will expect that about 10 of those will be false positives).

> I've also found some sources saying that it is a good idea to log2 transform the data before inputting it into SAM. Does this still apply, and if so, why?

This is because the t-test is based on means, which are not very robust to outliers. Gene expression data tend to have a strong right skew, meaning that most of the data are within a certain range, but there are some values much higher. Taking logs tends to minimize the skew, so the large values have less of an effect (on the linear scale, expression values range from 0 to about 65,000; on the log2 scale, from 0 to about 16). It doesn't matter what base you use, but people have tended to use log base 2 because then a difference of 1 indicates a two-fold difference on the linear scale.

Best,

Jim
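To see how the p0 estimate feeds into the FDR column, the tabled FDR values can be recomputed as p0 times the expected number of falsely called genes ("False") divided by the number of genes called ("Called"). A short sketch using the first five rows of the output above (the small discrepancies in the early rows come from p0 being shown rounded to three decimals):

# SAM FDR = p0 * (expected number of false calls) / (number of genes called)
p0     <- 0.634
false  <- c(28335.89, 11200.82, 249.38, 9.67, 0.69)
called <- c(37013, 21273, 1522, 134, 20)
round(p0 * false / called, 4)
# approximately 0.4851 0.3336 0.1038 0.0457 0.0219, matching the FDR column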
Hi Jim,

Thank you for your thorough answers.

Just to make it absolutely clear: p0 is the chance that a randomly picked gene is not differentially expressed, when picking a gene from the WHOLE data set?

Secondly, should I be wary of p0 values below/above some threshold? I am working on multiple data sets, and p0 seems to vary from 0.39 to 0.80.

Lastly, is there an option to specify an FDR threshold? I am running through multiple data sets, and I would like to automate the analysis instead of having to look at a table for each one individually.

Best,
David
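For a fixed Delta picked from the table, the corresponding genes can also be pulled out of the SAM object directly rather than read off the printed table. A minimal sketch using Delta = 1.3 from the output above (summary() and list.siggenes() are siggenes functions; check their help pages for the exact arguments in your version):

# Detailed summary for one Delta value
summary(sam.obj, 1.3)

# Names of the genes called significant at Delta = 1.3
sig.genes <- list.siggenes(sam.obj, delta = 1.3)
length(sig.genes)   # should be 134 for the table shown above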
Hi David,

> Just to make it absolutely clear: p0 is the chance that a randomly picked gene is not differentially expressed, when picking a gene from the WHOLE data set?

Yes. In Bayesian statistics there is the concept of a prior, which is the estimate for a particular parameter that one usually comes up with by using prior knowledge of the underlying process being studied, the general idea being that your new observations shouldn't stray too far from this prior estimate. In SAM, as in limma, what we are doing is estimating the prior from all genes under consideration (hence it is called an empirical Bayes method). Since we don't really have prior knowledge of the underlying process, the idea is to compute some estimate using all genes and use that as our prior. So the prior in your case is 0.63, which is the estimated proportion of genes in your study that are not differentially expressed. By definition, then, if you randomly choose a gene, there is an estimated 63% probability that it isn't differentially expressed.

> Secondly, should I be wary of p0 values below/above some threshold? Lastly, is there an option to specify an FDR threshold?

I wouldn't worry too much about the p0 values, unless they vary from what you expect. For instance, if you are doing an experiment where you expect lots of genes to be affected and you get a p0 value of 0.8, then you might wonder why siggenes disagrees with what you expect. But that probably points to a problem with your expectation or with the samples/data.

I don't see how you could automate things with siggenes. It is designed to be run interactively.

<shameless plug> I hate to point people away from siggenes, because I think Holger Schwender has done a great job with the package and continues to support it, many years after finishing his Masters degree. However, I have worked for years in microarray core facilities and have a whole suite of functions designed to automate the processing of microarray analyses. The catch here is that these functions rely on the limma package, which estimates FDR using p.adjust() rather than via permutation. So if you are interested in automation, particularly automating the annotation/output side of things, take a look at affycoretools. </shameless plug>

Best,

Jim
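Since limma and affycoretools are only mentioned by name here, the following is a hypothetical sketch of what the corresponding two-group limma analysis of the same matrix could look like; the group labels, the coefficient index, and the 0.05 cutoff are assumptions for illustration, not part of the original exchange.

library(limma)

# Two-group design: first 10 columns are controls, last 9 are kiwi samples
group  <- factor(c(rep("control", 10), rep("kiwi", 9)))
design <- model.matrix(~ group)

rownames(datamatrix) <- gene_names    # so results are labelled by probe set ID
fit <- lmFit(datamatrix, design)      # assumes datamatrix is on the log2 scale
fit <- eBayes(fit)

# All genes with a Benjamini-Hochberg adjusted p-value below 0.05
tab <- topTable(fit, coef = 2, number = Inf, adjust.method = "BH", p.value = 0.05)
nrow(tab)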
Hi Jim, hi David,

Thanks a lot, Jim, for answering these questions. Just one or two comments:

> > Lastly, is there an option to specify an FDR threshold? I am running through multiple data sets, and I would like to automate it, instead of having to look at a table for each one individually.
>
> I don't see how you could automate things with siggenes. It is designed to be run interactively.

There is a function called findDelta in siggenes which allows you to find the value of delta, and thus the set of genes, for which the FDR is below some specified threshold. So if you really want to automate your analysis by specifying a threshold for the FDR (which is actually not really the idea behind SAM), you can do this with findDelta.

(Alternatively, you can use the q-values that are computed for all genes -- available via the slot q.value, i.e. sam.obj@q.value -- to select all genes with a q-value, i.e. an FDR-adjusted p-value, below some threshold. Please note that the set of genes might differ slightly between the SAM FDR approach and the q-value approach, mainly because in the former the thresholds are not necessarily symmetric about the origin.)

> So if you are interested in automation, particularly automating the annotation/output side of things, take a look at affycoretools.

I don't think that this is a shameless plug -- just some nice hints to other great, helpful packages. I just wanted to note that you can also use the results of a limma analysis in siggenes: there is a function called limma2sam in siggenes which allows you to perform a SAM analysis with the limma statistic.

Moreover, SAM does not really rely on permutations. For example, all the functionality in siggenes for SNPs uses asymptotic null distributions. If you want to use the moderated t-statistic proposed in the original SAM paper, then you need permutations, but it is also possible to use the ordinary parametric t-test (which might make sense if you have a sufficient number of samples). I have not implemented this directly in siggenes, to avoid making things more confusing, but code for a parametric t-statistic is available in the siggenes vignette and can be extracted from it.

Best,
Holger
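A minimal sketch of the two selection routes described above; the fdr argument of findDelta() and the 0.05 cutoff are assumptions here, so check ?findDelta and the SAM class documentation in your siggenes version:

# Route 1: find the Delta value(s) at which the SAM FDR drops below 5%
findDelta(sam.obj, fdr = 0.05)

# Route 2: select genes by their q-values (FDR-adjusted p-values)
qvals     <- sam.obj@q.value
sig.genes <- gene_names[qvals < 0.05]
length(sig.genes)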
