Detection of differential expression using limma
1
0
Entering edit mode
@christian-eisen-3074
Last seen 7.1 years ago
Hello to all, first of all I have to apologize for posting this several times, oviously the web interface of my institute mail adress doesn't let me post via mailing list...whatever... I am quite new to the topic of microarray analysis but I have a solid background on R. I am running a one-color Array from Agilent for my diploma thesis and I am currently analyzing the data I got. I always thought that, when looking at the raw data, if a gene gives a low value (intensity) within one considered group and a high value (value) within the other group considered this particular gene can be regarded to be upregulated respectively downregulated, whatever comparison of the two groups is interesting. However I was proven wrong when looking at preprocessed data of such two genes. The now transformed data for the two genes doesn't differ at all or very much. Sure I know it is log2 transformed and processed but however, going from a 10-fold change down to a 0,6-fold change seems a little bit far off for me. There can't be that much noise in such a signal...or am I wrong? As I said maybe my limited knowledge just doesn't let me interpret the data apropriately therfore I would be grateful for any kind of help. I tried VSN normalization, quantile normalization and a simple log2 transformation of the raw data. For differential expression analysis I use linear models implemented in the limma package as described in the vignette for one-color data. Oddly enough these analysis always brings up genes which aren't differentially expressed at all between the two considered groups. Rather than that it lists genes as "significantly" differently expressed between the two considered groups which show a raw intensity data as well as normalized data fold-change of maximal 1.0. I really hope someone can help me clearify some of the issues mentioned above,since I really can't expect any help from my people here since they are even more unfamiliar than me. I appreciate any kind of input! Thanks a lot and best wishes Christian Christian Eisen Division of Experimental Medicine Ludwig-Maximilians-University Munich Marchioninistr. 15, 81377 Munich, Germany
0
Entering edit mode
@christian-eisen-3075
Last seen 7.1 years ago
Hello to all, first of all I have to apologize for posting this several times, oviously the web interface of my institute mail adress doesn't let me post via mailing list...whatever... I am quite new to the topic of microarray analysis but I have a solid background on R. I am running a one-color Array from Agilent for my diploma thesis and I am currently analyzing the data I got. I always thought that, when looking at the raw data, if a gene gives a low value (intensity) within one considered group and a high value (value) within the other group considered this particular gene can be regarded to be upregulated respectively downregulated, whatever comparison of the two groups is interesting. However I was proven wrong when looking at preprocessed data of such two genes. The now transformed data for the two genes doesn't differ at all or very much. Sure I know it is log2 transformed and processed but however, going from a 10-fold change down to a 0,6-fold change seems a little bit far off for me. There can't be that much noise in such a signal...or am I wrong? As I said maybe my limited knowledge just doesn't let me interpret the data apropriately therfore I would be grateful for any kind of help. I tried VSN normalization, quantile normalization and a simple log2 transformation of the raw data. For differential expression analysis I use linear models implemented in the limma package as described in the vignette for one-color data. Oddly enough these analysis always brings up genes which aren't differentially expressed at all between the two considered groups. Rather than that it lists genes as "significantly" differently expressed between the two considered groups which show a raw intensity data as well as normalized data fold-change of maximal 1.0. I really hope someone can help me clearify some of the issues mentioned above,since I really can't expect any help from my people here since they are even more unfamiliar than me. I appreciate any kind of input! Thanks a lot and best wishes Christian Christian Eisen Division of Experimental Medicine Ludwig-Maximilians-University Munich Marchioninistr. 15, 81377 Munich, Germany
0
Entering edit mode
On Mon, Oct 13, 2008 at 11:40 AM, Christian Eisen <mai at="" christianeisen.com=""> wrote: > Hello to all, > > first of all I have to apologize for posting this several times, oviously > the web interface of my institute mail adress doesn't let me post via > mailing list...whatever... > I am quite new to the topic of microarray analysis but I have a solid > background > on R. I am running a one-color Array from Agilent for my diploma thesis and > I am currently analyzing the data I got. > I always thought that, when looking at the raw data, if a gene gives a low > value > (intensity) within one considered group and a high value (value) within the > other group considered > this particular gene can be regarded to be upregulated respectively > downregulated, whatever > comparison of the two groups is interesting. > However I was proven wrong when looking at preprocessed data of such two > genes. > The now transformed data for the two genes doesn't differ at all or very > much. > Sure I know it is log2 transformed and processed but however, going from a > 10-fold change > down to a 0,6-fold change seems a little bit far off for me. > There can't be that much noise in such a signal...or am I wrong? > As I said maybe my limited knowledge just doesn't let me interpret the data > apropriately > therfore I would be grateful for any kind of help. > > I tried VSN normalization, quantile normalization and a simple log2 > transformation of the raw data. > For differential expression analysis I use linear models implemented in the > limma package > as described in the vignette for one-color data. > Oddly enough these analysis always brings up genes which aren't > differentially expressed at all > between the two considered groups. Rather than that it lists genes as > "significantly" differently expressed > between the two considered groups which show a raw intensity data as well as > normalized data fold-change of maximal 1.0. > > I really hope someone can help me clearify some of the issues mentioned > above,since > I really can't expect any help from my people here since they are even more > unfamiliar than me. > I appreciate any kind of input! > > Thanks a lot and best wishes > Christian Hi, Christian. How are you calculating fold change? When calculating fold change on the log scale, you need to subtract and not divide. If that isn't the problem, then you will need to be a bit more explicit about what you have done and what the numbers are. Providing some code, some sample output showing the problem, and your sessionInfo() are all important pieces to getting useful help. Sean
0
Entering edit mode
Hi Sean and everyone, thank you for your reply. I was aware that for log intensities the fold change is calculated via substraction. I have posted on example of intensity values for a ample gene in the two groups I am interessted in. I have listed all the respective values for that particular gene after transformation with the VSN or quantile method as well as simple log2 transformation of the raw data. hsa-let-7a (gene) 10.4093909361377 12.4491486453754 (log2 transformed data) 10.0910975543547 10.5456558587892 (VSN transformed data) 10.1711963483519 10.4027789253209 (quantile transformed data) 1360 5592 (raw data) Like mentioned previously I am using the limma package. I read the tab-deliminted files containing the raw data as advised by Gordon Smyth and Peter White in a previous post handling the same topic. mrawObj <- read.maimages(targets$FileName, columns = list(G = "gMedianSignal", Gb = "gBGUsed", R = "gProcessedSignal",Rb = "gBGMedianSignal"), annotation= c("ProbeName", "GeneName", "SystematicName","ControlType")) mnormObj <- mrawObj Then I perform normalization only on the green channel mnormObj$G <-normalizeBetweenArrays(mnormObj$G,method="quantile") mnormObj$G <- log2(mnormObj$G) After that I extract the transformed intensity data for the green channel into a new object and name the rows after the genes. As I have data for 16 arrays, this newly created object contains 16 columns, one for each array, and around 13.000 rows, one for each individual gene. preproc_data<-mnormObj_vsn$G rownames(preproc_data)<-mnormObj_vsn$genes$GeneName For identification of differentially expressed genes I proceed as described in the limma userguide by defining a model.matrix and a cont.matrix to set the comparisons that I want the program to compute. After this I fit a linear model to my data. fit <- lmFit(preproc_data, design) fit2 <- contrasts.fit(fit, cont.matrix) fit2 <- eBayes(fit2) and display the results for the different comparisons listed in the cont.matrix Below a sample output from the VSN transformed data > topTable(fit2, coef=1, adjust="BH") ID logFC t P.Value adj.P.Val B ebv-miR-BART18-5p -2.788856 -3.925838 0.001719277 0.9999662 -4.501933 hsa-miR-671-5p 2.378257 3.875517 0.001891590 0.9999662 -4.503216 hsa-miR-663 2.308951 3.872746 0.001901578 0.9999662 -4.503287 hsa-miR-583 -1.509068 -3.758990 0.002361635 0.9999662 -4.506259 hsa-miR-671-5p 2.097918 3.661009 0.002848245 0.9999662 -4.508897 hsa-miR-671-5p 2.055704 3.576170 0.003351448 0.9999662 -4.511239 hsa-miR-663 2.354609 3.513237 0.003782239 0.9999662 -4.513012 hsa-miR-671-5p 2.136401 3.480265 0.004029907 0.9999662 -4.513952 hsa-miR-145* -3.167342 -3.452580 0.004250482 0.9999662 -4.514748 hsa-miR-373 1.602810 3.381800 0.004871484 0.9999662 -4.516809 so the strange thing is, when looking at the entry for the top hit indetified by limma anylsis you get something like this ebv-miR-BART18-5p 5.20945336562895 5.4093909361377 (log2 transformation) -0.971205336133305 1.69436870845910 (VSN transformed data) 5.19967234483636 5.31174831500496 (quantile transformation) 37 42.5 (raw data) Now you may say, sure this is clearly a relict of VSN normalization. No doubt about that but for the other methods the problem is still similar just the names of the genes are different and the fold change and p-Values. So as you see I don't understand why, for example the example above let-7a doesn't show up in the top differentially expressed gene list, while others, which a clearly not differentially expressed do. Thanks for any kind of advice helping me out on this!
0
Entering edit mode
On Mon, Oct 13, 2008 at 1:18 PM, Christian Eisen <christianeisen at="" alice-dsl.de=""> wrote: > Hi Sean and everyone, > > thank you for your reply. I was aware that for log intensities the fold > change is calculated via substraction. > I have posted on example of intensity values for a ample gene in the two > groups I am interessted in. > I have listed all the respective values for that particular gene after > transformation with the VSN or quantile method > as well as simple log2 transformation of the raw data. > > hsa-let-7a (gene) > 10.4093909361377 12.4491486453754 (log2 transformed data) > 10.0910975543547 10.5456558587892 (VSN transformed data) > 10.1711963483519 10.4027789253209 (quantile transformed data) > 1360 5592 (raw data) > > Like mentioned previously I am using the limma package. > I read the tab-deliminted files containing the raw data as advised by Gordon > Smyth and Peter White in a > previous post handling the same topic. > > mrawObj <- read.maimages(targets$FileName, > columns = list(G = "gMedianSignal", Gb = "gBGUsed", R = > "gProcessedSignal",Rb = "gBGMedianSignal"), > annotation= c("ProbeName", "GeneName", "SystematicName","ControlType")) > mnormObj <- mrawObj These arrays appear to be Agilent miRNA arrays. I think the Agilent recommendation is to use the Total Gene Signal for these arrays and not the processed signal--you'll want to read the image extraction software manual for the details. Also, normalization of miRNA arrays is not a solved problem because such a large proportion of the data show differential expression. I think there is a general feeling that methods like vsn might be "better" than more traditional methods, but I wouldn't say that there is a consensus on that and I do not have literature to support that claim (perhaps others do). There are several discussions in the email archive on normalization of miRNA. I would suggest looking back at them for other ideas. Sean > Then I perform normalization only on the green channel > > mnormObj$G <-normalizeBetweenArrays(mnormObj$G,method="quantile") > mnormObj$G <- log2(mnormObj$G) > > After that I extract the transformed intensity data for the green channel > into a new > object and name the rows after the genes. As I have data for 16 arrays, this > newly > created object contains 16 columns, one for each array, and around 13.000 > rows, > one for each individual gene. > > preproc_data<-mnormObj_vsn$G > rownames(preproc_data)<-mnormObj_vsn$genes$GeneName > > For identification of differentially expressed genes I proceed as described > in the limma userguide by defining a model.matrix and a cont.matrix to set > the > comparisons that I want the program to compute. > After this I fit a linear model to my data. > > fit <- lmFit(preproc_data, design) > fit2 <- contrasts.fit(fit, cont.matrix) > fit2 <- eBayes(fit2) > > and display the results for the different comparisons listed in the > cont.matrix > > Below a sample output from the VSN transformed data > >> topTable(fit2, coef=1, adjust="BH") > ID logFC t > P.Value adj.P.Val B > ebv-miR-BART18-5p -2.788856 -3.925838 0.001719277 > 0.9999662 -4.501933 > hsa-miR-671-5p 2.378257 3.875517 0.001891590 > 0.9999662 -4.503216 > hsa-miR-663 2.308951 3.872746 0.001901578 > 0.9999662 -4.503287 > hsa-miR-583 -1.509068 -3.758990 0.002361635 > 0.9999662 -4.506259 > hsa-miR-671-5p 2.097918 3.661009 0.002848245 > 0.9999662 -4.508897 > hsa-miR-671-5p 2.055704 3.576170 0.003351448 > 0.9999662 -4.511239 > hsa-miR-663 2.354609 3.513237 0.003782239 > 0.9999662 -4.513012 > hsa-miR-671-5p 2.136401 3.480265 0.004029907 > 0.9999662 -4.513952 > hsa-miR-145* -3.167342 -3.452580 0.004250482 > 0.9999662 -4.514748 > hsa-miR-373 1.602810 3.381800 0.004871484 > 0.9999662 -4.516809 > > so the strange thing is, when looking at the entry for the top hit > indetified by limma anylsis you get something like this > > ebv-miR-BART18-5p 5.20945336562895 5.4093909361377 (log2 > transformation) > -0.971205336133305 1.69436870845910 (VSN transformed data) > 5.19967234483636 5.31174831500496 (quantile transformation) > 37 42.5 (raw data) > > Now you may say, sure this is clearly a relict of VSN normalization. No > doubt about that > but for the other methods the problem is still similar just the names of the > genes are different > and the fold change and p-Values. > > So as you see I don't understand why, for example the example above let-7a > doesn't show up > in the top differentially expressed gene list, while others, which a clearly > not differentially expressed > do. > > Thanks for any kind of advice helping me out on this! > > _______________________________________________ > Bioconductor mailing list > Bioconductor at stat.math.ethz.ch > https://stat.ethz.ch/mailman/listinfo/bioconductor > Search the archives: > http://news.gmane.org/gmane.science.biology.informatics.conductor >
0
Entering edit mode
Yes you are right, this is data from a Agilent miRNA microarray. I have studied the Agilent protocol and the TotalGeneSignal is a processed signal, background substracted and kind of normalized, they didn't specify it what method they are using or I just didn't read it properly... Well I am not using the processedSignal, I only loaded it in the red channel because read.images from limma doesn't work otherwise. So this is a way around. All subsequent steps like normalization etc are just performed on the green channel. I know that there is still debate about the proper normalization method for miRNA data. There is a paper dealing with that perticular subject Davidson T.S., Johnson C.D. and Andruss B.F. "Analyzing micro-RNA expression using microarrays" Methods in Enzymology 411(1):14-34, 2006. However they are suggesting to do a VSN normalization. Previous studies used the median normalization method, however I thought that median normalization is just performed for "within-array normalization" and not for "between-array normalization". I may as well be wrong, like I said I am not really familiar with the topic of microarray processing. Nevertheless I haven't found a method to process my data using median normalization as all methods available use both channels, red and green. But if anybody knows how to do median normalization on one-color data I am happy to learn. As far as I can interpret the data I think that VSN is not working that well as it estimates a ML-estimator form the data available. And data point lower than the estimator get a negative value. So for the replicate spot of the genes if one of them gets a lower value than the rest of the replicates, it ends up getting a negative value and voil?...significance...even though the data doesn't show it... Wolfgang Huber suggested in another thread to do the VSN normalization just for the spike-in controls and the use the estimators derived from that on the rest of the data. But this seems to be a heavy task and I spent my whole day on studying the vignettes and other data on google concerning this topic but I just have no clue how to do it. So if I can't figure that out I will probably stick to the log2 transformed data as it shows at least reasonable hits in the results.