Limma issues.
1
0
Entering edit mode
Marcus ▴ 150
@marcus-410
Last seen 9.6 years ago
Hi limma users. I have some general questions regarding the use of dupcor.series and gls.series functions. I use an experimental design that compares three different conditions to a fourth (which I will call a reference). So the design of the experiment looks something like: Cond 1 Cond2 Cond2 Ref (There are supposed to be double arrows between the different conditions and the reference.) The experiment is done in four (and between some conditions, five) biological replicates, so in total I have made 4 times 6 (dye-swap) = 24 hybridizations which compares all the conditions with the reference, plus four additional hybridizations which is a dye-swap between cond2 and reference plus cond3 and the reference. This gives in total 27 slides since one slide was excluded due to a terrible hybridization. Each slide has a duplicate spot. This data has been filtered using different filter cut-offs and the remaining data has been put into a matrix of M-values. The data has been normalized and dye-swapped. From this point I have made two different analysis paths. One analysis path was to divide the M-matrix into three different matrices which contains data from the different conditions and I continued using the matrices with the following command: dupcor.series(M.matrix.cond1,design,ndups=2,spacing=13824)->corr Where the design was something like: design <- c(1,1,1,1,1,1,1,1) The correlation is later used in the function: gls.series(M.matrix.cond1, design, ndups=2,spacing=13824,corr$cor)->fit and continuing with: eb<-ebayes(fit) the data from ebayes is later used in the toptable function: toptable(number=100,genelist=list[,],fit=fit,eb=eb,adjust="fdr") When I make this analysis I get a list of 82 genes for cond1 that has a B-value larger than zero. But if I use another strategy where I use all the slides in the same analysis and use a design matrix which looks something like: Con1 Con2 Con3 1 1 0 0 2 1 0 0 3 1 0 0 4 1 0 0 5 1 0 0 6 1 0 0 7 1 0 0 8 1 0 0 9 0 1 0 10 0 1 0 11 0 1 0 12 0 1 0 13 0 1 0 14 0 1 0 15 0 1 0 16 0 1 0 17 0 1 0 18 0 1 0 19 0 0 1 20 0 0 1 21 0 0 1 22 0 0 1 23 0 0 1 24 0 0 1 25 0 0 1 26 0 0 1 27 0 0 1 And make the function calls like the procedures above except the toptable where I specify the coef number to be the condition which I want to investigate. The thing is, when I use this approach I get more genes that have a b-value greater than zero, namely 110. My guess is that when I run the dupcor.series the correlations is based on all the slides in some way, since corr$cor is a lonely number and not a correlation for every comparison in the experimental design. The cor numbers from the different analysis pathways are not equal either. When I run the gls.series function I get a fit$sigma that is based on all the slides or? What is actually happening when the sigma is calculated. The sigma is also generally smaller when all the slides are included compared to the analysis path where I used one condition at a time, which I guess should be the case if all slides are included. The lods which is calculated in the ebayes function is larger then the lods which originates from the analysis path where I examined one condition at a time, so this result in more genes that is larger than zero. So to my questions (sorry for the long introduction): Is the design matrix correct when the experiment is performed the way it is? According to the user guide, this is an optional way when you have a reference design but since the result is deviating from the one comparison at a time I am not sure. If the design is correct, which of the two analysis pathways is the correct/better one? I dont consider myself as a statistician, more like a biologist, but I reckon the sigma factor as a crucial variable in these analyses. I changed the fit$sigma for the "one condition" analysis path to the sigma that originates from the analysis that contains all the slides and the number of genes that reach above B-value zero increased to 111 which is quite like the 110 that I got from the other analysis pathway. So I guess that the sigma factor is quite responsible for the B-value spectra. If I would like to compare, lets say the cond2 with cond3 can I then add a another column to the design matrix, and if so, how would it look like. During these analysis I have used R-version 1.8 and limma version 1.3.1 I would be very grateful for some response. Regards Marcus ********************************************************************** ********************* Marcus Gry Bj?rklund Royal Institute of Technology AlbaNova University Center Stockholm Center for Physics, Astronomy and Biotechnology Department of Molecular Biotechnology 106 91 Stockholm, Sweden Phone (office): +46 8 553 783 45 Fax: + 46 8 553 784 81 Visiting adress: Roslagstullsbacken 21, Floor 3 Delivery adress: Roslagsv?gen 30B
Pathways limma Pathways limma • 888 views
ADD COMMENT
0
Entering edit mode
@gordon-smyth
Last seen 4 hours ago
WEHI, Melbourne, Australia
Dear Marcus, Firstly, I think you'd find it helpful to read Section 7.3 in the Limma User's Guide which does address some of your questions. Everything in Section 7 was new in Limma 1.3.0 so you may not have seen it yet. Secondly, it is generally best to analyse all of your data at once. It is no surprise that you get more differentially expressed (DE) genes when using all of the data. Obviously with more data you can have more confidence in your results, hence a greater number of genes pass the threshold of being judged DE. What is happening specifically is that the gene-wise standard deviation is estimated more reliably when using all of your data, so there are more degrees of freedom associated with your tests, hence more results significant. Regards Gordon Some more specific responses below. At 07:51 AM 12/11/2003, Marcus wrote: >Hi limma users. > >I have some general questions regarding the use of dupcor.series and >gls.series functions. > >I use an experimental design that compares three different conditions to a >fourth (which I will call a reference). >So the design of the experiment looks something like: > > >Cond 1 Cond2 Cond2 > > > Ref > >(There are supposed to be double arrows between the different conditions >and the reference.) > >The experiment is done in four (and between some conditions, five) >biological replicates, so in total I have made 4 times 6 (dye-swap) = 24 >hybridizations which compares all the conditions with the reference, plus >four additional hybridizations which is a dye-swap between cond2 and >reference plus cond3 and the reference. This gives in total 27 slides >since one slide was excluded due to a terrible hybridization. Each slide >has a duplicate spot. > >This data has been filtered using different filter cut-offs and the >remaining data has been put into a matrix of M-values. The data has been >normalized and dye-swapped. > > From this point I have made two different analysis paths. > >One analysis path was to divide the M-matrix into three different matrices >which contains data from the different conditions and I continued using >the matrices with the following >command: dupcor.series(M.matrix.cond1,design,ndups=2,spacing=13824)->corr > >Where the design was something like: design <- c(1,1,1,1,1,1,1,1) > >The correlation is later used in the function: gls.series(M.matrix.cond1, >design, ndups=2,spacing=13824,corr$cor)->fit > >and continuing with: eb<-ebayes(fit) > >the data from ebayes is later used in the toptable >function: toptable(number=100,genelist=list[,],fit=fit,eb=eb,adjust="fdr") > >When I make this analysis I get a list of 82 genes for cond1 that has a >B-value larger than zero. > >But if I use another strategy where I use all the slides in the same >analysis and use a design matrix which looks something like: > > Con1 Con2 Con3 >1 1 0 0 >2 1 0 0 >3 1 0 0 >4 1 0 0 >5 1 0 0 >6 1 0 0 >7 1 0 0 >8 1 0 0 >9 0 1 0 >10 0 1 0 >11 0 1 0 >12 0 1 0 >13 0 1 0 >14 0 1 0 >15 0 1 0 >16 0 1 0 >17 0 1 0 >18 0 1 0 >19 0 0 1 >20 0 0 1 >21 0 0 1 >22 0 0 1 >23 0 0 1 >24 0 0 1 >25 0 0 1 >26 0 0 1 >27 0 0 1 > >And make the function calls like the procedures above except the toptable >where I specify the coef number to be the condition which I want to >investigate. The thing is, when I use this approach I get more genes that >have a b-value greater than zero, namely 110. > >My guess is that when I run the dupcor.series the correlations is based on >all the slides in some way, since corr$cor is a lonely number and not a >correlation for every comparison in the experimental design. The cor >numbers from the different analysis pathways are not equal either. > >When I run the gls.series function I get a fit$sigma that is based on all >the slides or? > >What is actually happening when the sigma is calculated. The sigma is also >generally smaller when all the slides are included compared to the >analysis path where I used one condition at a time, which I guess should >be the case if all slides are included. I don't think this can be correct. Sigma measures the spot to spot variability of measured log-ratios for a given gene. The whole-data estimate for each gene should be intermediate between the individual condition estimates. >The lods which is calculated in the ebayes function is larger then the >lods which originates from the analysis path where I examined one >condition at a time, so this result in more genes that is larger than zero. > >So to my questions (sorry for the long introduction): > >Is the design matrix correct when the experiment is performed the way it >is? According to the user guide, this is an optional way when you have a >reference design but since the result is deviating from the one comparison >at a time I am not sure. You have dye-swaps but your design matrices do not reflect this. Apart from this, your design may be correct. >If the design is correct, which of the two analysis pathways is the >correct/better one? Almost always, the better approach is to use all your data. >I dont consider myself as a statistician, more like a biologist, but I >reckon the sigma factor as a crucial variable in these analyses. I changed >the fit$sigma for the "one condition" analysis path to the sigma that >originates from the analysis that contains all the slides and the number >of genes that reach above B-value zero increased to 111 which is quite >like the 110 that I got from the other analysis pathway. So I guess that >the sigma factor is quite responsible for the B-value spectra. It is one factor that is used in the computation. >If I would like to compare, lets say the cond2 with cond3 can I then add a >another column to the design matrix, and if so, how would it look like. No, you do this using a contrast matrix. Please see Section 7.3 in the User's Guide. >During these analysis I have used R-version 1.8 and limma version 1.3.1 > >I would be very grateful for some response. > >Regards > >Marcus > > >********************************************************************* ********************** >Marcus Gry Bj?rklund > >Royal Institute of Technology >AlbaNova University Center >Stockholm Center for Physics, Astronomy and Biotechnology >Department of Molecular Biotechnology >106 91 Stockholm, Sweden > >Phone (office): +46 8 553 783 45 >Fax: + 46 8 553 784 81 >Visiting adress: Roslagstullsbacken 21, Floor 3 >Delivery adress: Roslagsv?gen 30B
ADD COMMENT

Login before adding your answer.

Traffic: 919 users visited in the last hour
Help About
FAQ
Access RSS
API
Stats

Use of this site constitutes acceptance of our User Agreement and Privacy Policy.

Powered by the version 2.3.6