Limma: how to combine duplicateCorrelation, dyeeffect and arrayweights?
1
0
Entering edit mode
@gordon-smyth
Last seen 6 minutes ago
WEHI, Melbourne, Australia
Dear Dorthe, The arrayWeights() function does allow you to combine array weights with spot weights. You are expected to combine the weight analysis with the most correct design matrix, so including a dye-swap effect in the design matrix is perfectly appropriate. So, no problems so far. On the other hand, any spot weight system which removes 70% of your data is bound to lead to problems down the track, whatever you do. Why do you feel you need to do this? Most of the spots which are flagged by GenePix are flagged simply because they are faint, rather than because of any other quality considerations. Since you are using normexp, which prevents low intensity spots giving over-variable log-ratios, there is no good reason for you to filter out these spots. In most case it would be better to simply ignore the GenePix flags, and keep all your spots in the analysis. If you feel uncomfortable with this, you could check the GenePix flags to see if any spots are flagged for reasons other than faintness (Flag < -50). But please don't filter out spots just because they are faint. This goes against all Bioconductor recommendations. Best wishes Gordon >Date: Wed, 7 Nov 2007 17:21:09 +0100 (CET) >From: dorthe.belgardt at medisin.uio.no >Subject: [BioC] Limma: how to combine duplicateCorrelation, dyeeffect > and arrayweights? >To: bioconductor at stat.math.ethz.ch > >Hi, > >I am quite insecure if some parts of the analyis I did in Limma are really >correct and I would highly appreciate if someone could take a look and >give advice. My main concern is that I may not use the >duplicatecorrelation, dyeeffect,arrayweights and spotweights correctly. > >The arrays I use are printed in duplicates with a spacing of 15000 (so >30000 features in total), and I did the imageprocessing in GenePixPro6.1. >Thereby I flagged all spots close to backgroundsignal and with a rgn r2 ><0.5 bad, and only 30% of my data remain unflagged. > >And this is what I did using Limma: > > > targets=readTargets("Targets_basicSat.txt") > > targets > SlideNumber FileName Cy3 Cy5 >1 1 3096_basicSat.gpr ref A >2 2 3079_basicSat.gpr A ref >3 3 3089_basicSat.gpr ref A >4 4 3081_basicSat.gpr A ref >5 5 3071_basicSat.gpr ref B >6 6 3082_basicSat.gpr B ref >7 7 3085_basicSat.gpr ref B >8 8 8268_basicSat.gpr B ref >9 9 7829_basicSat.gpr ref C >10 10 3086_basicSat.gpr C ref >11 11 7823_basicSat.gpr ref C >12 12 7826_basicSat.gpr C ref >13 13 3090_basicSat.gpr ref D >14 14 3091_basicSat.gpr D ref >15 15 3092_basicSat.gpr ref D >16 16 7827_basicSat.gpr D ref > >Every other slide is a dyeswapped technical replicate and per "group" >(A,B,C,D) there are 2 biological replicates. > > > K=read.maimages(targets$FileName, source="genepix.median", >wt.fun=wtflags(0)) > > types=readSpotTypes("SpottypesGAPDH.txt") > > Status=controlStatus(types, K) > > K$genes$Status=Status > > K3=backgroundCorrect(K, method=?normexp?, offset=50) > > K3=normalizeWithinArrays(K3, method="median") > > K3a=normalizeBetweenArrays(K3, method="quantile") > > design=modelMatrix(targets, ref="ref") > > design > A B C D > [1,] 1 0 0 0 > [2,] -1 0 0 0 > [3,] 1 0 0 0 > [4,] -1 0 0 0 > [5,] 0 1 0 0 > [6,] 0 -1 0 0 > [7,] 0 1 0 0 > [8,] 0 -1 0 0 > [9,] 0 0 1 0 >[10,] 0 0 -1 0 >[11,] 0 0 1 0 >[12,] 0 0 -1 0 >[13,] 0 0 0 1 >[14,] 0 0 0 -1 >[15,] 0 0 0 1 >[16,] 0 0 0 -1 > >Since I am expecting a non-negligible dyeeffect I created an other >designmatrix and the following contrastMatrix: > > >design1=cbind(DyeEffect=1, design) > >design.cont=makeContrasts("A", ?B?, ?A-B", levels=design1) > >Next I estimate the correlation of within-array-duplicates: > > >cor=duplicateCorrelation(K3b, design=design1, ndups=2, spacing=15000, >weights=K3b$weights) > >My first question is: is it correct to use here the designmatrix for the >dyeeffect (design1 in this case)? > >When fitting the linear model, I also want to use arrayweights, combined >with spotweights. So I gave following commands: > > > aw=arrayWeights(K3b, design=design1) > > w=matvec(K3b$weights, aw) > >Again the question: is it correct to use here the "design1"-matrix >considering the dyeeffect? > >Then I fit the linear model: > > >fit=lmFit(K3b, design=design1, ndups=2, spacing=15000, cor=cor$consensus, >weights=w) > >fit1=contrasts.fit(fit, design.cont) > >eb=eBayes(fit1) > >Another thing I am worried about is that taking into account the dyeeffect >plus arrayweights plus spotweights might be a bit "too much"? Like in a >way "overtransforming" my data? Especially since approx 70% of my data >have a spotweight of zero. Might it be better to use the spotweight of 0,1 >for bad spots, so that I do not loose the data completely? > >My apologies for this long email, I tried hard to find out the answers for >myself reading the limmaguide and lots of other documents I found >googleing, but still feel quite "stuck" in my analysis process. > >Thanks very much for any kind of help in advance! >Best regards >Dorthe > > >-- >Dorthe Belgardt >Institute of Basic Medical Sciences >Department of Physiology >P.O. Box 1103 Blindern >0317 Oslo >Norway
limma PROcess limma PROcess • 991 views
ADD COMMENT
0
Entering edit mode
@dorthebelgardtmedisinuiono-2468
Last seen 9.6 years ago
Dear Gordon, thank you very much for your reply. The reason why I filtered out so many spots is that the overall quality of my hybridizations is not that good, rather "okay". I felt uncomfortable keeping spots in the analysis which I considered not to be reliable. So I decided to filter out weak and inhomogenous spots and I was told that it is not uncommon to lose so many datapoints by filtering. I have another dataset of 2-colour-cDNA-arrays, where I applied the same filtering and also flagged about 70% of the spots per array. If I compare the topTables doing the analysis A) using spotweights and B) without using spotweights the overlap is close to 100%, meaning that the top100 genes I get using spotweights also show significant DE in the other analysis. On the other hand, if I use no spotweights and keep all the spots in the analysis process, I end up with genes in my toptables which I would consider to be not reliable and which received a negative flagging in GenePix. And as the topTables do not differ that much, I was wondering in what way a stringent filtering could "lead to problems down the track"? But beside that, I would be thankful for some more advice how to handle the duplicateCorrelation function - or better how to interpret its result. I ran two microarray experiments using basicly the same design, but one array is printed in duplicates, the second in singlets only. My design matrix looks like this: design A B C D [1,] 1 0 0 0 [2,] -1 0 0 0 [3,] 1 0 0 0 [4,] -1 0 0 0 [5,] 1 0 0 0 [6,] -1 0 0 0 [7,] 0 1 0 0 [8,] 0 -1 0 0 [9,] 0 1 0 0 [10,] 0 -1 0 0 [11,] 0 1 0 0 [12,] 0 -1 0 0 [13,] 0 0 1 0 [14,] 0 0 -1 0 [15,] 0 0 1 0 [16,] 0 0 -1 0 [17,] 0 0 1 0 [18,] 0 0 -1 0 [19,] 0 0 0 1 [20,] 0 0 0 -1 [21,] 0 0 0 1 [22,] 0 0 0 -1 [23,] 0 0 0 1 [24,] 0 0 0 -1 For the arrays printed in singlets I tried to estimate the correlation for replicated arrays, using: > biolrep=c(1,1,2,2,3,3,4,4,5,5,6,6,7,7,8,8,9,9,10,10,11,11,12,12) cor=duplicateCorrelation(FCmq, design=design, ndups=1, block=biolrep) FCmq contains the bg corrected, normalized data. The cor$consenus I get is: [1] 0.1396215 According to what I read in the LimmaGuide this value should be negative due to the dyeswap design. What do I do if get a postive value here? Interpret it that way that there seems to be no correlation and treat all the arrays as independent and simply skipp using the dupCor function? Something similar happens when I use the duplicate correlation on the other dataset for estimating the correlation of within-array- replicates. Using the same designMatrix, I get > cor$consenus [1] 0,2723845 The Limma guide says that this value should be greater than 0.4. What do I do if the correlation is below 0.4? And do I understand it correctly that the within-array-correlation is not "influenced" by the dyeswap design, so that I can expect to get a positive value here? I'd be very happy if I could get some help for these questions as well! Thanks a lot for your time and best regards, Dorthe > Dear Dorthe, > > The arrayWeights() function does allow you to combine array weights with spot weights. You are expected to combine the weight analysis with the most correct design matrix, so including a dye-swap effect in the design matrix is perfectly appropriate. So, no problems so far. > > On the other hand, any spot weight system which removes 70% of your data is bound to lead to problems down the track, whatever you do. Why do you feel you need to do this? > > Most of the spots which are flagged by GenePix are flagged simply because they are faint, rather than because of any other quality considerations. Since you are using normexp, which prevents low > intensity spots giving over-variable log-ratios, there is no good reason for you to filter out these spots. In most case it would be better to simply ignore the GenePix flags, and keep all your spots in the analysis. > > If you feel uncomfortable with this, you could check the GenePix flags to see if any spots are flagged for reasons other than > faintness (Flag < -50). But please don't filter out spots just > because they are faint. This goes against all Bioconductor > recommendations. > > Best wishes > Gordon > >>Date: Wed, 7 Nov 2007 17:21:09 +0100 (CET) >>From: dorthe.belgardt at medisin.uio.no >>Subject: [BioC] Limma: how to combine duplicateCorrelation, dyeeffect >> and arrayweights? >>To: bioconductor at stat.math.ethz.ch >>Hi, >>I am quite insecure if some parts of the analyis I did in Limma are >> really >>correct and I would highly appreciate if someone could take a look and give advice. My main concern is that I may not use the >>duplicatecorrelation, dyeeffect,arrayweights and spotweights correctly. The arrays I use are printed in duplicates with a spacing of 15000 (so 30000 features in total), and I did the imageprocessing in GenePixPro6.1. >>Thereby I flagged all spots close to backgroundsignal and with a rgn r2 <0.5 bad, and only 30% of my data remain unflagged. >>And this is what I did using Limma: >> > targets=readTargets("Targets_basicSat.txt") >> > targets >> SlideNumber FileName Cy3 Cy5 >>1 1 3096_basicSat.gpr ref A >>2 2 3079_basicSat.gpr A ref >>3 3 3089_basicSat.gpr ref A >>4 4 3081_basicSat.gpr A ref >>5 5 3071_basicSat.gpr ref B >>6 6 3082_basicSat.gpr B ref >>7 7 3085_basicSat.gpr ref B >>8 8 8268_basicSat.gpr B ref >>9 9 7829_basicSat.gpr ref C >>10 10 3086_basicSat.gpr C ref >>11 11 7823_basicSat.gpr ref C >>12 12 7826_basicSat.gpr C ref >>13 13 3090_basicSat.gpr ref D >>14 14 3091_basicSat.gpr D ref >>15 15 3092_basicSat.gpr ref D >>16 16 7827_basicSat.gpr D ref >>Every other slide is a dyeswapped technical replicate and per "group" (A,B,C,D) there are 2 biological replicates. >> > K=read.maimages(targets$FileName, source="genepix.median", >>wt.fun=wtflags(0)) >> > types=readSpotTypes("SpottypesGAPDH.txt") >> > Status=controlStatus(types, K) >> > K$genes$Status=Status >> > K3=backgroundCorrect(K, method=?normexp?, offset=50) >> > K3=normalizeWithinArrays(K3, method="median") >> > K3a=normalizeBetweenArrays(K3, method="quantile") >> > design=modelMatrix(targets, ref="ref") >> > design >> A B C D >> [1,] 1 0 0 0 >> [2,] -1 0 0 0 >> [3,] 1 0 0 0 >> [4,] -1 0 0 0 >> [5,] 0 1 0 0 >> [6,] 0 -1 0 0 >> [7,] 0 1 0 0 >> [8,] 0 -1 0 0 >> [9,] 0 0 1 0 >>[10,] 0 0 -1 0 >>[11,] 0 0 1 0 >>[12,] 0 0 -1 0 >>[13,] 0 0 0 1 >>[14,] 0 0 0 -1 >>[15,] 0 0 0 1 >>[16,] 0 0 0 -1 >>Since I am expecting a non-negligible dyeeffect I created an other designmatrix and the following contrastMatrix: >> >design1=cbind(DyeEffect=1, design) >> >design.cont=makeContrasts("A", ?B?, ?A-B", levels=design1) >>Next I estimate the correlation of within-array-duplicates: >> >cor=duplicateCorrelation(K3b, design=design1, ndups=2, spacing=15000, >>weights=K3b$weights) >>My first question is: is it correct to use here the designmatrix for the dyeeffect (design1 in this case)? >>When fitting the linear model, I also want to use arrayweights, combined with spotweights. So I gave following commands: >> > aw=arrayWeights(K3b, design=design1) >> > w=matvec(K3b$weights, aw) >>Again the question: is it correct to use here the "design1"-matrix considering the dyeeffect? >>Then I fit the linear model: >> >fit=lmFit(K3b, design=design1, ndups=2, spacing=15000, >> cor=cor$consensus, >>weights=w) >> >fit1=contrasts.fit(fit, design.cont) >> >eb=eBayes(fit1) >>Another thing I am worried about is that taking into account the >> dyeeffect >>plus arrayweights plus spotweights might be a bit "too much"? Like in a way "overtransforming" my data? Especially since approx 70% of my data have a spotweight of zero. Might it be better to use the spotweight of >> 0,1 >>for bad spots, so that I do not loose the data completely? >>My apologies for this long email, I tried hard to find out the answers >> for >>myself reading the limmaguide and lots of other documents I found googleing, but still feel quite "stuck" in my analysis process. >>Thanks very much for any kind of help in advance! >>Best regards >>Dorthe >>-- >>Dorthe Belgardt >>Institute of Basic Medical Sciences >>Department of Physiology >>P.O. Box 1103 Blindern >>0317 Oslo >>Norway > > -- Dorthe Belgardt Institute of Basic Medical Sciences Department of Physiology P.O. Box 1103 Blindern 0317 Oslo Norway
0
Entering edit mode
At 06:05 AM 26/11/2007, dorthe.belgardt at medisin.uio.no wrote: >Dear Gordon, > >thank you very much for your reply. The reason why I filtered out so many >spots is that the overall quality of my hybridizations is not that good, >rather "okay". I felt uncomfortable keeping spots in the analysis which I >considered not to be reliable. So I decided to filter out weak and >inhomogenous spots and I was told that it is not uncommon to lose so many >datapoints by filtering. Not in my kneck of the woods. >I have another dataset of 2-colour-cDNA-arrays, where I applied the same >filtering and also flagged about 70% of the spots per array. If I compare >the topTables doing the analysis A) using spotweights and B) without using >spotweights the overlap is close to 100%, meaning that the top100 genes I >get using spotweights also show significant DE in the other analysis. On >the other hand, if I use no spotweights and keep all the spots in the >analysis process, I end up with genes in my toptables which I would >consider to be not reliable and which received a negative flagging in >GenePix. And as the topTables do not differ that much, I was wondering in >what way a stringent filtering could "lead to problems down the track"? The problem is not that your filtering is stringent but rather that your filtering is not based on quality criteria. It is perfectly acceptable to filter out probes which appear not to be expressed in any experimental condition. This would have the effect that you want to achieve. But this means that you must remove those probes entirely from your analysis, not that you selectively remove some values for those probes and leave others in. If you remove individual spots merely because they are low intensity, this potentially leads to a variety of problems and biases. Suppose your example that a particular gene is expressed in WT but absent in the mutant. You might miss this important result entirely because you filtered out all the spots in the mutant. More pervasively, this selective filtering introduces biases into any statistical analysis of the affected genes, because the filtering is based on the intensity value itself. Removing all your low intensity spots prior to normalisation also plays havoc with loess normalisation, which expects to see the whole intensity range. The loess curve may now be uncertainly estimated. >But beside that, I would be thankful for some more advice how to handle the >duplicateCorrelation function - or better how to interpret its result. I >ran two microarray experiments using basicly the same design, but one >array is printed in duplicates, the second in singlets only. My design >matrix looks like this: > > design > A B C D > [1,] 1 0 0 0 > [2,] -1 0 0 0 > [3,] 1 0 0 0 > [4,] -1 0 0 0 > [5,] 1 0 0 0 > [6,] -1 0 0 0 > [7,] 0 1 0 0 > [8,] 0 -1 0 0 > [9,] 0 1 0 0 >[10,] 0 -1 0 0 >[11,] 0 1 0 0 >[12,] 0 -1 0 0 >[13,] 0 0 1 0 >[14,] 0 0 -1 0 >[15,] 0 0 1 0 >[16,] 0 0 -1 0 >[17,] 0 0 1 0 >[18,] 0 0 -1 0 >[19,] 0 0 0 1 >[20,] 0 0 0 -1 >[21,] 0 0 0 1 >[22,] 0 0 0 -1 >[23,] 0 0 0 1 >[24,] 0 0 0 -1 > > >For the arrays printed in singlets I tried to estimate the correlation for >replicated arrays, using: > > biolrep=c(1,1,2,2,3,3,4,4,5,5,6,6,7,7,8,8,9,9,10,10,11,11,12,12) >cor=duplicateCorrelation(FCmq, design=design, ndups=1, block=biolrep) >FCmq contains the bg corrected, normalized data. The cor$consenus I get is: >[1] 0.1396215 > >According to what I read in the LimmaGuide this value should be negative >due to the dyeswap design. What do I do if get a postive value here? >Interpret it that way that there seems to be no correlation and treat all >the arrays as independent and simply skipp using the dupCor function? Yes. >Something similar happens when I use the duplicate correlation on the >other dataset for estimating the correlation of within-array- replicates. >Using the same designMatrix, I get > > cor$consenus [1] 0,2723845 >The Limma guide says that this value should be greater than 0.4. I have checked through the guide just now but cannot find such a statement. Where is it? > What do I >do if the correlation is below 0.4? I suppose that you use what you have. > And do I understand it correctly that >the within-array-correlation is not "influenced" by the dyeswap design, so >that I can expect to get a positive value here? Yes. Gordon >I'd be very happy if I could get some help for these questions as well! >Thanks a lot for your time and best regards, > >Dorthe
ADD REPLY

Login before adding your answer.

Traffic: 564 users visited in the last hour
Help About
FAQ
Access RSS
API
Stats

Use of this site constitutes acceptance of our User Agreement and Privacy Policy.

Powered by the version 2.3.6