Question on unbalanced paired design

0

Entering edit mode

Sandhya Pemmasani Kiran ▴ 40

@sandhya-pemmasani-kiran-5878

Last seen 7.2 years ago

India

Dear List: We are analyzing Agilent microarray data for a study where samples are related. After Quantile normalization on 'gProcessedSignal', averaging replicate spots and log transformation, we are trying to use LIMMA for differential expression analysis. Design is as follows- 4 Treatment groups - A, B, C and D 3 Doses per Treatment group, but 4 doses for Treatment A (Total 13 Treatment-Dose combinations) There are 8 patient samples in each Treatment-Dose combination (Total 104 samples) We are interested in comparing Dose effects within Treatments and overlaps across Treatment-Dose combinations. No Treatment comparisons like A vs. B Patient samples are related within a Treatment group. But they differ from treatment to treatment. So, this is a nested design, but samples are related/paired. These samples are coming from 32 patients. Out of 104 samples, 12 samples failed in Extraction/Hybridization QC and we are currently analyzing 92 samples. We missed few of the paired samples in each Treatment-Dose group. Here are the few lines of targets file (attached is full targets file)- SampleName Trt Dose SibShip A-01-001 A 1 1 A-03-001 A 3 1 A-04-001 A 4 1 A-01-012 A 1 6 A-02-012 A 2 6 A-04-012 A 4 6 A-01-031 A 1 14 A-02-031 A 2 14 A-03-031 A 3 14 A-04-031 A 4 14 A-01-040 A 1 17 A-02-040 A 2 17 A-03-040 A 3 17 A-04-040 A 4 17 . . . . . . . . . . . . B-01-013 B 1 7 B-02-013 B 2 7 B-03-013 B 3 7 B-01-016 B 1 10 B-02-016 B 2 10 B-03-016 B 3 10 B-01-024 B 1 12 B-02-024 B 2 12 B-03-024 B 3 12 . . . . . . . . R-code- ------------- targets_design = readTargets("targets_design.txt") > TD <- factor(paste(targets_design$Trt, targets_design$Dose, sep="_")) > Sibship <- factor(targets_design$SibShip) > design <- model.matrix(~0+TD+Sibship) > fit <- lmFit(ldt, design) Coefficients not estimable: Sibship27 Sibship31 Sibship32 Warning message: Partial NA coefficients for 34127 probe(s) > cont.matrix <- makeContrasts( + TDA_2 - TDA_1, + TDA_3 - TDA_2, + TDA_4 - TDA_3, + TDB_2 - TDB_1, + TDB_3 - TDB_2, + levels = design) > fit1 <- contrasts.fit(fit, cont.matrix) > fit2 <- eBayes(fit1) > fit2$coefficients[1:5,] Contrasts TDA_2 - TDA_1 TDA_3 - TDA_2 TDA_4 - TDA_3 TDB_2 - TDB_1 TDB_3 - TDB_2 A_23_P146146 -0.2176523 0.14287127 -0.05801898 0.3476315 -0..25312193 A_23_P42935 0.1718808 0.18653560 -0.20015286 -0.2664990 -0..04537665 A_23_P117082 0.1545347 0.32006311 -0.16050816 1.0063268 -1..01438229 A_23_P2683 -0.2549002 -0.16453369 0.27796574 0.2916715 -0..79682996 A_24_P358131 -0.4647673 0.09824839 0.22298962 -0.4026419 0..53349466 When I run the above code taking patient samples for which we have observations on all treatments, it seems to be correct- because logFC values are matching with my calculations. So, my design matrix is correct ??? But, when I include, all the samples (92), logFC values are not matching, because of unbalanced data and LIMMA doesn't ignore non- paired samples, as discussed in https://stat.ethz.ch/pipermail/bioconductor/2011-August/040875.html Should I go ahead with analysis (thinking that design matrix is correct) or is it better to do individual paired t-tests, ignoring data from non-paired samples at each comparison level? Can you suggest an easy way to explain to non-statisticians that why values are not matching. Thanks, Sandhya ________________________________ This e-mail contains PRIVILEGED AND CONFIDENTIAL INFORMATION intended solely for the use of the addressee(s). If you are not the intended recipient, please notify the sender by e-mail and delete the original message. Further, you are not to copy, disclose, or distribute this e-mail or its contents to any other person and any such actions that are unlawful. This e-mail may contain viruses. Ocimum Biosolutions has taken every reasonable precaution to minimize this risk, but is not liable for any damage you may sustain as a result of any virus in this e-mail. You should carry out your own virus checks before opening the e-mail or attachment. The information contained in this email and any attachments is confidential and may be subject to copyright or other intellectual property protection. If you are not the intended recipient, you are not authorized to use or disclose this information, and we request that you notify us by reply mail or telephone and delete the original message from your mail system. OCIMUMBIO SOLUTIONS (P) LTD -------------- next part -------------- An embedded and charset-unspecified text was scrubbed... Name: targets_design.txt URL: <https: stat.ethz.ch="" pipermail="" bioconductor="" attachments="" 20130408="" 150b318f="" attachment.txt="">

Microarray Normalization GO limma DOSE Microarray Normalization GO limma DOSE • 2.5k views

ADD COMMENT • link updated 12.8 years ago by Ryan C. Thompson ★ 7.9k • written 12.8 years ago by Sandhya Pemmasani Kiran ▴ 40

0

Entering edit mode

Ryan C. Thompson ★ 7.9k

@ryan-c-thompson-5618

Last seen 15 months ago

Icahn School of Medicine at Mount Sinai…

Hi Sandhya, When you run limma on the unbalanced design, it will use all the information that you give it, including the unpaired samples. Despite being unpaired, these samples still belong to a TD group and a SibShip group, and will contribute to the fitting of the coefficients for those two groups. It makes sense that including these samples will change the logFC values slightly, because you are providing additional information beyond what you provide in the paired-only case. The computed logFC values are limma's best estimate of the contrasts you request based on all the data that you provide to it, and as long as all the data are good, these should be, on average, better estimates of the contrasts because they are based on strictly more data than the paired-only case. To give a simple analogy, imagine you are measuring a simple variable at two times, such as the height of a sprouting plant at 1 week and 2 weeks. You use five seeds, but one of the sprouts dies at 1.5 weeks, so you have 4 paired observations at 1 & 2 weeks and then one unpaired observation at 1 week. To figure out what the growth rate is between 1 and 2 weeks, you could just use the 4 paired observations based on the surviving plants, and just take the average change. This is the most intuitive thing to do. However, the one unpaired fifth observation is not completely uninformative. This observation contributes to your estimate of the average height of the plant at 1 week, which can in turn influence your estimate of the height change from 1 to 2 weeks. There is not a simple intuitive description of how to factor in this information, as there was for the paired case ("Take the average difference"), but we have a very general tool called linear modeling that is formulated to handle such unbalanced data and make use of *all* that data in a consistent way. The trade-off for using a more general tool is that you are not guaranteed to have that simple arithmetic description of what the coefficients mean in the general case. In your case, with the unbalanced design including the unpaired samples, you can't simply say that the logFC for A2-A1 is the average difference in log signal between A2 and A1 across all sibships. You have to say something like "We fitted a linear model using ordinary least squares regression with coefficients for each TD group and each sibship, and this is the difference between the A2 coefficient and the A1 coefficient." Because in the most general sense, a contrast is just that: a linear combination (i.e. sum or difference) of linear model coefficients. I hope this gives an abstract idea of why the inclusion of these these unpaired samples can, will, and *should* affect the overall logFC estimates, even though those samples by themselves cannot give any information about the logFC value. -Ryan Thompson On 04/07/2013 11:10 PM, Sandhya Pemmasani Kiran wrote: > Dear List: > > We are analyzing Agilent microarray data for a study where samples are related. After Quantile normalization on 'gProcessedSignal', averaging replicate spots and log transformation, we are trying to use LIMMA for differential expression analysis. > > Design is as follows- > 4 Treatment groups - A, B, C and D > 3 Doses per Treatment group, but 4 doses for Treatment A (Total 13 Treatment-Dose combinations) > There are 8 patient samples in each Treatment-Dose combination (Total 104 samples) > > We are interested in comparing Dose effects within Treatments and overlaps across Treatment-Dose combinations. No Treatment comparisons like A vs. B > > Patient samples are related within a Treatment group. But they differ from treatment to treatment. So, this is a nested design, but samples are related/paired. These samples are coming from 32 patients. > > Out of 104 samples, 12 samples failed in Extraction/Hybridization QC and we are currently analyzing 92 samples. We missed few of the paired samples in each Treatment-Dose group. > > Here are the few lines of targets file (attached is full targets file)- > > SampleName Trt Dose SibShip > A-01-001 A 1 1 > A-03-001 A 3 1 > A-04-001 A 4 1 > A-01-012 A 1 6 > A-02-012 A 2 6 > A-04-012 A 4 6 > A-01-031 A 1 14 > A-02-031 A 2 14 > A-03-031 A 3 14 > A-04-031 A 4 14 > A-01-040 A 1 17 > A-02-040 A 2 17 > A-03-040 A 3 17 > A-04-040 A 4 17 > . . . . > . . . . > . . . . > B-01-013 B 1 7 > B-02-013 B 2 7 > B-03-013 B 3 7 > B-01-016 B 1 10 > B-02-016 B 2 10 > B-03-016 B 3 10 > B-01-024 B 1 12 > B-02-024 B 2 12 > B-03-024 B 3 12 > . . . . > . . . . > > R-code- > ------------- > targets_design = readTargets("targets_design.txt") >> TD <- factor(paste(targets_design$Trt, targets_design$Dose, sep="_")) >> Sibship <- factor(targets_design$SibShip) >> design <- model.matrix(~0+TD+Sibship) >> fit <- lmFit(ldt, design) > Coefficients not estimable: Sibship27 Sibship31 Sibship32 > Warning message: > Partial NA coefficients for 34127 probe(s) >> cont.matrix <- makeContrasts( > + TDA_2 - TDA_1, > + TDA_3 - TDA_2, > + TDA_4 - TDA_3, > + TDB_2 - TDB_1, > + TDB_3 - TDB_2, > + levels = design) >> fit1 <- contrasts.fit(fit, cont.matrix) >> fit2 <- eBayes(fit1) >> fit2$coefficients[1:5,] > Contrasts > TDA_2 - TDA_1 TDA_3 - TDA_2 TDA_4 - TDA_3 TDB_2 - TDB_1 TDB_3 - TDB_2 > A_23_P146146 -0.2176523 0.14287127 -0.05801898 0.3476315 -0..25312193 > A_23_P42935 0.1718808 0.18653560 -0.20015286 -0.2664990 -0..04537665 > A_23_P117082 0.1545347 0.32006311 -0.16050816 1.0063268 -1..01438229 > A_23_P2683 -0.2549002 -0.16453369 0.27796574 0.2916715 -0..79682996 > A_24_P358131 -0.4647673 0.09824839 0.22298962 -0.4026419 0..53349466 > > When I run the above code taking patient samples for which we have observations on all treatments, it seems to be correct- because logFC values are matching with my calculations. So, my design matrix is correct ??? > > But, when I include, all the samples (92), logFC values are not matching, because of unbalanced data and LIMMA doesn't ignore non- paired samples, as discussed in > https://stat.ethz.ch/pipermail/bioconductor/2011-August/040875.html > > Should I go ahead with analysis (thinking that design matrix is correct) or is it better to do individual paired t-tests, ignoring data from non-paired samples at each comparison level? > > Can you suggest an easy way to explain to non-statisticians that why values are not matching. > > > > Thanks, > Sandhya > > > > > > ________________________________ > This e-mail contains PRIVILEGED AND CONFIDENTIAL INFORMATION intended solely for the use of the addressee(s). If you are not the intended recipient, please notify the sender by e-mail and delete the original message. Further, you are not to copy, disclose, or distribute this e-mail or its contents to any other person and any such actions that are unlawful. This e-mail may contain viruses. Ocimum Biosolutions has taken every reasonable precaution to minimize this risk, but is not liable for any damage you may sustain as a result of any virus in this e-mail. You should carry out your own virus checks before opening the e-mail or attachment. > > > The information contained in this email and any attachments is confidential and may be subject to copyright or other intellectual property protection. If you are not the intended recipient, you are not authorized to use or disclose this information, and we request that you notify us by reply mail or telephone and delete the original message from your mail system. > > OCIMUMBIO SOLUTIONS (P) LTD > > > _______________________________________________ > Bioconductor mailing list > Bioconductor@r-project.org > https://stat.ethz.ch/mailman/listinfo/bioconductor > Search the archives: http://news.gmane.org/gmane.science.biology.informatics.conductor [[alternative HTML version deleted]]

ADD COMMENT • link 12.8 years ago Ryan C. Thompson ★ 7.9k

Login before adding your answer.