edgeR GLM to adjust for batch effect


Guest User ★ 13k

@guest-user-4897

Last seen 9.6 years ago

Hi,

I'd like to use a GLM in edgeR to adjust for a batch effect, though only one of my four batches has samples from both groups in the comparisons that I'd like to conduct (pos-nc & neg-nc):

         batch
         1   2   3   4
    pos  3   5   9   0
    neg  5   4   7   0
    nc   0   0   5   8

I suspect that using a GLM in edgeR to adjust for batch will only work properly if there's representation of both groups from a given comparison in every batch, though would like to know if this is otherwise. I see a batch effect using PVCA on just the pos and neg samples, and would like to try to adjust for it somehow. Please advise.

Thanks,
Ryan

-- output of sessionInfo():

R version 3.0.3 (2014-03-06)
Platform: x86_64-pc-linux-gnu (64-bit)

locale:
[1] LC_CTYPE=en_US.UTF-8 LC_NUMERIC=C LC_TIME=en_US.UTF-8 LC_COLLATE=en_US.UTF-8
[5] LC_MONETARY=en_US.UTF-8 LC_MESSAGES=en_US.UTF-8 LC_PAPER=en_US.UTF-8 LC_NAME=C
[9] LC_ADDRESS=C LC_TELEPHONE=C LC_MEASUREMENT=en_US.UTF-8 LC_IDENTIFICATION=C

attached base packages:
[1] splines parallel stats graphics grDevices utils datasets methods base

other attached packages:
[1] pvca_1.2.0 beadChipCoreTools_0.49 beadAnno_1.0 lumi_2.14.1
[5] Biobase_2.22.0 BiocGenerics_0.8.0 genefilter_1.44.0 arrayQualityMetrics_3.18.0
[9] edgeR_3.4.2 limma_3.18.12

--
Sent via the guest posting facility at bioconductor.org.

edgeR pvca • 1.8k views

updated 10.1 years ago by Ryan C. Thompson ★ 7.9k • written 10.1 years ago by Guest User ★ 13k


Ryan C. Thompson ★ 7.9k

@ryan-c-thompson-5618

Last seen 8 months ago

Scripps Research, La Jolla, CA

You don't necessarily need every condition in every batch for the comparison to be effective, but having only one batch in common is not good. If I understand correctly, batch 3 would be the dominant contributor to the estimates of fold changes in the comparisons that you care about, since any other change would be mostly absorbed into the batch effects.

I think the first step you should take is to fit the full model with conditions and batch effects, and find out whether the batch effects appear significant enough to warrant inclusion in the model. If not, drop them.

-Ryan

On Wed 26 Mar 2014 03:47:42 PM PDT, Ryan Basom [guest] wrote:
> I'd like to use a GLM in edgeR to adjust for a batch effect, though only one of my four batches has samples from both groups in the comparisons that I'd like to conduct (pos-nc & neg-nc) [...]

10.1 years ago Ryan C. Thompson ★ 7.9k
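The suggestion above (fit the full model with group and batch, then check whether the batch terms matter) could look roughly like the following in edgeR. This is a minimal sketch on simulated counts, not the poster's data: the object names (`counts`, `group`, `batch`, `lrt_batch`) and the simulation settings are made up for illustration. It also assumes a reasonably current edgeR; older releases may need `estimateGLMCommonDisp`, `estimateGLMTrendedDisp` and `estimateGLMTagwiseDisp` in place of `estimateDisp`.

```r
library(edgeR)

set.seed(1)

# Sample layout copied from the question (samples per group within each batch).
group <- factor(rep(rep(c("pos", "neg", "nc"), times = 4),
                    times = c(3, 5, 0,  5, 4, 0,  9, 7, 5,  0, 0, 8)),
                levels = c("nc", "pos", "neg"))   # nc is the reference level
batch <- factor(rep(1:4, times = c(8, 9, 21, 8)))

# Hypothetical data: 200 genes of negative binomial counts with no true effects.
counts <- matrix(rnbinom(200 * length(group), mu = 50, size = 10),
                 nrow = 200,
                 dimnames = list(paste0("gene", 1:200), NULL))

y <- DGEList(counts = counts, group = group)
y <- calcNormFactors(y)

# Additive model: group effects plus baseline shifts for batches 2, 3 and 4.
design <- model.matrix(~ group + batch)
y <- estimateDisp(y, design)
fit <- glmFit(y, design)

# Test all batch coefficients jointly (columns batch2, batch3, batch4).
lrt_batch <- glmLRT(fit, coef = grep("^batch", colnames(design)))
topTags(lrt_batch)

# Number of genes with a batch effect at 5% FDR.
sum(p.adjust(lrt_batch$table$PValue, method = "BH") < 0.05)

# Comparisons of interest, adjusted for batch: pos vs nc and neg vs nc.
lrt_pos <- glmLRT(fit, coef = "grouppos")
lrt_neg <- glmLRT(fit, coef = "groupneg")
```

With `nc` as the reference level, the `grouppos` and `groupneg` coefficients correspond directly to the pos-nc and neg-nc comparisons from the question, so no explicit contrast vector is needed.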


Thanks for this advice. I have a follow-up question, though. The edgeR User's Guide says, about adjusting for batch effects: "In this type of analysis, the treatments are compared only within each batch. The analysis is corrected for baseline differences between the batches." If some of the batches don't have samples for both treatments, how is this compensated for? Though this isn't ideal, I'd like to get a better sense of what's going on in this scenario.

Thanks,
Ryan

On 03/26/2014 04:36 PM, Ryan C. Thompson wrote:
> You don't necessarily need every condition in every batch for the comparison to be effective, but having only one batch in common is not good. [...]

10.1 years ago Ryan Basom ▴ 20


If you only had two conditions (instead of three) and only a single batch had samples from both conditions, then you would be completely unable to dissociate batch effects from treatment effects, and your treatment fold change would be entirely determined by the one batch with both conditions in it. (The other batches would still contribute to dispersion estimation.)

However, in your case, you have a third treatment, which means that treatment and batch are not completely confounded, except in the case of batch 4, which has only a single treatment. By my understanding, in this model, batch 3 will be solely responsible for determining the estimate of fold change between nc and the mean of pos & neg, while batches 1 & 2 will also contribute to the fold change between pos and neg. Batch 4 will not contribute directly to any estimate of fold change between treatments.

Overall, I would be quite uncomfortable including a batch effect in my model for this data, and I would search for evidence that the batch effect is non-significant. It might be appropriate to estimate dispersions with a batch effect included and then drop the batch effect for the model fitting step, but I'm not confident about the statistical validity of such an approach. This would inflate your significance measures relative to leaving out the batch effect entirely, so it may end up being anti-conservative.

-Ryan

On 03/27/2014 02:51 PM, Ryan Basom wrote:
> Thanks for this advice. I have a follow up question though: [...] If some of the batches don't have samples for say both treatments, how is this compensated for? [...]

10.1 years ago Ryan C. Thompson ★ 7.9k
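One way to get a feel for which batches can inform each comparison is to look at an ordinary least squares analogue of the additive group + batch model. The sketch below is a simplification under stated assumptions: edgeR fits a negative binomial GLM, where the effective weights also depend on the fitted means and dispersions, so the numbers are indicative only. For each contrast it computes the per-sample weights of the least squares estimator and sums their absolute values by batch. The names `W`, `contr` and `contrib` are illustrative.

```r
# Sample layout copied from the question (samples per group within each batch).
group <- factor(rep(rep(c("pos", "neg", "nc"), times = 4),
                    times = c(3, 5, 0,  5, 4, 0,  9, 7, 5,  0, 0, 8)),
                levels = c("pos", "neg", "nc"))
batch <- factor(rep(1:4, times = c(8, 9, 21, 8)))

# Additive design without intercept: one column per group, plus batch2..batch4.
design <- model.matrix(~ 0 + group + batch)

# The design is estimable only if it has full column rank.
qr(design)$rank == ncol(design)

# Least squares estimator: beta_hat = W %*% y, with W = (X'X)^{-1} X'.
W <- solve(crossprod(design), t(design))

# Contrasts expressed on the columns of the design matrix.
contr <- cbind(pos_vs_nc  = c(1,  0, -1, 0, 0, 0),
               neg_vs_nc  = c(0,  1, -1, 0, 0, 0),
               pos_vs_neg = c(1, -1,  0, 0, 0, 0))
rownames(contr) <- colnames(design)

# Per-sample weights for each contrast (one row per contrast).
w <- t(contr) %*% W

# Total absolute weight carried by each batch, per contrast.
contrib <- t(rowsum(t(abs(w)), group = batch))
print(round(contrib, 3))
```

In this least squares setting the batch 4 column is zero for every contrast, because all of its samples share one group and are absorbed by the `batch4` coefficient. Comparing the batch 1 and batch 2 columns across rows shows how much those batches feed into the nc comparisons relative to pos versus neg.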
