Normalization and batch correction across multiple projects


Adaikalavan Ramasamy ▴ 220

@adaikalavan-ramasamy-5765

Last seen 9.5 years ago

United Kingdom

Dear all,

I would like to appeal to the collective wisdom in this group on how best to solve this problem of normalization and batch correction.

We are a service unit for an academic institute and we run several projects simultaneously. We use Illumina HT12-v4 microarrays, which can take up to 12 different samples per chip. As we QC the data from one project, the RNA from failed samples can be re-run on chips from another project (to avoid the wastage of running partial chips). Sometimes we include samples from other projects also. Here is a simple illustration:

Chip No   ScanDate     Contents
1         1st July     *12 samples from project A*
2         1st July     *8 samples from project A* + 4 from project B
3         1st August   12 samples from project B
4         1st August   *1 sample from project A* + 5 samples from B + 6 from project C
...

What is the best way to prepare the final data for *project A*? One option is to do the following:

1. Pool chips 1, 2 and 4 together.
2. Remove failed samples.
3. Remove samples from other projects.
4. Normalize using neqc() from limma.
5. Correct for scan date using ComBat() from sva.

The other option we considered is to omit step 3 (i.e. use the other samples for normalization and ComBat) and subset at the end.

I feel this second option allows for better estimation of batch effects (especially in chip 4). However, sometimes projects A and B can be quite different (e.g. samples derived from different tissues), which might mess up the normalization, especially if we want to compare project A to B directly. We also considered nec() followed by normalizeBetweenArrays() with "Tquantile", but I felt it was too complicated. Anything else to try?

Thank you.

--
Adaikalavan Ramasamy
Senior Leadership Fellow in Bioinformatics
Head of the Transcriptomics Core Facility
Email: adaikalavan.ramasamy at ndm.ox.ac.uk
Office: 01865 287 710
Mob: 07906 308 465
http://www.jenner.ac.uk/transcriptomics-facility
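To see why chip 4 is the awkward case, it can help to tabulate samples by project and scan date. The following is a minimal sketch in base R using the illustrative chip layout from the question; the column names `chip`, `scan_date` and `project` are made up for this example and are not taken from any real sample sheet.

```r
# Sample sheet for the four illustrative chips (hypothetical column names).
samples <- data.frame(
  chip      = rep(1:4, each = 12),
  scan_date = rep(c("July", "August"), each = 24),   # chips 1-2 July, chips 3-4 August
  project   = c(rep("A", 12),                        # chip 1
                rep(c("A", "B"), c(8, 4)),           # chip 2
                rep("B", 12),                        # chip 3
                rep(c("A", "B", "C"), c(1, 5, 6)))   # chip 4
)

# Cross-tabulate project against scan date to see how well each
# scan date is represented within each project.
layout <- table(project = samples$project, scan_date = samples$scan_date)
print(layout)
```

In this layout project A has 20 arrays scanned in July but only one in August, so a scan-date effect estimated from project A alone would rest on a single array.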

Normalization sva • 1.3k views

updated 9.7 years ago by Gordon Smyth 50k • written 9.7 years ago by Adaikalavan Ramasamy ▴ 220


Ryan C. Thompson ★ 7.9k

@ryan-c-thompson-5618

Last seen 8 months ago

Scripps Research, La Jolla, CA

Hi Adaikalavan,

Why not try it both ways and see if it even makes a difference? If you get the same results either way, then just do whatever is easier.

If you do batch correction before removing other projects' samples, I would think you would need to include the project identifier as a batch effect in addition to the scan date or chip number, right?

-Ryan
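One possible reading of the suggestion to account for project when correcting pooled data is sketched below. ComBat() from sva takes a single batch variable, so this sketch treats scan date as the batch and passes project through `mod`, so that project differences are preserved rather than attributed to the batch. That choice is an assumption made for illustration, not something confirmed in the thread, and the expression matrix is simulated (random values with an artificial shift for the August arrays).

```r
library(sva)

set.seed(1)

# Simulated stand-in for normalized log2 data on the four illustrative chips.
scan_date <- factor(rep(c("July", "August"), each = 24))
project   <- factor(c(rep("A", 12), rep(c("A", "B"), c(8, 4)),
                      rep("B", 12), rep(c("A", "B", "C"), c(1, 5, 6))))
expr <- matrix(rnorm(500 * 48, mean = 8), nrow = 500,
               dimnames = list(paste0("probe", 1:500), paste0("s", 1:48)))

# Artificial scan-date effect: shift every August array upwards by 1.
expr[, scan_date == "August"] <- expr[, scan_date == "August"] + 1

# Project enters as a covariate to preserve; scan date is the batch to remove.
mod      <- model.matrix(~ project)
adjusted <- ComBat(dat = expr, batch = scan_date, mod = mod)

# Subset to project A only after the pooled correction.
expr_A <- adjusted[, project == "A"]
```

With real data, the same call would only make sense if every scan date contains more than one array and project is not completely confounded with scan date.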

written 9.7 years ago by Ryan C. Thompson ★ 7.9k


Dear Ryan,

Thank you for the advice. I am happy to do it both ways, but these are large projects and it would be difficult to judge whether any differences are small enough to ignore, which is why I wanted to get the opinion of others on this list.

And yes, you are right that we need to include both project and scan date in the adjustment if I batch correct first. Thanks.

Regards, Adai
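One rough way to put numbers on "does it make a difference" is to run both orderings and compare the resulting values for project A directly. The sketch below assumes simulated log2 data and uses plain quantile normalization via normalizeBetweenArrays() from limma as a simplified stand-in for neqc(), since simulating Illumina control probes is beyond a short example. The summary statistics chosen here (absolute differences and per-array correlations) are only one option.

```r
library(limma)

set.seed(1)

# Simulated log2 intensities for the four illustrative chips (48 arrays).
project <- c(rep("A", 12), rep(c("A", "B"), c(8, 4)),
             rep("B", 12), rep(c("A", "B", "C"), c(1, 5, 6)))
expr <- matrix(rnorm(500 * 48, mean = 8), nrow = 500,
               dimnames = list(paste0("probe", 1:500), paste0("s", 1:48)))
is_A <- project == "A"

# Option 1: keep only project A, then normalize.
norm_subset_first <- normalizeBetweenArrays(expr[, is_A], method = "quantile")

# Option 2: normalize all arrays together, then keep only project A.
norm_pooled <- normalizeBetweenArrays(expr, method = "quantile")[, is_A]

# Summarize how far apart the two versions of project A are.
diff_abs       <- abs(norm_subset_first - norm_pooled)
diff_summary   <- c(median = median(diff_abs), max = max(diff_abs))
per_sample_cor <- diag(cor(norm_subset_first, norm_pooled))

print(diff_summary)
print(range(per_sample_cor))
```

On real data the more relevant comparison is probably downstream, for example whether the ranking of differentially expressed probes changes between the two versions.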

written 9.7 years ago by Adaikalavan Ramasamy ▴ 220


Adaikalavan Ramasamy ▴ 220

@adaikalavan-ramasamy-5765

Last seen 9.5 years ago

United Kingdom

Dear all,

I had no response to this email that I sent last week. If anyone has any input, I would greatly appreciate it.

Thank you.

Regards, Adai

written 9.7 years ago by Adaikalavan Ramasamy ▴ 220


Gordon Smyth 50k

@gordon-smyth

Last seen 7 minutes ago

WEHI, Melbourne, Australia

We have had to regularly address the same issues that you are facing. There is no blanket answer -- every case needs to be considered on its own merits -- but you seem to be considering the right options.

In our work, we generally adjust for the batch in the limma linear model rather than trying to remove it up-front using ComBat. Also consider removeBatchEffect().

As you say, analysing multiple projects together can help estimate a batch effect. However, this approach will come unstuck if the samples for the projects are very different.

There is another reason why we generally avoid analysing multiple projects together. The projects will usually need to be submitted eventually to a public repository such as GEO, and the different projects generally have to be submitted independently. Users will not be able to reproduce our normalization and analysis unless the projects are analysed separately.

Best wishes
Gordon
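The approach of adjusting for batch inside the limma linear model, with removeBatchEffect() kept for plots, might look roughly like the sketch below. Everything about the data is assumed for illustration: a single project with a hypothetical two-level factor `group` (control vs treated), 24 arrays split evenly over two scan dates, and simulated log2 values with an artificial shift for the August arrays.

```r
library(limma)

set.seed(1)

# Hypothetical single-project design: 24 arrays, two groups, two scan dates.
group     <- factor(rep(c("control", "treated"), times = 12))
scan_date <- factor(rep(c("July", "August"), each = 12))
expr <- matrix(rnorm(200 * 24, mean = 8), nrow = 200,
               dimnames = list(paste0("probe", 1:200), paste0("s", 1:24)))

# Artificial scan-date effect: shift every August array upwards by 1.
expr[, scan_date == "August"] <- expr[, scan_date == "August"] + 1

# Batch enters the model as a covariate; the data themselves are left unchanged.
design <- model.matrix(~ group + scan_date)
fit    <- eBayes(lmFit(expr, design))
top    <- topTable(fit, coef = "grouptreated")

# For plots only (e.g. MDS or heatmaps): a batch-adjusted copy of the data,
# with the group effect protected through the design argument.
design_group <- model.matrix(~ group)
expr_plot    <- removeBatchEffect(expr, batch = scan_date, design = design_group)
```

Keeping scan date in the design means the differential expression test accounts for the degrees of freedom spent on the batch, whereas the adjusted matrix `expr_plot` is intended for visualization and not for feeding back into lmFit().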

written 9.7 years ago by Gordon Smyth 50k
