normalization and batch correction across multiple project
1
0
Entering edit mode
@adaikalavan-ramasamy-5765
Last seen 10.1 years ago
United Kingdom
Dear Prof. Smyth, Apologies for the late reply. Thank you so much for valuable input and yes reproducibility of results is paramount. As you indicated, we do sometimes have samples originating from different origins (e.g. PBMC vs. PAXgene) or different countries where sample collection/storage varies which would complicate matters. I tried removeBatchEffect() and it works well. The example project I tried had four batches, one of which consisted of only single sample. How is removeBatchEffects() able to deal with single sample batches when ComBat does not? Here is an example: expr <- matrix( 10 + rnorm(10000*100), nc=100 ) batch <- paste0("Batch", rep( c(1, 2, 3, 4), c(33, 33, 33, 1) ) ) tissue <- sample(c("A", "B", "C"), 100, replace=T) table(batch, tissue) # tissue # batch A B C # Batch1 11 9 13 # Batch2 12 11 10 # Batch3 14 9 10 # Batch4 0 1 0 mod <- model.matrix( ~ tissue) ## I want to preserve variation due to tissue resid <- ComBat(dat=expr, batch=batch, mod=mod) # Found 4 batches # Found 2 categorical covariate(s) # Standardizing Data across genes # Fitting L/S model and finding priors # Error in apply(s.data[, i], 1, var, na.rm = T) : # dim(X) must have a positive length out <- removeBatchEffect( expr, batch=batch, design=mod ) ## works! Should I post this as a separate email? Thank you. Regards, Adai On Thu, Aug 28, 2014 at 2:01 AM, Gordon K Smyth <smyth at="" wehi.edu.au=""> wrote: > We have had to regularly address the same issues that you are facing. > There is no blanket answer -- every case needs to be considered on its own > merits -- but you seem to be considering the right options. > > In our work, we generally adjust for the batch in the limma linear model > rather than trying to remove it up-front using combat. Also consider > removeBatchEffect(). > > As you say, analysing multiple projects together can help estimate a batch > effect. However this approach will come unstuck if the samples for the > projects are very different. There is another reason why we generally > avoid analysing multiple projects together. The projects will usually need > to be submitted eventually to a public repository such as GEO, and the > different projects generally have to be submitted independently. Users will > not be able to reproduce our normalization and analysis unless the projects > are analyzed separately. > > Best wishes > Gordon > > On Mon, Aug 18, 2014 at 1:11 PM, Adaikalavan Ramasamy wrote: >> >> Dear all, >> >> I would like to appeal to the collective wisdom in this group on how best >> to solve this problem of normalization and batch correction. >> >> We are a service unit for an academic institute and we run several >> projects simultaneously. We use Illumina HT12-v4 microarrays which can >> take up to 12 different samples per chip. As we QC the data from one >> project, the RNA from failed samples can be repeated to include into chips >> from another project (rather than running partial chips to avoid wastage). >> Sometimes we include samples from other projects also. Here is a simple >> illustration >> >> Chip No ScanDate Contents >> 1 1st July *12 samples from project A* >> 2 1st July *8 samples from project A* + 4 from >> project B >> 3 1st August 12 samples from Project B >> 4 1st August *1 sample from Project A* + 5 samples >> from B + 6 from project C >> ... >> >> What is the best way to prepare the final data for *project A*? One >> option is to do the following: >> >> 1. Pool chips 1, 2 and 4 together. >> 2. Remove failed samples >> 3. Remove samples from other projects. >> 4. Normalize using NEQC from limma >> 5. Correct for scan date using COMBAT from sva. >> >> The other option we considered is to omit step 3 (i.e. use other samples >> for normalization and COMBAT) and subset at the end. >> >> I feel this second option allows for better estimation of batch effects >> (especially in chip 4). However, sometimes project A and B can be quite >> different (e.g. samples derived from different tissues) which might mess >> up >> the normalization especially if we want to compare project A to B >> directly. We >> also considered nec() followed by normalizeBetweenArrays with "Tquantile" >> but I felt it was too complicated. Anything else to try? >> >> Thank you. >> >> -- >> >> Adaikalavan Ramasamy >> >> Senior Leadership Fellow in Bioinformatics >> >> Head of the Transcriptomics Core Facility >> >> >> >> Email: adaikalavan.ramasamy at ndm.ox.ac.uk >> >> Office: 01865 287 710 >> >> Mob: 07906 308 465 >> >> http://www.jenner.ac.uk/transcriptomics-facility >> > > > ______________________________________________________________________ > The information in this email is confidential and inte...{{dropped:10}}
Normalization limma sva • 2.2k views
ADD COMMENT
0
Entering edit mode
@gordon-smyth
Last seen 43 minutes ago
WEHI, Melbourne, Australia
Dear Adai, The removeBatchEffects() function is written in such a way that it always runs without an error. It makes fewer assumptions than Combat, and just does the best it can with whatever data you give it. Nevertheless, I would question the value of having a batch that contains only one sample. Better to pool this and another batch. Best wishes Gordon On Mon, 8 Sep 2014, Adaikalavan Ramasamy wrote: > Dear Prof. Smyth, > > Apologies for the late reply. Thank you so much for valuable input and yes > reproducibility of results is paramount. As you indicated, we do sometimes > have samples originating from different origins (e.g. PBMC vs. PAXgene) or > different countries where sample collection/storage varies which would > complicate matters. > > I tried removeBatchEffect() and it works well. The example project I tried > had four batches, one of which consisted of only single sample. How is > removeBatchEffects() able to deal with single sample batches when ComBat > does not? Here is an example: > > expr <- matrix( 10 + rnorm(10000*100), nc=100 ) > batch <- paste0("Batch", rep( c(1, 2, 3, 4), c(33, 33, 33, 1) ) ) > tissue <- sample(c("A", "B", "C"), 100, replace=T) > > table(batch, tissue) > # tissue > # batch A B C > # Batch1 11 9 13 > # Batch2 12 11 10 > # Batch3 14 9 10 > # Batch4 0 1 0 > > mod <- model.matrix( ~ tissue) ## I want to preserve variation due to > tissue > > > resid <- ComBat(dat=expr, batch=batch, mod=mod) > # Found 4 batches > # Found 2 categorical covariate(s) > # Standardizing Data across genes > # Fitting L/S model and finding priors > # Error in apply(s.data[, i], 1, var, na.rm = T) : > # dim(X) must have a positive length > > > out <- removeBatchEffect( expr, batch=batch, design=mod ) ## works! > > Should I post this as a separate email? Thank you. > > Regards, Adai > > > > On Thu, Aug 28, 2014 at 2:01 AM, Gordon K Smyth <smyth at="" wehi.edu.au=""> wrote: > >> We have had to regularly address the same issues that you are facing. >> There is no blanket answer -- every case needs to be considered on its own >> merits -- but you seem to be considering the right options. >> >> In our work, we generally adjust for the batch in the limma linear model >> rather than trying to remove it up-front using combat. Also consider >> removeBatchEffect(). >> >> As you say, analysing multiple projects together can help estimate a batch >> effect. However this approach will come unstuck if the samples for the >> projects are very different. There is another reason why we generally >> avoid analysing multiple projects together. The projects will usually need >> to be submitted eventually to a public repository such as GEO, and the >> different projects generally have to be submitted independently. Users will >> not be able to reproduce our normalization and analysis unless the projects >> are analyzed separately. >> >> Best wishes >> Gordon >> >> On Mon, Aug 18, 2014 at 1:11 PM, Adaikalavan Ramasamy wrote: >>> >>> Dear all, >>> >>> I would like to appeal to the collective wisdom in this group on how best >>> to solve this problem of normalization and batch correction. >>> >>> We are a service unit for an academic institute and we run several >>> projects simultaneously. We use Illumina HT12-v4 microarrays which can >>> take up to 12 different samples per chip. As we QC the data from one >>> project, the RNA from failed samples can be repeated to include into chips >>> from another project (rather than running partial chips to avoid wastage). >>> Sometimes we include samples from other projects also. Here is a simple >>> illustration >>> >>> Chip No ScanDate Contents >>> 1 1st July *12 samples from project A* >>> 2 1st July *8 samples from project A* + 4 from >>> project B >>> 3 1st August 12 samples from Project B >>> 4 1st August *1 sample from Project A* + 5 samples >>> from B + 6 from project C >>> ... >>> >>> What is the best way to prepare the final data for *project A*? One >>> option is to do the following: >>> >>> 1. Pool chips 1, 2 and 4 together. >>> 2. Remove failed samples >>> 3. Remove samples from other projects. >>> 4. Normalize using NEQC from limma >>> 5. Correct for scan date using COMBAT from sva. >>> >>> The other option we considered is to omit step 3 (i.e. use other samples >>> for normalization and COMBAT) and subset at the end. >>> >>> I feel this second option allows for better estimation of batch effects >>> (especially in chip 4). However, sometimes project A and B can be quite >>> different (e.g. samples derived from different tissues) which might mess >>> up >>> the normalization especially if we want to compare project A to B >>> directly. We >>> also considered nec() followed by normalizeBetweenArrays with "Tquantile" >>> but I felt it was too complicated. Anything else to try? >>> >>> Thank you. >>> >>> -- >>> >>> Adaikalavan Ramasamy >>> >>> Senior Leadership Fellow in Bioinformatics >>> >>> Head of the Transcriptomics Core Facility >>> >>> >>> >>> Email: adaikalavan.ramasamy at ndm.ox.ac.uk >>> >>> Office: 01865 287 710 >>> >>> Mob: 07906 308 465 >>> >>> http://www.jenner.ac.uk/transcriptomics-facility >>> >> >> >> ______________________________________________________________________ >> The information in this email is confidential and intended solely for the >> addressee. >> You must not disclose, forward, print or use it without the permission of >> the sender. >> ______________________________________________________________________ >> > ______________________________________________________________________ The information in this email is confidential and intend...{{dropped:4}}
ADD COMMENT
0
Entering edit mode
Thank you for the explanation. To clarify, this batch contains 1 sample from this project and 11 samples from an unrelated project. So I extracted this one sample and added to others from this project. On 9 Sep 2014 03:16, "Gordon K Smyth" <smyth at="" wehi.edu.au=""> wrote: > Dear Adai, > > The removeBatchEffects() function is written in such a way that it always > runs without an error. It makes fewer assumptions than Combat, and just > does the best it can with whatever data you give it. > > Nevertheless, I would question the value of having a batch that contains > only one sample. Better to pool this and another batch. > > Best wishes > Gordon > > > On Mon, 8 Sep 2014, Adaikalavan Ramasamy wrote: > > Dear Prof. Smyth, >> >> Apologies for the late reply. Thank you so much for valuable input and yes >> reproducibility of results is paramount. As you indicated, we do sometimes >> have samples originating from different origins (e.g. PBMC vs. PAXgene) or >> different countries where sample collection/storage varies which would >> complicate matters. >> >> I tried removeBatchEffect() and it works well. The example project I tried >> had four batches, one of which consisted of only single sample. How is >> removeBatchEffects() able to deal with single sample batches when ComBat >> does not? Here is an example: >> >> expr <- matrix( 10 + rnorm(10000*100), nc=100 ) >> batch <- paste0("Batch", rep( c(1, 2, 3, 4), c(33, 33, 33, 1) ) ) >> tissue <- sample(c("A", "B", "C"), 100, replace=T) >> >> table(batch, tissue) >> # tissue >> # batch A B C >> # Batch1 11 9 13 >> # Batch2 12 11 10 >> # Batch3 14 9 10 >> # Batch4 0 1 0 >> >> mod <- model.matrix( ~ tissue) ## I want to preserve variation due to >> tissue >> >> >> resid <- ComBat(dat=expr, batch=batch, mod=mod) >> # Found 4 batches >> # Found 2 categorical covariate(s) >> # Standardizing Data across genes >> # Fitting L/S model and finding priors >> # Error in apply(s.data[, i], 1, var, na.rm = T) : >> # dim(X) must have a positive length >> >> >> out <- removeBatchEffect( expr, batch=batch, design=mod ) ## works! >> >> Should I post this as a separate email? Thank you. >> >> Regards, Adai >> >> >> >> On Thu, Aug 28, 2014 at 2:01 AM, Gordon K Smyth <smyth at="" wehi.edu.au=""> >> wrote: >> >> We have had to regularly address the same issues that you are facing. >>> There is no blanket answer -- every case needs to be considered on its >>> own >>> merits -- but you seem to be considering the right options. >>> >>> In our work, we generally adjust for the batch in the limma linear model >>> rather than trying to remove it up-front using combat. Also consider >>> removeBatchEffect(). >>> >>> As you say, analysing multiple projects together can help estimate a >>> batch >>> effect. However this approach will come unstuck if the samples for the >>> projects are very different. There is another reason why we generally >>> avoid analysing multiple projects together. The projects will usually >>> need >>> to be submitted eventually to a public repository such as GEO, and the >>> different projects generally have to be submitted independently. Users >>> will >>> not be able to reproduce our normalization and analysis unless the >>> projects >>> are analyzed separately. >>> >>> Best wishes >>> Gordon >>> >>> On Mon, Aug 18, 2014 at 1:11 PM, Adaikalavan Ramasamy wrote: >>> >>>> >>>> Dear all, >>>> >>>> I would like to appeal to the collective wisdom in this group on how >>>> best >>>> to solve this problem of normalization and batch correction. >>>> >>>> We are a service unit for an academic institute and we run several >>>> projects simultaneously. We use Illumina HT12-v4 microarrays which can >>>> take up to 12 different samples per chip. As we QC the data from one >>>> project, the RNA from failed samples can be repeated to include into >>>> chips >>>> from another project (rather than running partial chips to avoid >>>> wastage). >>>> Sometimes we include samples from other projects also. Here is a simple >>>> illustration >>>> >>>> Chip No ScanDate Contents >>>> 1 1st July *12 samples from project A* >>>> 2 1st July *8 samples from project A* + 4 from >>>> project B >>>> 3 1st August 12 samples from Project B >>>> 4 1st August *1 sample from Project A* + 5 samples >>>> from B + 6 from project C >>>> ... >>>> >>>> What is the best way to prepare the final data for *project A*? One >>>> option is to do the following: >>>> >>>> 1. Pool chips 1, 2 and 4 together. >>>> 2. Remove failed samples >>>> 3. Remove samples from other projects. >>>> 4. Normalize using NEQC from limma >>>> 5. Correct for scan date using COMBAT from sva. >>>> >>>> The other option we considered is to omit step 3 (i.e. use other samples >>>> for normalization and COMBAT) and subset at the end. >>>> >>>> I feel this second option allows for better estimation of batch effects >>>> (especially in chip 4). However, sometimes project A and B can be quite >>>> different (e.g. samples derived from different tissues) which might mess >>>> up >>>> the normalization especially if we want to compare project A to B >>>> directly. We >>>> also considered nec() followed by normalizeBetweenArrays with >>>> "Tquantile" >>>> but I felt it was too complicated. Anything else to try? >>>> >>>> Thank you. >>>> >>>> -- >>>> >>>> Adaikalavan Ramasamy >>>> >>>> Senior Leadership Fellow in Bioinformatics >>>> >>>> Head of the Transcriptomics Core Facility >>>> >>>> >>>> >>>> Email: adaikalavan.ramasamy at ndm.ox.ac.uk >>>> >>>> Office: 01865 287 710 >>>> >>>> Mob: 07906 308 465 >>>> >>>> http://www.jenner.ac.uk/transcriptomics-facility >>>> >>>> >>> >>> ______________________________________________________________________ >>> The information in this email is confidential and intended solely for the >>> addressee. >>> You must not disclose, forward, print or use it without the permission of >>> the sender. >>> ______________________________________________________________________ >>> >>> >> > ______________________________________________________________________ > The information in this email is confidential and inte...{{dropped:10}}
ADD REPLY

Login before adding your answer.

Traffic: 561 users visited in the last hour
Help About
FAQ
Access RSS
API
Stats

Use of this site constitutes acceptance of our User Agreement and Privacy Policy.

Powered by the version 2.3.6