Question

ComBat: Could it utilize technical replicates?

3


Essi Laajala ▴ 50

@essi-laajala-6085

Last seen 9.6 years ago

Hi,

I'm dealing with quite an unusual study design. Originally (due to unfortunate and unavoidable circumstances) we had all "high_risk" and "medium_risk" samples on batch 1 and all "low_risk" samples on batch 2. Then we discussed the batch effect and decided to re-hybridize some randomly selected samples from each risk group on batch 3. The resulting study design looks a bit like this (in reality we have 30 - 45 samples in each group and 16 samples re-hybridized, but you'll get the idea):

    Array name    Batch    Covariate 1
    sample_A      1        High_risk
    sample_B      1        High_risk
    sample_C      1        High_risk
    sample_C_2    3        High_risk
    sample_D      1        High_risk
    sample_D_2    3        High_risk
    sample_E      1        Medium_risk
    sample_F      1        Medium_risk
    sample_G      1        Medium_risk
    sample_G_2    3        Medium_risk
    sample_H      2        Low_risk
    sample_I      2        Low_risk
    sample_J      2        Low_risk
    sample_J_2    3        Low_risk
    sample_K      2        Low_risk
    sample_K_2    3        Low_risk

For example, sample_C and sample_C_2 are the same RNA sample, and the only difference between them is the batch (the same applies to D and D_2, etc.). Such array pairs should be valuable for estimating batch effects. The question is: can ComBat utilize this information? Or can you recommend some other batch correction method that could? For now, I've applied ComBat after removing the replicated samples on batches 1 and 2 (C, D, G, J and K in the above example), but this is certainly not an optimal solution.

Best regards,

Essi Laajala
PhD student in bioinformatics
Turku, Finland

• 2.0k views

ADD COMMENT • link updated 10.7 years ago by W. Evan Johnson ▴ 850 • written 10.7 years ago by Essi Laajala ▴ 50

score 3 · Answer 1 · 2013-08-10

3


W. Evan Johnson ▴ 850

@w-evan-johnson-5447

Last seen 5 days ago

United States

Hi Essi,

Yes, ComBat can definitely utilize this information. Just replace your current 'Covariate 1' with a covariate that contains only the sample letter (e.g. A, B, C, C, D, D, E, ...). This will be sufficient because your 'Covariate 1' is nested within sample letter. Under this setup, ComBat will preserve all variation due to sample type (and, as a result, risk level) and effectively use only the repeated samples to adjust for batch.

Hope this helps. Thanks!

Evan

ADD COMMENT • link 10.7 years ago W. Evan Johnson ▴ 850
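A minimal base R sketch of the suggestion above, using the toy design from the question. The `Sample` column and the `design` matrix are illustrative names, and the combined matrix (batch indicators plus covariate columns without the intercept) is an assumption about what ComBat fits internally rather than a quote of its source. The sketch only checks that the sample-letter covariate gives a full-rank design for this toy layout; it does not run ComBat.

```r
# Toy design copied from the question: 11 RNA samples on 16 arrays, 3 batches.
sample_info <- data.frame(
  Array = c("sample_A", "sample_B", "sample_C", "sample_C_2", "sample_D",
            "sample_D_2", "sample_E", "sample_F", "sample_G", "sample_G_2",
            "sample_H", "sample_I", "sample_J", "sample_J_2", "sample_K",
            "sample_K_2"),
  Batch = c(1, 1, 1, 3, 1, 3, 1, 1, 1, 3, 2, 2, 2, 3, 2, 3),
  stringsAsFactors = FALSE
)

# Sample letter: strip the "sample_" prefix and the "_2" replicate suffix,
# so that e.g. sample_C and sample_C_2 both become "C".
sample_info$Sample <- sub("_2$", "", sub("^sample_", "", sample_info$Array))

# Covariate model with one level per RNA sample (replaces the risk covariate).
mod <- model.matrix(~ factor(Sample), data = sample_info)

# Assumed combined design: one indicator column per batch, plus the
# covariate columns with the intercept dropped.
batchmod <- model.matrix(~ -1 + factor(Batch), data = sample_info)
design <- cbind(batchmod, mod[, -1, drop = FALSE])

n_cols <- ncol(design)
design_rank <- qr(design)$rank
full_rank <- design_rank == n_cols
```

Here batch 3 shares samples with both batch 1 and batch 2, so every batch is linked to the others through at least one replicated sample, and `full_rank` comes out `TRUE`.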

0


Dear Evan,

Thank you for your message! That's a good plan, but shouldn't it lead to the singularity error? At least that's what happens for me. I've been using the old ComBat script, and now I have tried the Bioconductor version as well. I've attached the real sample_info file. (Sorry, it's a bit complicated: there are actually 4 batches, and batch 4 is the one with the re-hybridizations. I have 113 samples but 15 are re-hybridized, so altogether 128 arrays. The "high risk" group has the label E, "medium risk" is T and "low risk" is P.)

Here's what I did with the Bioconductor ComBat:

    > library(sva)
    > b <- sample_info[,"Batch"]
    > mm <- model.matrix(~as.factor(Covariate1), data=sample_info)
    > data_combat <- ComBat(exprs_data, b, mm)
    Found 4 batches
    Found 112 categorical covariate(s)
    Standardizing Data across genes
    Error in solve.default(t(design) %*% design) :
      Lapack routine dgesv: system is exactly singular: U[51,51] = 0

Best regards,
Essi
Attachment (sample_info.txt):

"Batch" "Covariate1"
"E0107_53_E05.CEL" 1 "E0107"
"E0112_54_E06.CEL" 1 "E0112"
"E0116_55_E07.CEL" 1 "E0116"
"E022_13_B01.CEL" 1 "E022"
"E024_14_B02.CEL" 1 "E024"
"E025_15_B03.CEL" 1 "E025"
"E026_16_B04.CEL" 1 "E026"
"E027_17_B05.CEL" 1 "E027"
"E029_130023_E07.CEL" 4 "E029"
"E029_18_B06.CEL" 1 "E029"
"E031_19_B07.CEL" 1 "E031"
"E033_20_B08.CEL" 1 "E033"
"E036_21_B09.CEL" 1 "E036"
"E044_22_B10.CEL" 1 "E044"
"E048_23_B11.CEL" 1 "E048"
"E049_24_B12.CEL" 1 "E049"
"E050_25_C01.CEL" 1 "E050"
"E051_26_C02.CEL" 1 "E051"
"E052_27_C03.CEL" 1 "E052"
"E054_28_C04.CEL" 1 "E054"
"E057_30_C06.CEL" 1 "E057"
"E060_31_C07.CEL" 1 "E060"
"E061_32_C08.CEL" 1 "E061"
"E063_33_C09.CEL" 1 "E063"
"E066_34_C10.CEL" 1 "E066"
"E067_130023_F07.CEL" 4 "E067"
"E067_35_C11.CEL" 1 "E067"
"E068_36_C12.CEL" 1 "E068"
"E069_37_D01.CEL" 1 "E069"
"E070_38_D02.CEL" 1 "E070"
"E071_39_D03.CEL" 1 "E071"
"E074_40_D04.CEL" 1 "E074"
"E082_41_D05.CEL" 1 "E082"
"E083_42_D06.CEL" 1 "E083"
"E086_130023_G07.CEL" 4 "E086"
"E086_43_D07.CEL" 1 "E086"
"E088_44_D08.CEL" 1 "E088"
"E091_45_D09.CEL" 1 "E091"
"E093_46_D10.CEL" 1 "E093"
"E096_47_D11.CEL" 1 "E096"
"E098_48_D12.CEL" 1 "E098"
"E102_49_E01.CEL" 1 "E102"
"E104_130023_H07.CEL" 4 "E104"
"E104_50_E02.CEL" 1 "E104"
"E105_51_E03.CEL" 1 "E105"
"E106_52_E04.CEL" 1 "E106"
"E118_56_E08.CEL" 1 "E118"
"E120_57_E09.CEL" 1 "E120"
"E121_58_110049_E09.CEL" 2 "E121"
"E125_59_110049_F09.CEL" 2 "E125"
"E128_60_110049_G09.CEL" 2 "E128"
"E133_61_110049_H09.CEL" 2 "E133"
"P001_130003_A05.CEL" 3 "P001"
"P005_130003_B05.CEL" 3 "P005"
"P009_130003_C05.CEL" 3 "P009"
"P014_130003_D05.CEL" 3 "P014"
"P017_130003_E05.CEL" 3 "P017"
"P017_130023_A05.CEL" 4 "P017"
"P020_130003_G05.CEL" 3 "P020"
"P021_130003_H05.CEL" 3 "P021"
"P024_130003_A07.CEL" 3 "P024"
"P025_130003_B07.CEL" 3 "P025"
"P025_130023_C05.CEL" 4 "P025"
"P026_130003_C07.CEL" 3 "P026"
"P027_130003_D07.CEL" 3 "P027"
"P028_130003_E07.CEL" 3 "P028"
"P030_130003_F07.CEL" 3 "P030"
"P031_130003_G07.CEL" 3 "P031"
"P033_130003_H07.CEL" 3 "P033"
"P033_130023_D05.CEL" 4 "P033"
"P035_130003_A09.CEL" 3 "P035"
"P036_130003_B09.CEL" 3 "P036"
"P039_130003_C09.CEL" 3 "P039"
"P041_130003_D09.CEL" 3 "P041"
"P041_130023_E05.CEL" 4 "P041"
"P042_130003_E09.CEL" 3 "P042"
"P042_130023_F05.CEL" 4 "P042"
"P044_130003_F09.CEL" 3 "P044"
"P045_130003_G09.CEL" 3 "P045"
"P046_130003_H09.CEL" 3 "P046"
"P047_130003_A05.CEL" 3 "P047"
"P048_130003_B05.CEL" 3 "P048"
"P052_130003_C05.CEL" 3 "P052"
"P054_130003_D05.CEL" 3 "P054"
"P055_130003_E05.CEL" 3 "P055"
"P056_130003_F05.CEL" 3 "P056"
"P061_130003_G05.CEL" 3 "P061"
"P063_130003_H05.CEL" 3 "P063"
"P066_130003_A07.CEL" 3 "P066"
"P066_130023_G05.CEL" 4 "P066"
"P067_130003_B07.CEL" 3 "P067"
"P070_130003_C07.CEL" 3 "P070"
"P072_130003_D07.CEL" 3 "P072"
"P073_130003_E07.CEL" 3 "P073"
"P074_130003_F07.CEL" 3 "P074"
"P075_130003_G07.CEL" 3 "P075"
"P077_130003_H07.CEL" 3 "P077"
"P082_130003_A09.CEL" 3 "P082"
"P082_130023_H05.CEL" 4 "P082"
"T021_130023_A07.CEL" 4 "T021"
"T021_68_F04.CEL" 1 "T021"
"T032_71_F07.CEL" 1 "T032"
"T038_72_F08.CEL" 1 "T038"
"T056_73_F09.CEL" 1 "T056"
"T059_74_F10.CEL" 1 "T059"
"T062_75_F11.CEL" 1 "T062"
"T063_76_F12.CEL" 1 "T063"
"T064_77_G01.CEL" 1 "T064"
"T065_78_G02.CEL" 1 "T065"
"T066_79_G03.CEL" 1 "T066"
"T069_80_G04.CEL" 1 "T069"
"T070_81_G05.CEL" 1 "T070"
"T071_82_G06.CEL" 1 "T071"
"T073_83_G07.CEL" 1 "T073"
"T076_84_G08.CEL" 1 "T076"
"T077_85_G09.CEL" 1 "T077"
"T078_130023_B07.CEL" 4 "T078"
"T078_86_G10.CEL" 1 "T078"
"T093_110049_H09.CEL" 1 "T093"
"T094_90_H02.CEL" 1 "T094"
"T095_130023_C07.CEL" 4 "T095"
"T095_91_H03.CEL" 1 "T095"
"T099_93_H05.CEL" 1 "T099"
"T103_96_H08.CEL" 1 "T103"
"T106_98_H10.CEL" 1 "T106"
"T109_130023_D07.CEL" 4 "T109"
"T109_99_H11.CEL" 1 "T109"
"T111_100_H12.CEL" 1 "T111"

ADD REPLY • link 10.7 years ago Essi Laajala ▴ 50
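One way to see a singularity like this coming before calling ComBat is to compare the rank of the combined design with its number of columns. The sketch below is a hedged illustration in base R: the `sam` data frame is a hypothetical miniature of the real layout (batch 4 holds re-hybridizations from batches 1 and 3 only), `combined_design()` is a made-up helper, and the matrix it builds is assumed to resemble the one ComBat inverts, not taken from the package source.

```r
# Hypothetical miniature of the real design: batch 4 re-hybridizes samples
# from batches 1 and 3, but nothing from batch 2.
sam <- data.frame(
  Batch      = c(1, 1, 1, 2, 2, 3, 3, 4, 4, 4),
  Covariate1 = c("E1", "E2", "T1", "E3", "E4", "P1", "P2", "E1", "T1", "P1"),
  stringsAsFactors = FALSE
)

# Illustrative helper: batch indicators plus the sample-ID covariate columns
# without the intercept (assumed to mirror the design ComBat solves).
combined_design <- function(batch, covariate) {
  batchmod <- model.matrix(~ -1 + factor(batch))
  mod <- model.matrix(~ factor(covariate))
  cbind(batchmod, mod[, -1, drop = FALSE])
}

design <- combined_design(sam$Batch, sam$Covariate1)

n_cols <- ncol(design)            # 4 batch columns + 6 sample columns
design_rank <- qr(design)$rank    # numerical rank from the QR decomposition
rank_deficient <- design_rank < n_cols
```

When `rank_deficient` is `TRUE`, `t(design) %*% design` cannot be inverted, which is consistent with the `system is exactly singular` error reported above.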

score 0 · Answer 2 · 2013-08-12

Essi,

I saved your info file to my Desktop and tabulated your covariate and batch:

    > sam=read.table("Desktop/sample_info.txt",header=T)
    > table(sam)

From this I notice that you have sample IDs that are shared between Batches 1 and 4 as well as between Batches 3 and 4. However, Batch 2 has no samples in common with any other batch. You cannot combine batches based on shared samples if nothing is shared! Hence the singularity error: sample ID and batch are completely confounded in Batch 2.

From my perspective you have two options:

1) remove Batch 2 and combine Batches 1, 3 and 4 based on sample ID, or
2) change your Covariate 1 back to the low/medium/high risk covariate and ignore the repeated samples.

I guess there is also a third option, which would be a two-step batch correction: first do 1), then follow it with 2) using the result from 1).

Hope this helps.

Evan
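The tabulation and option 1 can be sketched in base R as follows. This is an illustration under stated assumptions, not the answer's exact procedure: `sam` is a hypothetical miniature of the attached file, `isolated_batches` and `sam_linked` are made-up names, and the ComBat call is left commented out because it needs the `sva` package and an expression matrix (`exprs_data` is assumed to have its columns in the same order as the rows of `sam`).

```r
# Hypothetical miniature of the attached sample_info file.
sam <- data.frame(
  Batch      = c(1, 1, 1, 2, 2, 3, 3, 4, 4, 4),
  Covariate1 = c("E1", "E2", "T1", "E3", "E4", "P1", "P2", "E1", "T1", "P1"),
  stringsAsFactors = FALSE
)

# Rows are sample IDs, columns are batches.
tab <- table(sam$Covariate1, sam$Batch)

# Samples hybridized in more than one batch are the only links between batches.
shared <- rowSums(tab > 0) > 1

# A batch containing none of the shared samples cannot be tied to the others.
linked <- colSums(tab[shared, , drop = FALSE]) > 0
isolated_batches <- colnames(tab)[!linked]

# Option 1: drop the arrays from isolated batches and keep sample ID as the
# covariate for the remaining batches.
keep <- !(as.character(sam$Batch) %in% isolated_batches)
sam_linked <- sam[keep, ]
mod <- model.matrix(~ factor(Covariate1), data = sam_linked)

# Hypothetical call, not run here:
# library(sva)
# corrected <- ComBat(exprs_data[, keep], batch = sam_linked$Batch, mod = mod)
```

This check only flags batches with no replicated sample at all. Batches can also split into groups that are linked internally but not to each other, so comparing the rank of the combined design with its column count is the more general test.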