Essi,
I saved your info file to my Desktop and tabulated your covariate and
batch:
> sam=read.table("Desktop/sample_info.txt",header=T)
> table(sam)
From this I notice that you have sample IDs that are shared between
Batches 1 and 4 as well as Batches 3 and 4. However Batch 2 has no
shared samples with any other batch. You cannot combine batches based
on shared/common samples if you don't have anything shared! Hence the
singularity error because sample IDs and batch are completely
confounded in Batch 2.
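In case it helps, here is roughly the check I did. I'm guessing the
sample ID column is called Sample (the Batch name comes from your own
code below); adjust to your actual headers:
> # cross-tabulate sample ID against batch (column name 'Sample' assumed)
> tab <- with(sam, table(Sample, Batch))
> # samples that were hybridized in more than one batch:
> rownames(tab)[rowSums(tab > 0) > 1]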
From my perspective you have two options: 1) remove Batch 2 and
combine Batches 1, 3 and 4 based on sample ID, or 2) go back to using
your low/medium/high covariate as 'Covariate 1' and ignore the
repeated samples. I suppose there is also a third option: a two-step
batch correction, first doing 1) and then applying 2) to the result
of 1). A rough sketch of 1) is below.
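As a sketch of option 1 (also the first step of option 3), assuming
the info file has a Sample column holding the shared sample IDs and
that the columns of exprs_data are in the same order as the rows of
sample_info:
> library(sva)  # Bioconductor package providing ComBat
> # drop the Batch 2 arrays, then combine Batches 1, 3 and 4 with
> # sample ID as the covariate so between-sample variation is preserved
> keep <- sample_info$Batch != 2
> mod1 <- model.matrix(~as.factor(Sample), data=sample_info[keep, ])
> step1 <- ComBat(exprs_data[, keep], sample_info$Batch[keep], mod1)
For the two-step version you would then combine step1 with the Batch 2
arrays and run ComBat again with the low/medium/high covariate
(ignoring the repeated samples); at that point the corrected arrays
effectively form a single batch against Batch 2.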
Hope this helps.
Evan
On Aug 12, 2013, at 6:29 AM, Essi Laajala wrote:
Dear Evan,
Thank you for your message! That's a good plan, but shouldn't it lead
to the singularity error? At least that's what happens for me. I've
been using the old ComBat script and now I tried the
Bioconductor version as well. I've attached the real sample_info file.
(Sorry it's a bit complicated: there are actually 4 batches and batch
4 is the one with the re-hybridizations. I have 113 samples but 15 are
re-hybridized so altogether 128 arrays. The "high risk" group has the
label E, "medium risk" is T and "low risk" is P.) Here's what I did
with the Bioconductor ComBat:
> library(sva)
> b <- sample_info[,"Batch"]
> mm <- model.matrix(~as.factor(Covariate1), data=sample_info)
> data_combat <- ComBat(exprs_data, b, mm)
Found 4 batches
Found 112 categorical covariate(s)
Standardizing Data across genes
Error in solve.default(t(design) %*% design) :
Lapack routine dgesv: system is exactly singular: U[51,51] = 0
Best regards,
Essi
On Sun, Aug 11, 2013 at 2:57 AM, Johnson, William Evan
<wej@bu.edu> wrote:
Hi Essi,
Yes, ComBat can definitely utilize this information. Just replace your
current 'Covariate 1' with a covariate that just has the sample letter
(e.g. A, B, C, C, D, D, E, ... ). Note that this will be sufficient
because your 'Covariate 1' is nested within sample letter. Under this
setup, ComBat will preserve all variation due to sample type (and as a
result risk level) and effectively just use the repeated samples to
adjust for batch.
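In code it could look something like this (the object names
sample_info/exprs_data, the ArrayName column, and the naming pattern
are just guesses based on your example table):
> library(sva)
> # derive the sample letter from the array name, e.g. "sample_C_2" -> "C"
> letter <- sub("^sample_([A-Z]).*", "\\1", sample_info$ArrayName)
> mod <- model.matrix(~as.factor(letter))
> corrected <- ComBat(exprs_data, sample_info$Batch, mod)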
Hope this helps. Thanks!
Evan
On Aug 9, 2013, at 8:23 AM, Essi Laajala wrote:
> Hi,
>
> I'm dealing with quite an unusual study design. Originally (due to
unfortunate and inevitable circumstances) we had all "high_risk" and
"medium_risk" samples on batch 1 and "low_risk" samples on batch 2.
Then we discussed the batch effect and decided to re-hybridize some
randomly selected samples from each risk group on batch 3. The
resulting study design looks a bit like this (in reality we have
30-45 samples in each group and 16 samples re-hybridized, but you'll
get the idea):
>
> Array name Batch Covariate 1
> sample_A 1 High_risk
> sample_B 1 High_risk
> sample_C 1 High_risk
> sample_C_2 3 High_risk
> sample_D 1 High_risk
> sample_D_2 3 High_risk
> sample_E 1 Medium_risk
> sample_F 1 Medium_risk
> sample_G 1 Medium_risk
> sample_G_2 3 Medium_risk
> sample_H 2 Low_risk
> sample_I 2 Low_risk
> sample_J 2 Low_risk
> sample_J_2 3 Low_risk
> sample_K 2 Low_risk
> sample_K_2 3 Low_risk
>
> For example, Sample_C and Sample_C_2 are the same RNA sample and the
only difference between them is the batch (the same applies to D and
D_2, etc.). Such array pairs should be valuable for estimating batch
effects. The question is: can ComBat utilize this information? Or can
you recommend some other batch correction method that could? For now,
I've applied ComBat after removing the replicated samples from batches
1 and 2 (C, D, G, J and K in the above example), but this is certainly
not an optimal solution.
>
> Best regards,
>
> Essi Laajala
> PhD student in bioinformatics
> Turku, Finland
>
<sample_info.txt>