I'm currently analyzing a dataset of human Affymetrix HTA 2.0 microarrays for a two-group comparison (healthy subjects versus samples from patients with a chronic autoimmune disease).
After import, pre-processing and normalization, I created some EDA plots to assess any putative batch effects, because the samples come from 3 different studies: all healthy controls belong to a single study/batch, and the disease samples come from the other two. The links for the MDS plot and the hierarchical clustering dendrogram are the following:
(* for simplicity, the different colors in both plots represent the different origin/study, whereas the main condition/label is the Normal & SLE phenotype)
So, from an initial inspection of the above 2 plots, there does not seem to be any severe batch effect regarding the origin/study (Additional HCs = control samples, SLE = ILLUMINATE-1 & ILLUMINATE-2) that would call for a strong correction. However, to be safe for any downstream statistical comparison with limma, should I just include the batch information in my linear model, in order to take it into account?
Or, considering the following sample counts:
Additional HCs   ILLUMINATE-1   ILLUMINATE-2
            30             74             76
group <- pData(eset.rma)$characteristics_ch1.2.group # main variable for downstream DE comparison
comb <- paste0(pData(eset.rma)$characteristics_ch1.2.group,
Normal_Additional HCs   SLE_ILLUMINATE-1   SLE_ILLUMINATE-2
                   30                 74                 76
since the "batches" differ in size, is it generally inadvisable to include a batch adjustment in the design matrix at all?
Or, even though I do not see a strong batch effect in the initial plots above, is my batch variable possibly confounded with my condition of interest, so that some batch-effect correction, such as ComBat, should be applied?
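To make the confounding concern concrete, here is a minimal sketch in base R. The `group` and `batch` vectors are hypothetical stand-ins rebuilt only from the counts shown above (they are not my actual `pData` columns), and the check simply compares the rank of the design matrix with its number of columns:

```r
# Hypothetical annotation rebuilt from the counts above (assumption, not the real pData)
n_per_study <- c(30, 74, 76)
group <- factor(rep(c("Normal", "SLE", "SLE"), times = n_per_study))
batch <- factor(rep(c("Additional HCs", "ILLUMINATE-1", "ILLUMINATE-2"),
                    times = n_per_study))

# Cross-tabulation: each study contains only one phenotype
print(table(group, batch))

# Design with both condition and study as additive terms
design_full <- model.matrix(~ group + batch)
rank_full   <- qr(design_full)$rank

# Design with the condition only
design_group <- model.matrix(~ group)
rank_group   <- qr(design_group)$rank

# A design is full rank only if its rank equals its number of columns
cat("group + batch: columns =", ncol(design_full),  " rank =", rank_full,  "\n")
cat("group only:    columns =", ncol(design_group), " rank =", rank_group, "\n")
```

If I read this correctly, the `group + batch` design has 4 columns but only rank 3, because the SLE indicator is the sum of the two ILLUMINATE indicators. That would mean the Normal vs SLE effect cannot be separated from the study effect in this layout, whichever adjustment method is used.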
Thank you in advance,