Much has been written and asked on this topic, but I'm still a bit confused and would very much appreciate a clear guideline on how to handle batch effects in omics datasets (in my case proteomics, but it's probably applicable to any).
I have two data frames: one contains my expression values (rows = samples, cols = proteins), the other contains sample annotations with various characteristics (rows = samples, cols = variables).
I expect a large batch effect from the date on which each sample was measured in the mass spec, simply because this is what everyone has told me. (Apparently this used to be very bad, making it impossible to compare samples measured on different days, but we are using data-independent acquisition, which appears to be much more robust.) Anyway, there ought to be a batch effect here.
Secondly, there are other annotation variables that I expect to have an effect, e.g. the country the sample was collected in, the date the sample was collected, and the gender/age of the patient. By that I mean each of these variables probably has an effect that does not come from the patient's genotype affecting the cell proteome (the effect I actually want to measure).
I can get SVA to run fine, protecting the differences between my genotypes and correcting for the known batch effect of date.measured, but it tells me that there are 4 additional surrogate variables it will also correct for. It passes this info to ComBat, and out comes a corrected data frame that results in a fine-looking PCA.
My question is how to shed more light on the effects these variables have on my data, as it feels quite like a black box to just correct for them without quantifying or plotting them.
So how can I quantify and plot the effect my suspected variables (date.measured, date.collected, age, sex) actually have on my data? Are there any examples of how to do this, or recommendations on which method to use?
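To make the question concrete, here is a minimal sketch of the kind of diagnostic I have in mind: for each annotation variable, compute how much of the variance of each top principal component it explains (R² from a simple linear model). It is written in Python with NumPy purely as an illustration, not in R, and it is not part of sva or ComBat. The function name `pc_variance_explained` and the toy data (a simulated shift between two measurement days) are made up for this example, and it assumes a complete samples x proteins matrix without missing values.

```python
import numpy as np

def pc_variance_explained(expr, covariates, n_pcs=5):
    """R^2 of each covariate against each of the top principal components.

    expr: (samples x proteins) array, assumed complete (no missing values).
    covariates: dict mapping a name to a 1-D array with one entry per sample.
    Numeric arrays are treated as continuous; anything else as a factor.
    Returns (share of total variance per PC, dict name -> R^2 per PC).
    """
    expr = np.asarray(expr, dtype=float)
    centered = expr - expr.mean(axis=0)
    # PCA via SVD: u * s gives the sample scores on each component.
    u, s, _ = np.linalg.svd(centered, full_matrices=False)
    n_pcs = min(n_pcs, len(s))
    scores = u[:, :n_pcs] * s[:n_pcs]
    pc_share = s[:n_pcs] ** 2 / np.sum(s ** 2)

    r2 = {}
    for name, values in covariates.items():
        values = np.asarray(values)
        if np.issubdtype(values.dtype, np.number):
            design = values.astype(float).reshape(-1, 1)
        else:
            # One-hot encode the factor, dropping the first level
            # because the intercept already accounts for it.
            levels = np.unique(values)
            design = (values[:, None] == levels[None, 1:]).astype(float)
        design = np.column_stack([np.ones(len(values)), design])
        coef, *_ = np.linalg.lstsq(design, scores, rcond=None)
        residuals = scores - design @ coef
        ss_res = (residuals ** 2).sum(axis=0)
        ss_tot = ((scores - scores.mean(axis=0)) ** 2).sum(axis=0)
        r2[name] = 1.0 - ss_res / ss_tot
    return pc_share, r2

# Toy data (hypothetical): 40 samples, 200 proteins, two measurement days.
rng = np.random.default_rng(0)
n_samples, n_proteins = 40, 200
date_measured = np.repeat(np.array(["day1", "day2"]), n_samples // 2)
sex = np.tile(np.array(["f", "m"]), n_samples // 2)
age = rng.uniform(20, 70, n_samples)
expr = rng.normal(0.0, 1.0, (n_samples, n_proteins))
expr[date_measured == "day2"] += 3.0  # simulated shift on the second day

pc_share, r2 = pc_variance_explained(
    expr, {"date.measured": date_measured, "sex": sex, "age": age}
)
print("share of variance per PC:", np.round(pc_share, 2))
for name, values in r2.items():
    print(name, np.round(values, 2))
```

In this toy case the simulated day effect should dominate the first component while sex and age explain almost nothing, which is the kind of picture I would like to obtain for my real variables before deciding what to correct. Whether this is a sensible way to do it for real proteomics data is exactly what I am unsure about.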
My goal is not only to quantify and plot these influences but then to decide whether I actually want to correct for them or not (e.g. tell sva to correct for a variable or to protect it). In the case of sex and age I might prefer to keep the effect, but remove the effect of date.measured and date.collected.
If I had identified these factors, I would hope that sva does not find additional surrogate variables.
I would appreciate any help and hints on this. Once I really understand it, I will try to summarise it in a blog post so that there is an easy-to-follow walkthrough of what is surely a common task.
Thanks a lot!