Hello there, let's start from scratch: In my experience (about 2 years of bioinformatics), I've always had to work on samples coming that, unfortunately, come from different batches (i.e.experiments performed in different days and/or samples coming from different hospitals). In my lab, a postdoc showed me how to deal with batch effect by using SVA package if the source of bias is unknown. I guess it is a brilliant idea to work with that, but if I don't go wrong this is not the best tool to cope with that and plus I may know the source of confounding (in other words, I know that samples are generated in different days/come from different hospitals).
My question is more technical and very likely linked to my inexperience. By slavishly following postdoc recommendations, I have always used this SVA package even though - as previously mentioned - it doesn't seem to be so convincing for my purposes. In other words, I would like to deal with this problem in a definite way; I have read a lot of threads opened both in Bioconductor and Biostars but there are a lot of question marks in my head.
Is designing of a model matrix a good way to start, by any chance? I have zero ideas on where to start and if you could help me with some advice, that would be grand! I really need your help guys, cause I am absolutely alone now and don't know who to ask. Before this post was created I have also started a discussion on Biostars and the conclusion was to ask some of the smart guys here.
My understanding is that in a linear model, gene expression is given by the sum of modelled variables (i.e. biological/batch factors) you have to take into account. So, in the simplest scenario:
gene expression ~ factor_1 + batch
Now, let's say I have the following condition (where batches I, II and III are different days of the year):
factor_1 batch sample1 neutrophils_cancer_type1 I sample2 neutrophils_cancer_type1 II sample3 neutrophils_cancer_type2 I sample4 neutrophils_cancer_type2 III sample5 neutrophils_cancer_type1 I sample6 healthy II sample7 healthy II sample8 healthy III sample9 neutrophils_cancer_type2 II
Important note: I am using Kallisto to perform pseudoaligment and then DESeq2 via tximport (i.e. to generate a txi.kallisto.tsv count matrix that will be used as input for DESeq2). So, once you've generated your SampleTable, if your samples come from the same batch I know that you are ready to go with the following:
sampleTable$batch = factor(c("I","II","I","III","I","II","II","III","II")) dds = DESeqDataSetFromTximport(txi.kallisto.tsv, sampleTable, ~batch+condition)
- One should have taken into account the fact that in your experiment there's a batch effect and thus genes that are differentially expressed are truly dependent by biological effect. Correct?
Someone in Biostars told me I cannot use limma:removeBatchEffect() for differential expression analysis, neither in Limma nor in DESeq2. In fact according to Limma guys:
This function is not intended to be used prior to linear modelling. For linear modelling, it is better to include the batch factors in the linear model
batch factors? what are they sorry? probably because, according to the guy who helped me in Biostars, doing differential expression analysis on batch-effect corrected data may lead to higher false-positive and false-negative results (but I don't have a background to say whether or not this speculation is correct). Thus, this allow me to conclude that the only thing I can do is removing batch effect in rlog- or vst-transformed dataset as a visualisation tool in my experiment.
Concluding, it's not clear to me if controlling for a batch effect means that I am telling DESeq2 to model gene expression considering that there are different batches in my analysis. Is the same if one says that I have removed the batch effect? Otherwise, in terms of biological results (low false positives/negatives and likelihood to truly address biological question), what’s the difference between controlling for batch effect and removing it from the analysis? What’s the usefulness of controlling for a bias in your experiment rather than removing the source of its bias? I think, at this stage, is just my misunderstanding!
I know it's a long thread but I am really trying to understand a pivotal thing here. Thanks,