Greetings,
My experimental design consists of two experimental factors, four conditions (+/+, -/+, +/-, -/-) and four biological replicate per condition. The batch effect in this experiment comes from the fact that the first replicate for all conditions was taken separately, the second separately and so on.
Now I have written a somewhat bloated R script that tries to accomplish two separate goals:
(1) find genes that are differentially expressed between presence and absence of the two experimental factors separately, i.e. which are differentially expressed due to factor 1 alone or factor 2 alone.
(2) getting batch-free TMM-normalized FPKM values (where the batch effect has been removed). This will not be used in any differential expression analysis, just for making bar graphs, clustering and heat maps generation.
I accomplished (1) by essentially modifying the Arabidopsis example from the awesome EdgeR user guide to fit this dataset. This worked great.
For (2), my previous idea was to use my matrix with raw counts, log2 transform them, use removeBatchEffect() with a design matrix that lack batch effect factor, remove log2 transformation and use a utility wrapper from the Trinity assembler to get TMM-normalized FPKM values.
After reading the answers to some previous questions:
RNA-Seq, generate batch-free count matrix
Remove batch effect in small RNASeq study (SVA or others?)
...on the Bioconductor support site I now consider something like this:
- make DGEList object from counts.
- filter by total counts to remove non-expressed features.
- TMM-normalization with calcNormFactors().
- log2 transformation.
- remove batch effect with removeBatchEffect() specifying a batch factor and a design matrix with only experimental factors (i.e. without a batch factor).
- use rpkm().
Does this procedure make sense for the goals of being able to make bar graphs for "batch-free" TMM-normalized RPKM values / heat map based on clustering? Does this procedure even produce something that can be reasonable called "batch-free TMM-normalized RPKM values", considering this response:
A: RNA-Seq, generate batch-free count matrix
...to a previous question? If not, what is the technically correct name for the values that the procedure above produces?
Thanks,