I have a large (bulk) RNA-seq data set with ~1500 samples, i.e. ~30 multiplexing runs (library prep and sequencing) with in total ~500 different conditions in triplicates (conditions are somehow randomized across runs). The ultimate goal is to test all 500 conditions against the wildtype-controls (which are present in many but not all multiplexing runs), taking into account 1) the batch effect originating from the multiplexing runs and 2) an additional unwanted biological source of variation for which we have negative controls: Our cells are haploid by default but tend to become diploid, even the WT-controls. That's why we also have known haploid WT-controls and known diploid WT-controls.
Can I use RUVs with the known haploid and diploid WT-controls as negative controls to account for the diploidization effect and the multiplexing runs?
My approaches so far:
A) Running RUVs() with 'condition' as indicator for the replicate samples, while the known haploid and diploid WT-controls have the same condition:
RUVs(x=counts(dds), cIdx=rownames(dds), k=10, scIdx=makeGroups(dds$condition))
This results in the first 8 latent factors being correlated with multiplexing runs, and factors 9 and 10 nicely separating known haploid from known diploid WT-controls.
B) Running RUVs() on batch-corrected vst-transformed data
vsd <- vst(dds)
assay(vsd) %<>% limma::removeBatchEffect(vsd$run)
RUVs(x=assay(vsd), cIdx=rownames(vsd), k=10, scIdx=makeGroups(vsd$condition), isLog = TRUE)
Here, the first 6 latent factors separate a few strong phenotypes, while factor 7 captures the diploidization.
What is a good design for differential testing?
- A) with design ~condition + run + factor 9 + factor10
- A) with design ~condition + factor1 + ... + factor10
- B) with design ~condition + run + factor7
- something else