Question

Call DE genes on Unbalanced design with controls

timedreamer ▴ 10

New York University

Hi there,

I searched for this but could not find an answer that solves my problem, so I am posting it here.

Thank you so much in advance!!

I recently received a dataset that has already been sequenced. The idea was that each batch of cells is transfected with one control vector and several gene overexpression vectors. So, in batch one, I have one control vector and six overexpression vectors, each with three replicates. In batch two, I have the same control vector but six different overexpression vectors. The purpose is to find genes that are DE between each overexpression vector and the control. I know it's not a good design, but unfortunately the experiment has already been done. BTW, the batch effect is very obvious on the PCA plot. I currently use edgeR for the analysis. The previous analysis was done in DESeq2 by a colleague.

Question 1 is: does the model design <- model.matrix(~batch+plasmid) make sense, given that only the control vector is repeated across batches? In other words, should I combine all batches and use ~batch+plasmid, OR analyze each batch separately and call DE genes using ~plasmid? I'm not sure which one is statistically better.

Question 2 is: if I repeat the experiment with six randomly picked vectors in each of two batches, will it help? If so, does the benefit come simply from having more replicates?

Question 3 is: if I redo the experiment, do you recommend putting each replicate in a separate batch and trying to fit a Balanced Incomplete Block (BIB) design or something similar? I can't do one replicate for all TFs in one batch (limited material).

A simple case would be like this:

plasmid <- factor(c(rep("control",3),rep("tf1",3),rep("control",3),rep("tf2",3)))
batch <- factor(c(rep("1",6),rep("2",6)))
design <- model.matrix(~batch+plasmid)
design

 (Intercept) batch2 plasmidtf1 plasmidtf2
1            1      0          0          0
2            1      0          0          0
3            1      0          0          0
4            1      0          1          0
5            1      0          1          0
6            1      0          1          0
7            1      1          0          0
8            1      1          0          0
9            1      1          0          0
10           1      1          0          1
11           1      1          0          1
12           1      1          0          1
attr(,"assign")
[1] 0 1 2 2
attr(,"contrasts")
attr(,"contrasts")$`batch`
[1] "contr.treatment"

attr(,"contrasts")$plasmid
[1] "contr.treatment"

edger experimental design • 794 views

score 1 · Answer 1 · 2019-05-17

Given that this is effectively an edgeR question, there's no point putting a DESeq2 tag here unless you want Mike to answer something specific.

Anyway, onto your questions. I'll refer to your simplified design as an example.

1. ~batch+plasmid is fine, assuming that the batch effect is additive. If it's not additive, it's still okay, provided you don't compare plasmidtf1 to plasmidtf2 (i.e., only compare within each batch). It would be unwise to subset your samples and perform the DE analysis separately; you need all the replication you can get.

2. Yes, the more replicates, the better. This gives you more accurate and precise dispersion estimates, which improves power. It also increases the robustness of the analysis to violations of distributional assumptions used in empirical Bayes shrinkage.

3. If you must have batches (e.g., from a logistical perspective), then the ideal design would contain the same number of control, tf1 and tf2 samples in each batch, for multiple batches. But if you can't do that, then a BIB approach would probably be the next best option. Minimize the number of blocks and maximize the overlaps in treatment conditions between blocks, as much as your material limits allow.
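To make point 1 concrete, here is a minimal base R sketch on the simplified design. It uses an ordinary linear model on simulated log-expression values for a single hypothetical gene as a stand-in for edgeR's negative binomial GLM, so the numbers are illustrative only and the object names (y, fit_all, fit_b1, tf1_vs_tf2) are made up for this example. It shows that the tf1-versus-control estimate is the same within-batch difference whether or not the batches are combined, while the combined fit has more residual degrees of freedom for estimating variability.

```r
# Simplified design from the question: control is the only plasmid shared by both batches.
plasmid <- factor(c(rep("control", 3), rep("tf1", 3), rep("control", 3), rep("tf2", 3)))
batch <- factor(c(rep("1", 6), rep("2", 6)))

# Hypothetical log-expression values for one gene (simulated, not real data).
set.seed(1)
y <- rnorm(12, mean = 5) + ifelse(batch == "2", 2, 0) + ifelse(plasmid == "tf1", 1, 0)

# Combined additive model, as in point 1.
fit_all <- lm(y ~ batch + plasmid)

# Separate analysis of batch 1 only, for comparison.
in_b1 <- batch == "1"
fit_b1 <- lm(y[in_b1] ~ droplevels(plasmid[in_b1]))

# The tf1-vs-control estimate is the same within-batch difference either way...
coef(fit_all)["plasmidtf1"]
mean(y[plasmid == "tf1"]) - mean(y[plasmid == "control" & in_b1])

# ...but the combined fit has more residual degrees of freedom.
df.residual(fit_all)  # 12 samples - 4 coefficients
df.residual(fit_b1)   # 6 samples - 2 coefficients

# tf1 vs tf2 compares across batches, so it is only meaningful
# if the batch effect really is additive.
tf1_vs_tf2 <- coef(fit_all)["plasmidtf1"] - coef(fit_all)["plasmidtf2"]
```

In an edgeR analysis, the same design matrix would be passed to estimateDisp() and glmQLFit(), and each TF would be tested against control with glmQLFTest(fit, coef="plasmidtf1") and so on. There the gain from combining batches shows up in the dispersion estimates rather than in a residual variance.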