I have an RNA-seq experiment with 5 time points and 2 cell types (2*5=10 conditions), with 3 replicates each. The experiment can be further divided into two sub-experiments, cell type generation (3 time points * 2 cell types) and cell type re-stimulation (2 time points * 2 cell types), since we are not interested in comparing cell type generation samples to cell type re-stimulation samples.

```
cell1 12hr
cell1 24hr
cell1 48hr
cell2 12hr
cell2 24hr
cell2 48hr
cell1 rest0hr
cell1 restim4hr
cell2 rest0hr
cell2 restim4hr
```

I performed TMM normalization using all 30 samples. However, in exploratory visualization, it is hard to see any obvious condition-wise clustering in a PCA plot of all 30 samples. [Figure: PCA plot of all samples]
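A minimal sketch of this normalization and PCA step. The simulated `counts` matrix, the gene names, and the sample-name format `cell1_12hr_rep1` are assumptions standing in for the real data, and PCA on log-CPM values via `prcomp` is one common choice rather than necessarily what was used for the plot above.

```r
library(edgeR)

# Hypothetical stand-in for the real data: 1000 genes x 30 samples of simulated counts
set.seed(1)
conds <- c(paste0("cell", rep(1:2, each = 3), "_", c("12hr", "24hr", "48hr")),
           paste0("cell", rep(1:2, each = 2), "_", c("rest0hr", "restim4hr")))
sample_names <- paste0(rep(conds, each = 3), "_rep", 1:3)
counts <- matrix(rnbinom(1000 * 30, mu = 100, size = 10), nrow = 1000,
                 dimnames = list(paste0("gene", 1:1000), sample_names))

# TMM normalization using all 30 samples
y <- DGEList(counts = counts)
y <- calcNormFactors(y, method = "TMM")

# PCA on log2 counts-per-million (samples as rows)
logcpm <- cpm(y, log = TRUE, prior.count = 2)
pca <- prcomp(t(logcpm), center = TRUE, scale. = FALSE)
var_explained <- pca$sdev^2 / sum(pca$sdev^2)

# Color points by condition (sample name without the replicate suffix)
condition <- factor(sub("_rep[0-9]$", "", colnames(logcpm)))
plot(pca$x[, 1], pca$x[, 2], col = as.integer(condition), pch = 19,
     xlab = sprintf("PC1 (%.1f%%)", 100 * var_explained[1]),
     ylab = sprintf("PC2 (%.1f%%)", 100 * var_explained[2]))
```

The same code can be run on a subset of columns (for example only the 18 generation samples) to reproduce the per-sub-experiment plots.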

Thus, I did PCA independently on each sub-experiment (cell type generation, cell type re-stimulation). The clusters become a little more obvious, but condition-wise clustering is still hard to see. Here is the PCA for the cell type generation experiment as an example. [Figure: cell type generation PCA plot]

Finally, I did PCA only on the samples I want to compare pairwise (e.g. cell type comparison at each time point: cell1_vs_cell2 at 12hr, cell1_vs_cell2 at 24hr, ...; and time point comparison within a cell type: cell1 restim4hr vs. 0hr, ...). Condition clustering becomes obvious, and replicate batch is the major confounding factor here. Here is cell1_vs_cell2 at 24hr as an example. [Figure: cell1_vs_cell2 at 24hr PCA]

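```
library(edgeR)
library(stringr)  # provides str_extract()

# prepare design: group = cell type + time point, batch = replicate
groups <- factor(gsub("_rep[0-9]", "", colnames(y$counts)))
groups <- relevel(groups, ref = "cell1_12hr")
batch <- factor(str_extract(colnames(y$counts), "rep[0-9]"))
batch <- relevel(batch, ref = "rep1")
# additive model: replicate batch as a blocking factor plus one coefficient per group
design <- model.matrix(~batch+groups, data=y$samples)
```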

For the differential analysis in edgeR, I did something similar to my exploratory analysis. Using all 30 samples, I estimated the dispersion (`estimateDisp`), fit the model (`glmQLFit`), and performed the test (`glmQLFTest`) to get DE genes for each pairwise comparison (e.g. cell2_vs_cell1 at 24hr). However, I got no differential genes.
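A minimal sketch of the all-samples approach for one comparison (cell2_vs_cell1 at 24hr). The simulated `counts` matrix and the sample-name format `cell1_12hr_rep1` are assumptions, and the contrast vector `con` is an illustrative way to express the comparison under the `~batch+groups` design with `cell1_12hr` as reference.

```r
library(edgeR)

# Hypothetical stand-in for the real data: 1000 genes x 30 samples of simulated counts
set.seed(1)
conds <- c(paste0("cell", rep(1:2, each = 3), "_", c("12hr", "24hr", "48hr")),
           paste0("cell", rep(1:2, each = 2), "_", c("rest0hr", "restim4hr")))
sample_names <- paste0(rep(conds, each = 3), "_rep", 1:3)
counts <- matrix(rnbinom(1000 * 30, mu = 100, size = 10), nrow = 1000,
                 dimnames = list(paste0("gene", 1:1000), sample_names))

y <- DGEList(counts = counts)
y <- calcNormFactors(y)  # TMM is the default method

# Same design as above: replicate batch + group, reference level cell1_12hr
groups <- relevel(factor(sub("_rep[0-9]$", "", colnames(y))), ref = "cell1_12hr")
batch <- factor(sub(".*_(rep[0-9])$", "\\1", colnames(y)))
design <- model.matrix(~ batch + groups)  # 30 samples, 12 coefficients

# Dispersion and QL fit use all 30 samples
y <- estimateDisp(y, design)
fit <- glmQLFit(y, design)

# Both groups are non-reference levels, so the comparison is a difference
# of two coefficients: cell2_24hr - cell1_24hr
con <- setNames(numeric(ncol(design)), colnames(design))
con["groupscell2_24hr"] <- 1
con["groupscell1_24hr"] <- -1

qlf <- glmQLFTest(fit, contrast = con)
topTags(qlf)
summary(decideTests(qlf))
```

Because neither group is the reference level here, testing a single coefficient with `coef=` would instead compare that group against `cell1_12hr`, which is worth checking when a pairwise test returns nothing.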

Then I tried subsetting samples by sub-experiment (e.g. the cell type generation sub-experiment includes triplicates of cell1_12hr, cell2_12hr, cell1_24hr, cell2_24hr, cell1_48hr, cell2_48hr) and used those 18 samples to test for differential genes pairwise (e.g. cell2_vs_cell1 at 24hr). Still, I got no differential genes.

Finally, I subset the samples to only the two groups in each pairwise comparison (e.g. cell2_vs_cell1 at 24hr includes triplicates of cell1_24hr and cell2_24hr), then used those samples to re-estimate the dispersion (`estimateDisp`), fit the model (`glmQLFit`), and perform the test (`glmQLFTest`). With the smaller number of samples, I got a smaller dispersion, which reflects only the samples I want to compare, and I was able to get hundreds of differential genes.
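A minimal sketch of this pairwise-subset approach, for comparison with the all-samples fit. The simulated `counts` matrix and the sample-name format `cell1_24hr_rep1` are assumptions, and keeping the TMM factors computed on all 30 samples when subsetting is just one possible choice.

```r
library(edgeR)

# Hypothetical stand-in for the real data: 1000 genes x 30 samples of simulated counts
set.seed(1)
conds <- c(paste0("cell", rep(1:2, each = 3), "_", c("12hr", "24hr", "48hr")),
           paste0("cell", rep(1:2, each = 2), "_", c("rest0hr", "restim4hr")))
sample_names <- paste0(rep(conds, each = 3), "_rep", 1:3)
counts <- matrix(rnbinom(1000 * 30, mu = 100, size = 10), nrow = 1000,
                 dimnames = list(paste0("gene", 1:1000), sample_names))

y <- DGEList(counts = counts)
y <- calcNormFactors(y)  # TMM on all 30 samples

# Keep only the six samples in the comparison; column subsetting retains
# the normalization factors computed above
keep <- grepl("^cell[12]_24hr_", colnames(y))
y_sub <- y[, keep]

groups <- factor(sub("_rep[0-9]$", "", colnames(y_sub)),
                 levels = c("cell1_24hr", "cell2_24hr"))
batch <- factor(sub(".*_(rep[0-9])$", "\\1", colnames(y_sub)))
design_sub <- model.matrix(~ batch + groups)  # 6 samples, 4 coefficients

# Dispersion is re-estimated from these six samples only
y_sub <- estimateDisp(y_sub, design_sub)
fit_sub <- glmQLFit(y_sub, design_sub)

# cell1_24hr is now the reference, so a single coefficient is the comparison
qlf_sub <- glmQLFTest(fit_sub, coef = "groupscell2_24hr")

y_sub$common.dispersion
summary(decideTests(qlf_sub))
```

Printing `common.dispersion` here and for the all-samples fit makes the difference in estimated dispersion between the two approaches explicit.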

So my question is: is it valid to subset the dataset for differential analysis? Which method (using all samples, subsetting samples by sub-experiment, or subsetting to only the samples of the pairwise comparison groups) generates more reliable results?