Hi,
I'm running a differential exon usage analysis on a complex RNA-seq dataset. I have 152 cell line samples split across 4 batches and 2 timepoints, covering 4 product types (each divided into high and low producers) and 1 control type.
The sample information table therefore has 4 variables:
- Batch (1, 2, 3 and 4)
- Timepoint (day5 and day10)
- Product (A, B, C, D and N/A)
- Productivity (high, low and N/A)
Currently, we are interested in running two types of contrasts:
One on the full dataset:
- Product: producers (A, B, C and D) versus non-producers (N/A).
One on each product-specific subset (run separately for A, B, C and D):
- Productivity: high versus low versus N/A.
Currently, we are using a very simple model to test the waters, and it runs and returns results:
full_model <- ~sample + exon + productivity:exon
reduced_model <- ~sample + exon
We are unsure whether we should instead fit a full model with blocking factors on the whole dataset, for example:
full_model <- ~sample + exon + timepoint:exon + batch:exon + productivity:exon
reduced_model <- ~sample + exon + timepoint:exon + batch:exon
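A quick sanity check on this design can be sketched in base R. The sample table below is made up purely for illustration (a balanced 4 x 2 x 3 layout with two replicates, and "none" standing in for the N/A productivity level); it is not the real 152-sample layout. The sketch checks two things: that the reduced model is nested in the full model, and that the sample-level blocking factors are not confounded with each other.

```r
# Hypothetical sample table (illustrative only, NOT the real data):
# 4 batches x 2 timepoints x 3 productivity levels x 2 replicates.
sample_table <- expand.grid(
  batch        = factor(1:4),
  timepoint    = factor(c("day5", "day10"), levels = c("day5", "day10")),
  productivity = factor(c("none", "low", "high"),
                        levels = c("none", "low", "high")),
  replicate    = 1:2
)

# The two formulae from the post.
full_model    <- ~ sample + exon + timepoint:exon + batch:exon + productivity:exon
reduced_model <- ~ sample + exon + timepoint:exon + batch:exon

# Nesting check: every term of the reduced model must also be in the full model.
full_terms    <- attr(terms(full_model), "term.labels")
reduced_terms <- attr(terms(reduced_model), "term.labels")
nested        <- all(reduced_terms %in% full_terms)

# The term(s) actually being tested are those present only in the full model.
tested_terms <- setdiff(full_terms, reduced_terms)

# Confounding check at the sample level: if the model matrix for the blocking
# factors plus the factor of interest is not full rank, some effects cannot be
# separated (for example, a productivity level that occurs in only one batch).
mm        <- model.matrix(~ batch + timepoint + productivity, data = sample_table)
full_rank <- qr(mm)$rank == ncol(mm)

print(nested)
print(tested_terms)
print(full_rank)
```

Swapping in the real sample table for `sample_table` would show whether batch, timepoint and productivity can be estimated together on the full dataset. Note that if Product and Productivity were both included, their N/A levels describe the same control samples and would be expected to be collinear.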
I am aware that interaction effects can also be modelled, but I am unsure whether the analysis would benefit from adding further complexity to the model design.
When we attempt more complex models, we find that runtimes are significantly longer, especially for estimating dispersions and testing for DEU. Therefore, we currently tend to subset the data into smaller chunks (such as by product, timepoint or batch) and run smaller, separate analyses.
Do you have any advice regarding complex models (does ours make sense?), multiple blocking factors and a very large number of samples?
Any help would be greatly appreciated.
Regards,
Ben J Draper