I have a dataset with a very complex setup that I don't seem to be handling well: whatever I do, I end up with a relationship between the fold change and the average expression of the genes.
I have 5 main variables:
- Cell type: stem or differentiated cells
- Group: disease or control
- Location: ileum or colon
- Type: pediatric or adult samples
- Creation: old or new.
So far I have decided to analyse the old and new samples as two separate cohorts, because they come from different experiments with different matrigels, and there were a couple of years in between...
For the remaining variables I used an interaction-style design in which each combination of the variables is its own term in the design:
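To make the idea concrete, here is a minimal sketch in plain Python of what I mean by "one design column per combination". It is not my actual code (that is in R with limma). The sample table, the `group_label` and `contrast` helpers, and the level names are all made up for illustration.

```python
# Hypothetical sample annotations (invented for illustration, not my real data).
samples = [
    {"cell": "stem", "group": "disease", "location": "ileum", "type": "adult"},
    {"cell": "stem", "group": "disease", "location": "ileum", "type": "adult"},
    {"cell": "stem", "group": "control", "location": "ileum", "type": "adult"},
    {"cell": "stem", "group": "control", "location": "ileum", "type": "adult"},
    {"cell": "diff", "group": "disease", "location": "colon", "type": "pediatric"},
    {"cell": "diff", "group": "control", "location": "colon", "type": "pediatric"},
]


def group_label(sample):
    """Collapse the four variables into a single combined factor level."""
    return "_".join(sample[k] for k in ("cell", "group", "location", "type"))


labels = [group_label(s) for s in samples]
levels = sorted(set(labels))  # one design column per observed combination

# Group-means design without an intercept: each row has exactly one 1,
# in the column of the combination that sample belongs to.
design = [[1 if lab == lev else 0 for lev in levels] for lab in labels]


def contrast(plus, minus):
    """Contrast vector over the design columns for the comparison plus - minus."""
    return [int(lev == plus) - int(lev == minus) for lev in levels]


# Disease vs control within stem cells, ileum, adult samples.
stem_ileum_adult = contrast("stem_disease_ileum_adult", "stem_control_ileum_adult")

for lev, weight in zip(levels, stem_ileum_adult):
    print(f"{lev:35s} {weight:+d}")
```

With this layout every comparison is just a difference between two columns, which is why I ended up with a large number of contrasts.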
However, this ends up with comparisons like this one (done via limma):
We can see that the higher the average expression, the bigger the logFC, while I expected that the average expression would not affect the fold change.
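One way to put a number on this trend, rather than judging it by eye, would be a rank correlation between the two columns limma reports (`AveExpr` and `logFC` in `topTable`). The sketch below is plain Python with invented values, not my real results. The `ranks` and `spearman` helpers are written out here only for illustration.

```python
def ranks(values):
    """Return 1-based ranks, giving tied values their average rank."""
    order = sorted(range(len(values)), key=lambda i: values[i])
    result = [0.0] * len(values)
    i = 0
    while i < len(order):
        j = i
        # Extend j over the run of values tied with position i.
        while j + 1 < len(order) and values[order[j + 1]] == values[order[i]]:
            j += 1
        average_rank = (i + j) / 2 + 1
        for k in range(i, j + 1):
            result[order[k]] = average_rank
        i = j + 1
    return result


def spearman(x, y):
    """Spearman rank correlation: Pearson correlation of the ranks."""
    rx, ry = ranks(x), ranks(y)
    n = len(rx)
    mean_x, mean_y = sum(rx) / n, sum(ry) / n
    cov = sum((a - mean_x) * (b - mean_y) for a, b in zip(rx, ry))
    var_x = sum((a - mean_x) ** 2 for a in rx)
    var_y = sum((b - mean_y) ** 2 for b in ry)
    return cov / (var_x * var_y) ** 0.5


# Invented per-gene values standing in for the AveExpr and logFC columns.
ave_expr = [1.2, 3.5, 5.1, 7.8, 9.4]
log_fc = [0.1, 0.3, 0.2, 0.9, 1.4]

rho = spearman(ave_expr, log_fc)
print(f"Spearman correlation between AveExpr and logFC: {rho:.2f}")
```

A value near 0 would be what I expected. A value near 1, as in this toy example, corresponds to the pattern I keep getting.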
I tried changing the design to a simpler one with fewer interactions, and I was also advised to normalize only the samples used in each comparison, but both approaches gave worse results. I tried correcting with surrogate variables from the sva package, and that did not help either, even though it found 2 surrogate variables. The PCA did not show any clear batch effect, only that stem and differentiated cells have very different expression (they are separated by the first component, which explains 36.5% of the variance).
I have run out of ideas to try, so any suggestions on how to design the model or normalize the data are welcome.