I have an RNA-seq data set with a paired design (30 tissues, each with a treated and an untreated condition). I am trying to classify the samples into subsets because there is a lot of heterogeneity in the data: since these are primary human cells, a subset of samples might respond differently to the treatment, a few might not respond at all, and so on.
I have tried clustering and PCA (within DESeq2) to see how the samples group based on normalised or variance-stabilized read counts, but the results look mixed. I could not clearly identify any subsets of samples.
I would like to take a different approach. As it is a paired design, I can calculate the fold change of normalised gene expression values within each treated/untreated pair and use the resulting fold-change matrix (genes by pairs) for clustering/PCA to identify outlier samples. This might give me an idea of which samples respond to the treatment in a similar manner.
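The fold-change approach described above can be sketched roughly as follows in Python with NumPy. This is only an illustration on simulated toy counts, not real data: the function names `paired_log2_fold_changes` and `pca_scores`, the pseudocount of 1, and the simulated "responder" subset are all assumptions made for the example, and it presumes the counts have already been normalised for library size.

```python
import numpy as np


def paired_log2_fold_changes(treated, untreated, pseudocount=1.0):
    """Per-pair log2 fold changes from normalised counts.

    Both inputs are genes x pairs arrays with matching column order.
    The pseudocount (an assumption, not a recommendation) avoids log2(0).
    """
    treated = np.asarray(treated, dtype=float)
    untreated = np.asarray(untreated, dtype=float)
    return np.log2(treated + pseudocount) - np.log2(untreated + pseudocount)


def pca_scores(lfc, n_components=2):
    """PCA on the fold-change matrix; returns pairs x components scores."""
    x = lfc.T                      # pairs x genes
    x = x - x.mean(axis=0)         # centre each gene across pairs
    u, s, _ = np.linalg.svd(x, full_matrices=False)
    return u[:, :n_components] * s[:n_components]


# Toy data: 200 genes, 30 pairs, with a hypothetical responder subset
# (first 10 pairs up-regulate the first 50 genes four-fold).
rng = np.random.default_rng(0)
n_genes, n_pairs = 200, 30
effect = np.ones((n_genes, n_pairs))
effect[:50, :10] = 4.0
untreated = rng.poisson(100, size=(n_genes, n_pairs)).astype(float)
treated = rng.poisson(100 * effect).astype(float)

lfc = paired_log2_fold_changes(treated, untreated)
scores = pca_scores(lfc)

# Pairs with similar treatment responses should sit close together on PC1/PC2.
for i, (pc1, pc2) in enumerate(scores):
    print(f"pair {i:2d}  PC1={pc1:7.2f}  PC2={pc2:7.2f}")
```

Because each column is a within-pair difference on the log scale, baseline differences between tissues cancel out, so the PCA reflects variation in the treatment response rather than tissue-to-tissue variation. Low-count genes give noisy ratios, so filtering them or using variance-stabilized values before taking the difference may be worth considering.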
I would like to get more suggestions on this.
1. What is the best way to calculate fold-change values within each pair? Would applying the fold-change formula to the normalised/variance-stabilized read counts be enough, or would it be better to load each pair of samples into DESeq2 and calculate the fold changes there?
2. Any pointers to papers or statistical methods that use this sort of analysis to deal with heterogeneity in large data sets would be really appreciated.
3. Any other suggestions or statistical methods would also be useful.