I have a rna-seq data set with a paired design ( 30 tissues with treated and untreated condition ). I am trying to classify the samples in to different subsets as there is lot heterogeneity in the data i.e. a subset of samples might behave differently to the treatment, few might not respond at all etc as its a primary human cells data.

I have tried clustering/PCA analysis ( with in DESeq2) to look how the samples are clustered based on normalised or variancestabilized read counts but the results looks like mixed. I could not clearly identify the subset of samples.

I would like to take a different approach. As it is a paired design, I can calculate the fold-change of normalised gene expression values between each pair of samples and use the fold-change matrix to do a clustering/PCA analysis to identify the outlier samples. This might give me an idea about the samples which behave in a similar manner to treatment effects.

I would like to get more suggestions on this.

1. What is the best way to calculate fold-change values between each pair. Applying the fold-change formula to the normalised/variancestabilized read counts would be enough ? Or loading each pair of samples into `edgeR`

and `DESeq2`

to calculate fold-changes would be a good idea ?

2. Any pointers to papers or statistical methods where they have done this sort of analysis to deal with heterogeneity in large data sets is really appreciated.

3. Any other suggestions or statistical methods would be useful.

Thanks.

Thanks for the response.

`This looks a bit odd as around 45% genes are shown to be differentially regulated. vsd <- varianceStabilizingTransformation(dds) # PCA Plot of VSD with default parameters ( ntop=500)`

Here I could see the different groups but this does not reflect in DE genes, as shown below.

`#PCA plot of VSD of Differentially expressed genes at 0.05 FDR. (vsd[de,])`

hi,

Scanning you code, it looks correct. There's nothing here to make me think that the 45% DEG is wrong, or indicating an error in analysis. It could be that you have 60 samples, and so have good power, there are true differences, and you are testing a null hypothesis of no difference across treatment. I would encourage you to investigate the top genes visually using plotCounts or other methods. Note that, when you have many samples, if you are interested only in large differences, you can use the lfcThreshold argument.

The PCA of all the data looks like the samples were prepared in at least two batches. Is this correct? The 'subject' part of the design should account for the batch difference as long as the pairs were always in the same batch. Is this the case?

I need to check thoroughly the confounding factors. The pairs were always in the same batch i.e processed together. Though , according to my knowledge, the paired-design takes care of batch effects, is there something I could to to remove the batch effects or improve the analysis ?

To update, The PCA plot divides the samples to two groups and the confounding factor here is Gender. The samples from Males and Females separates out into two groups.

Hi,

I am facing the same problem as g.atla last updated. My PCA instead of grouping the samples by the condition (case / control) is grouping by Gender. So, I have decided to use the following command:

With this, I am obtaining PC1: 62% variance and PC2: 13% variance. Would it be OK?

It's typical that PCA would group by sex, and also typical to include it in the design as you have done.

The percent variance on the PCA plot will not change according to the design formula. It's unsupervised.

Thanks Michael. So, this would be all I can do to control for gender right?

Yes this is the typical way to control for a variable while testing another.

I have a new question Michael related to PCA plot:

When I use:

The result is totally different and also the variance. (Figure attached, PLOT_2) Are both valid?

From the first plot (low variances), does it mean that something is going wrong? As you told me in the previous message, modifying the dessign does not change the percent variance of the PCA plot. So, what should I do?

Thanks so much.

The second plot isn’t a proper variance stabilization and is not a great method therefore (see our paper).

So, would it suppose a problem having so low variance as in the first plot? My dessign is

~ sex + batch + condition but if I add the age as a covariate the percentage variance of the PCA plot should be the same? What can I do with the data to obtain an increased variance?

Thanks a lot Michael.

The design doesn’t really effect the PCA. There’s nothing to fix really, the PCA let’s your visualize your data as it is.