I have a brief question regarding some analyses we are performing using DESeq2, and I was hoping you could give me an opinion or some suggestions in order to improve our procedure.
Our experimental design features ~80 samples, divided in 19 classes (which correspond to different cellular populations).
Unfortunately, our design is uneven, so that for one condition we have as much as 6 samples (max), and for two conditions there is one sample each.
Our goal is to select genes which are specifically expressed in some of these classes, so that the expression profiles of selected genes are similar to the following distribution density vectors
for selection 1)
"0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 1" (dimensionality is 19 as the number of classes)
and for selection 2)
"0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 1 0"
selection 3)
"0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 1 1"
and selection 4)
"0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 1 0 1"
In the procedure we used for this selection process (in DESeq2) we created a sampletable where we set as TREATED the samples for which the genes had to be 'specifically expressed' and the rest as 'CONTROL'. Of course, if we are selecting for samples in two classes (one constituted by 6 samples and one by 1) the design for the comparison may be unbalanced.
Once we obtained the results (a list of padj-ordered genes for each of the 4 comparison), we calculated Pearson correlation values for each gene and a class-specific reference model (the binary vectors reported above).
We obtained input values for the Pearson correlation calculations by calculating mean values for each one of the 19 cellular populations (from a different DESeq2 session), like this:
normalized.counts <- as.data.frame(counts( dds, normalized=TRUE))
rowMeans(counts(dds,normalized=TRUE)[,dds$condition == populations[1]]
We found an overall good correlation between the pearson coefficients calculated on the samples means of classes and the pvalues assigned to the genes by the DESeq2 for the different (4) classes. We used a threshold of 0.9 for the correlation coefficients in order to identify only genes which are 'specifically' expressed in the classes of our choice.
NOTE: while genes with very low padj (e.g. 10^-20) also have high correlation coefficients with our model distributions, genes with padj of the order 10(^-3/-4) may have non-specific expression (with Pearson correlation values as low as 0.3)
Thank you very much
thank you Mike!
thank you Mike!
thank you Mike!