I have a somewhat technical statistical question that came up while using DESeq2 to test for differences in the expression of certain miRNAs between groups.
We performed a small RNA-seq experiment with 4 groups of 12 animals each, to determine possible differences in miRNA gene expression across our different states.
When analyzing our data, we realised that many of the miRNAs were very lowly expressed, as expected, with only a few being highly expressed. We calculated the Biological Coefficient of Variation (BCV) for each miRNA in each group, following the formula reported by the edgeR developers, as the square root of the dispersion estimated for each gene with the estimateDispersions() function. By doing so, we were able to see the expected result: miRNAs with low expression levels tended to have higher BCV values than highly expressed miRNAs, which had low BCV values. However, some miRNAs had BCVs higher than expected for their expression level and, more interestingly, some of these abnormal miRNAs had a high BCV while being highly expressed in one group, yet behaved normally in the other groups. To sum up, some miRNAs behaved strangely with respect to their BCV and expression level in some groups, while behaving normally in others.
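To make the BCV calculation concrete, here is a minimal sketch in Python (for illustration only; DESeq2 itself is an R package). It does not reproduce the shrunken dispersions from estimateDispersions(). Instead it assumes the negative binomial mean-variance relation var = mu + alpha * mu^2 and uses a simple method-of-moments estimate of alpha. The function name and the example counts are hypothetical.

```python
import math
from statistics import mean, variance


def bcv_method_of_moments(counts):
    """Rough BCV for one miRNA in one group, from normalized counts.

    Assumption: var = mu + alpha * mu^2 (negative binomial), solved for alpha.
    This is a crude stand-in, not the estimate DESeq2 reports.
    """
    mu = mean(counts)
    var = variance(counts)  # sample variance (n - 1 denominator)
    if mu <= 0:
        return float("nan")
    # Clamp at zero when the data are not over-dispersed.
    alpha = max((var - mu) / mu ** 2, 0.0)
    return math.sqrt(alpha)  # BCV is the square root of the dispersion


# Hypothetical normalized counts for one group of 12 animals.
low_expr = [0, 3, 1, 7, 0, 2, 9, 1, 0, 4, 2, 6]
high_expr = [980, 1040, 1010, 950, 1100, 990, 1020, 970, 1060, 1000, 1030, 960]

print("low-expression miRNA BCV: ", bcv_method_of_moments(low_expr))
print("high-expression miRNA BCV:", bcv_method_of_moments(high_expr))
```

With these made-up counts the lowly expressed miRNA gets the larger BCV, which is the usual pattern described above.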
To follow up on this, we wanted to test whether these differences in variability were significant across groups for certain miRNAs. The idea is much the same as the usual contrast of gene expression across groups, except that we compare the dispersion of the data rather than its mean: differential dispersion instead of differential expression, variances instead of means.
To do so, we calculated, for each miRNA in each sample, the deviation from the mean expression value using this formula:
abs(normalized count of the gene in the sample - mean normalized count of the gene in the group)
The further the expression value of a sample is from the mean expression in its group, the higher the dispersion value; taking the absolute value makes the deviations positive for samples whose expression is below the mean. This new matrix of per-gene dispersion values had a negative binomial distribution, similar to the gene expression matrix, with most genes concentrated at low dispersion values and a tail of genes with high dispersion values.
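The transformation we applied can be sketched as follows, again in Python purely for illustration. The function name, the data layout (a dict of gene to per-sample normalized counts, plus a list of group labels) and the example values are hypothetical.

```python
from statistics import mean


def abs_deviation_matrix(norm_counts, groups):
    """Absolute deviation of each sample from its own group's mean, per gene.

    norm_counts: {gene: [normalized count per sample]}
    groups:      group label for each sample, in the same order
    """
    result = {}
    for gene, counts in norm_counts.items():
        # Mean normalized count of this gene within each group.
        group_means = {
            g: mean(c for c, label in zip(counts, groups) if label == g)
            for g in set(groups)
        }
        # |count - group mean| for every sample.
        result[gene] = [
            abs(c - group_means[label]) for c, label in zip(counts, groups)
        ]
    return result


# Hypothetical example: one miRNA, two groups of three samples.
groups = ["A", "A", "A", "B", "B", "B"]
norm_counts = {"miR-x": [10, 12, 14, 100, 100, 100]}
deviations = abs_deviation_matrix(norm_counts, groups)
print(deviations["miR-x"])
```

In this toy case group A is variable and group B is not, so only the group A samples get non-zero values.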
This new matrix was fed into the canonical DESeq2 differential expression pipeline, and the results were interpreted as genes showing differential "dispersion" rather than differential expression, with fold-change values, p-values and FDR statistics.
My question is: is this a statistically correct approach? I have not been able to find any software that performs this kind of analysis, contrasting gene dispersion across groups instead of gene expression values.
Is there an alternative approach to do this?