Hello,

I have a somewhat technical statistical question arising from analyses I have been running with DESeq2 to detect differences in the expression of certain miRNAs between groups.

We performed a small RNA-seq experiment with 4 groups of 12 animals each, to look for differences in miRNA expression across our different states.

When analyzing our data, we realised that most miRNAs were very lowly expressed, as expected, with only a few being highly expressed. We calculated the Biological Coefficient of Variation (BCV) for each miRNA in each group, following the formula reported by the edgeR developers, as the square root of the dispersion estimated for each gene with the estimateDispersions() function. This showed the expected pattern: miRNAs with low expression tended to have higher BCVs than highly expressed miRNAs. However, some miRNAs had BCVs higher than expected given their expression level and, more interestingly, some of these miRNAs had a high BCV while highly expressed in one group but behaved normally in the other groups. In short, some miRNAs behaved strangely with respect to BCV and expression level in some groups, while behaving normally in others.
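To make the BCV definition concrete, here is a minimal sketch that estimates a per-group BCV for one gene by the method of moments, assuming the negative binomial relation variance = mean + dispersion * mean^2. This is not the shrunken estimate that estimateDispersions() returns; the function name `moment_bcv` and the example counts are hypothetical and only illustrate that BCV is the square root of the dispersion.

```python
import math
import statistics

def moment_bcv(counts):
    """Crude BCV for one gene in one group, from normalized counts.

    Assumes var = mean + dispersion * mean**2 (negative binomial) and
    solves for dispersion. No shrinkage or trend fitting is done, so the
    value will differ from what DESeq2 or edgeR report.
    """
    mean = statistics.mean(counts)
    var = statistics.variance(counts)  # sample variance (n - 1 denominator)
    if mean <= 0:
        return 0.0
    # Clip at zero: genes less variable than Poisson get dispersion 0
    dispersion = max((var - mean) / mean**2, 0.0)
    return math.sqrt(dispersion)

# Hypothetical normalized counts for one miRNA in two groups (same mean of 100)
tight_group = [100, 130, 70, 120, 80, 100]
wide_group = [100, 250, 20, 180, 30, 20]

print(round(moment_bcv(tight_group), 3))
print(round(moment_bcv(wide_group), 3))
```

Both hypothetical groups have the same mean, so the difference in BCV here comes only from the spread of the counts, which is the situation described above.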

To follow up on this, we tried to test whether these differences in dispersion across groups were significant for particular miRNAs. The idea is the same as the usual contrast of gene expression between groups, except that the quantity being compared is the dispersion of the data rather than its mean: differential dispersion instead of differential expression.

To do so, we calculated, for each sample, the deviation of each miRNA's expression from its mean expression, using this formula:

abs(normalized counts from gene expression - mean gene expression)

The further a sample's expression value lies from the mean expression in its group, the higher the dispersion value; taking the absolute value makes the deviation positive for samples whose expression is below the mean. Like the gene expression matrix, this new dispersion matrix appeared to follow a negative binomial distribution, with most genes having low dispersion values and a tail of genes with high dispersion values.
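As an illustration of the formula above, the following is a minimal sketch of how such a matrix of absolute deviations could be built. The data layout (a dict of gene name to per-sample normalized counts, plus a list of group labels), the function name `abs_deviation_matrix`, and the example values are all assumptions made for the example, not the code actually used.

```python
import statistics

def abs_deviation_matrix(norm_counts, groups):
    """Absolute deviation of each sample from its own group's mean, per gene.

    norm_counts: {gene: [normalized count per sample]}
    groups:      [group label per sample], same order as the counts
    """
    result = {}
    for gene, values in norm_counts.items():
        # Mean expression of this gene within each group
        group_means = {}
        for label in set(groups):
            members = [v for v, g in zip(values, groups) if g == label]
            group_means[label] = statistics.mean(members)
        # abs(normalized count - mean of the sample's own group)
        result[gene] = [abs(v - group_means[g]) for v, g in zip(values, groups)]
    return result

# Hypothetical example: one miRNA, three samples in each of two groups
norm_counts = {"miR-x": [10, 20, 30, 100, 100, 100]}
groups = ["G1", "G1", "G1", "G2", "G2", "G2"]
deviations = abs_deviation_matrix(norm_counts, groups)
print(deviations["miR-x"])
```

Note that these values are generally not integers, whereas DESeq2 expects a matrix of raw integer counts, so some rounding would be needed before passing them on.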

This new matrix was fed into the canonical DESeq2 differential expression pipeline, and the results were interpreted as genes showing differential "dispersion" rather than differential expression, with fold change values, P-values and FDR statistics.

My question is: would this be a statistically correct approach? I have not been able to find any software that performs this kind of analysis, contrasting gene dispersion across groups instead of gene expression values.

Is there an alternative approach to do this?

Many thanks

I am not really interested in assessing differences in the global BCV distribution across groups, but in determining whether gene-wise BCV values differ significantly across groups.

As you can see in the plotted BCVs for each analyzed gene across groups, I am interested not in differences in the overall distribution, but in BCV values that fit the overall trend in one group and do not seem to fit it in another group.

Applying the approach I described in the first message, I obtained these two genes, among a few others, as "Differentially Dispersed" across groups. The BCV for ssc-miR-1285 is 0.6 in the ALT2 group but 1.20 in the ALT0 group. These are the differences I am interested in evaluating.

Similarly, the BCV for ssc-miR-122-5p was around 0.5 in the ALT0 group and 1.30 in the ART0 group.

I obtained a table with one BCV value per gene (miRNA in this case) in each group, and wanted to test whether these values were significantly different. Because I had no gene-wise replicates of the BCV values, I opted to calculate a dispersion value for each gene in each sample, so that a hypothesis test could be applied.

My actual question was: what kind of hypothesis test would be most recommended for this case?
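For reference, comparing absolute deviations from a group centre is the idea behind Levene's test (deviations from the group mean) and its Brown-Forsythe variant (deviations from the group median). The sketch below is one possible way to test a single gene between two groups with a permutation version of that idea. It is an illustration under assumptions, not an established pipeline: the function name `dispersion_permutation_test` is hypothetical, the input values are invented and assumed to be log-transformed normalized counts, and permuting the deviations treats them as exchangeable, which holds only approximately.

```python
import random
import statistics

def dispersion_permutation_test(group_a, group_b, n_perm=10000, seed=0):
    """Two-sided permutation p-value for a difference in spread between two groups.

    Uses absolute deviations from each group's median (Brown-Forsythe style)
    and shuffles those deviations between the groups.
    """
    med_a = statistics.median(group_a)
    med_b = statistics.median(group_b)
    dev_a = [abs(x - med_a) for x in group_a]
    dev_b = [abs(x - med_b) for x in group_b]
    n_a, n_b = len(dev_a), len(dev_b)
    observed = abs(sum(dev_a) / n_a - sum(dev_b) / n_b)

    pooled = dev_a + dev_b
    rng = random.Random(seed)  # fixed seed so the sketch is reproducible
    hits = 0
    for _ in range(n_perm):
        rng.shuffle(pooled)
        diff = abs(sum(pooled[:n_a]) / n_a - sum(pooled[n_a:]) / n_b)
        if diff >= observed:
            hits += 1
    # Add-one correction keeps the p-value strictly above zero
    return (hits + 1) / (n_perm + 1)

# Hypothetical log-scale values for one miRNA, 12 animals per group
tight = [5.0, 5.1, 4.9, 5.05, 4.95, 5.0, 5.1, 4.9, 5.02, 4.98, 5.0, 5.03]
wide = [5.0, 7.0, 3.0, 6.5, 3.5, 5.0, 8.0, 2.0, 6.0, 4.0, 5.5, 4.5]
p_value = dispersion_permutation_test(tight, wide, n_perm=2000)
print(p_value)
```

Run gene by gene, this gives one p-value per miRNA, which would then need a multiple-testing correction such as Benjamini-Hochberg before being read as an FDR.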

Many thanks.

I’m not sure what packages or methods will do this, but DESeq2 does not have such a test.

I know, and this is why I took a somewhat alternative approach and calculated a dispersion matrix with one value per gene per sample. The distribution of this new matrix fits a negative binomial distribution, at least according to my data. I then used this new matrix as input to the classical DESeq2 differential expression test.

I know this is a non-canonical and surely not recommended approach, as DESeq2 was not designed for this purpose, but I have not been able to find an alternative, better suited way to test my hypothesis...