1- In miRNA analysis, the CPM threshold that corresponds to 10 raw reads would be CPM > 1 for a library size of 10 million (miRNA counts). However, in some samples miRNAs represent only a small fraction of the reads, so the miRNA library size can be 0.5 to 0.1 million or less. Is it valid to use a threshold of CPM > 20, CPM > 100 or more for filtering?
2- Should the library size on which we base the CPM filtering threshold be the total mapped counts or only the miRNA counts?
For your first question - yes, it's fine to adjust the CPM threshold. What matters is the size of the underlying counts, which determines the detection power of the downstream DE analysis. For example, I would be fairly assured that I could detect DE if I had average counts of ~20 across my samples. If I had average counts of 2 across samples instead, my detection power would be a lot lower, and I doubt I would be able to consistently detect DE. Features like the latter should be removed during filtering to reduce the severity of the BH correction, as well as to ensure that the discreteness of low counts does not interfere with normalization and trend fitting.
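To make the arithmetic behind the threshold concrete, here is a minimal Python sketch. It is plain arithmetic rather than the edgeR API; the function name `cpm_threshold` and the three library sizes are illustrative assumptions taken from the numbers in the question.

```python
def cpm_threshold(min_count, library_size):
    """CPM value equivalent to `min_count` raw reads in a library of `library_size` reads."""
    return min_count * 1e6 / library_size

# Hypothetical miRNA library sizes (total miRNA reads per sample).
library_sizes = [10_000_000, 500_000, 100_000]

for size in library_sizes:
    # Same raw-count floor (10 reads), very different CPM cutoffs.
    print(f"{size:>10,d} reads: 10 raw reads = CPM {cpm_threshold(10, size):g}")
```

The same floor of 10 raw reads translates to CPM 1, 20 and 100 for these three library sizes, which is why the CPM cutoff should follow the library size rather than stay fixed at 1.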
For the second question - it depends on whether you can assume that most miRNAs are not DE across samples. If you are expecting a global up- or downregulation of miRNAs between conditions, you should not use the total miRNA count as a normalizing factor. This is because it will change between conditions for biological reasons, such that normalizing on it would remove the biology of interest. On the other hand, if you do assume that most miRNAs are not DE, then normalizing on the miRNA counts is the preferred approach, as it will eliminate any uninteresting biases in miRNA representation between samples (e.g., due to differences in miRNA capture efficiency).
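The effect of a global shift can be illustrated with a toy Python sketch. The counts are made up, the helper `cpm` is a hypothetical name, and the example assumes equal sequencing depth and capture efficiency in both samples; it is not how edgeR computes normalization factors.

```python
def cpm(counts, library_size):
    """Scale raw counts to counts per million of `library_size`."""
    return [c * 1e6 / library_size for c in counts]

# Hypothetical raw miRNA counts: every miRNA in the treated sample is halved.
control = [100, 200, 700]
treated = [50, 100, 350]

# Library size taken as the total miRNA count of each sample.
control_cpm = cpm(control, sum(control))
treated_cpm = cpm(treated, sum(treated))

print(control_cpm)
print(treated_cpm)
```

Both samples end up with identical CPM values, so the two-fold global downregulation is no longer visible after scaling by the miRNA total.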
Thank you Aaron,
Does that mean we should include all small RNAs that we have counts for (i.e., piRNAs, tRNAs, rRNAs) in the library normalization if we are expecting high differential expression between samples?
Best regards
Hi,
Aaron is completely right about using the miRNA counts if you don't expect global de-regulation. I would say that detecting this is pretty complicated, though, because using everything can introduce a lot of bias: the library preparation itself can change the amount of specific small RNAs, such as rRNA or those at the upper (right-hand) end of the size distribution.
I would check whether other kinds of small RNA show a difference in the total number of reads. For instance, assuming a global de-regulation of miRNAs, you would see that one group has half the number of reads mapping to miRNAs; if the tRNA reads are constant, then you have a good reason to use miRNA / tRNA for the normalization. But if you always see a difference in the number of reads for every small RNA type, then it's more complicated to decide what to do.
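A rough way to run that check is sketched below in Python. All read totals are invented toy numbers, the class names and the helper `ratio_to_reference` are hypothetical, and the sketch assumes that per-class read totals have already been tallied for each group.

```python
# Hypothetical per-class read totals for two groups (toy numbers, not real data).
class_totals = {
    "control": {"miRNA": 1_000_000, "tRNA": 400_000, "rRNA": 100_000},
    "treated": {"miRNA": 500_000, "tRNA": 400_000, "rRNA": 100_000},
}

def ratio_to_reference(totals, target, reference):
    """Reads in `target` class per read in `reference` class (independent of depth)."""
    return totals[target] / totals[reference]

# Is the candidate reference class itself stable relative to another class?
trna_vs_rrna = {g: ratio_to_reference(t, "tRNA", "rRNA") for g, t in class_totals.items()}

# miRNA abundance expressed relative to the tRNA reference.
mirna_vs_trna = {g: ratio_to_reference(t, "miRNA", "tRNA") for g, t in class_totals.items()}
fold_change = mirna_vs_trna["treated"] / mirna_vs_trna["control"]

print(trna_vs_rrna)
print(mirna_vs_trna, fold_change)
```

In this toy case the tRNA / rRNA ratio is the same in both groups while the miRNA / tRNA ratio halves, which is the pattern that would support a tRNA-based reference. If every class shifts between groups, the ratios give no stable anchor.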
I would say that even if half of them are de-regulated, you can still use the edgeR/DESeq2/limma-voom options.
You can also read more about this here: https://www.researchgate.net/publication/315091565_Modeling_bias_and_variation_in_the_stochastic_processes_of_small_RNA_sequencing?ev=prf_high
Hope this helps.