I am analyzing an experiment on the mouse olfactory epithelium. We have 3 treatment conditions and 1 control condition, each with 4 biological replicates. Each of the 3 treatment conditions expresses a different isoform of transgenically up-regulated Adar protein. Adar is hypothesized to be a negative regulator of circular RNAs (circRNAs). Its hypothesized mechanism of negative regulation is the melting of complementary double-stranded RNA (dsRNA) by editing adenosine nucleotides to inosine, resulting in the equivalent of an A-to-G mutation.
We would like to investigate changes in circRNA expression in concert with rates of RNA editing. For my initial DESeq2 analysis I combined linear read counts generated by the Rsubread function featureCounts with circRNA counts generated by CIRCexplorer2 and DCC. The combined count matrix was analyzed with the standard workflow detailed in the DESeq2 tutorial.
I then used the SPRINT toolkit to detect RNA-editing sites and events. This program also outputs files that contain every edited read in each sample, and these files can be converted to BAM format. I ran featureCounts on the BAM files, and now I'd like to run DESeq2 on this count matrix of edited reads. There are a few potential problems with this: the treatments had different overall numbers of edited reads, and these numbers do not correlate with the original library sizes. My goal is to investigate differences in circRNA expression in concert with differences in edited circRNA parent-gene expression across treatments. I have a few questions about how best to proceed.
- For the edited-read count matrix, should I supply size factors based on the original library sizes, or based on the number of edited reads? Or should I avoid supplying size factors and let the DESeq function estimate them?
- Should I compare the DESeq2 results from this edited-RNA count matrix with the circRNA results from the previous DESeq2 analysis, or should I combine edited-RNA counts with circRNA counts?
So far I have tried supplying size factors based on the original library sizes, and supplying size factors based on the edited-read file sizes. Both approaches resulted in a high number of low counts. When I refrained from manually entering size factors, the number of low counts was within a reasonable range. Are these results valid? Can someone explain the high number of low counts when size factors are manually entered?
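For reference, here is a minimal base-R sketch of one way library-size-based size factors could be put on the usual scale before being supplied manually. It uses the `lib.size` values from the colData in this post. Rescaling to a geometric mean of 1 is an assumption made for illustration, not necessarily how the manual values described above were derived, and `dds` is a hypothetical name for the DESeqDataSet.

```r
# Original library sizes (lib.size column), one entry per sample barcode.
lib_size <- c(
  ACAGTG = 48734047, ATGTCA = 47375584, GTCCGC = 45984196, TGACCA = 46642798,
  ACTTGA = 49944630, CAGATC = 47044943, CCGTCC = 54501566, GCCAAT = 46477330,
  AGTCAA = 44679063, AGTTCC = 48939432, ATCACG = 39105822, CGATGT = 49688467,
  CTTGTA = 63674462, GATCAG = 43275574, GGCTAC = 54066928, TAGCTT = 42858919
)

# Divide by the geometric mean so the factors are centred around 1,
# i.e. on a scale comparable to the estimates from estimateSizeFactors().
geo_mean  <- exp(mean(log(lib_size)))
manual_sf <- lib_size / geo_mean

print(round(manual_sf, 4))

# With a DESeqDataSet `dds` (hypothetical name) whose columns are in the
# same sample order, the factors could then be assigned with:
# sizeFactors(dds) <- manual_sf
```

Normalized counts are the raw counts divided by the size factors, so if raw totals in the tens of millions were passed unscaled, every normalized count would be divided by a number that large. That may be worth ruling out when comparing the manual and estimated runs.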
The colData below contains the DESeq2-estimated size factors rather than the manually entered values.
Thanks for any insight anyone can provide as there is scant information available for this type of analysis.
```
as.data.frame(colData(editedmRNAcircRNAdata))
        group lib.size   mapped totalcirc      RPM sizeFactor
ACAGTG  tetG2 48734047 43690194      1433 32.79912  0.9815202
ATGTCA  tetG2 47375584 43503049      1178 27.07856  0.9365610
GTCCGC  tetG2 45984196 43436819       819 18.85497  0.9262942
TGACCA  tetG2 46642798 38909551      1280 32.89681  0.8002524
ACTTGA   ADAR 49944630 46687050       807 17.28531  1.1794076
CAGATC   ADAR 47044943 43348848      1047 24.15289  0.9328862
CCGTCC   ADAR 54501566 49689028      1695 34.11216  1.0394567
GCCAAT   ADAR 46477330 36707614      1358 36.99505  0.5723309
AGTCAA ADAR_E 44679063 39331402      1455 36.99334  0.9248037
AGTTCC ADAR_E 48939432 43793983      1629 37.19689  1.1727315
ATCACG ADAR_E 39105822 34824007      1706 48.98919  0.6974554
CGATGT ADAR_E 49688467 41789212      1630 39.00528  1.1576032
CTTGTA ADAR_O 63674462 50467694      1096 21.71686  1.5257405
GATCAG ADAR_O 43275574 38831849       554 14.26664  1.2713846
GGCTAC ADAR_O 54066928 47277414       688 14.55240  1.3834778
TAGCTT ADAR_O 42858919 36723900       572 15.57569  1.1759690
```
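For anyone reading the table, the `RPM` column appears to be circRNA reads per million mapped reads, i.e. `totalcirc / mapped * 1e6`. A quick base-R check on three of the samples above, with values copied from the table:

```r
# Values copied from the colData table for three samples.
mapped    <- c(ACAGTG = 43690194, ATGTCA = 43503049, ACTTGA = 46687050)
totalcirc <- c(ACAGTG = 1433,     ATGTCA = 1178,     ACTTGA = 807)
reported  <- c(ACAGTG = 32.79912, ATGTCA = 27.07856, ACTTGA = 17.28531)

# circRNA reads per million mapped reads.
rpm <- totalcirc / mapped * 1e6

# Compare the recomputed values with the RPM column, to 4 decimal places.
print(cbind(recomputed = round(rpm, 5), reported))
stopifnot(all(abs(rpm - reported) < 1e-4))
```

So `RPM` is scaled by the `mapped` column rather than by `lib.size`.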