Dear all,
I would like to ask for advice on normalizing an RNA-seq dataset with extremely skewed libraries caused by a viral infection. Briefly, both human and viral RNAs were quantified, and as can be seen in the table below, reads from viral RNA (= 10 genes) make up ~50% of the entire library in the infected condition. When running DESeq2 on this dataset, the estimated size factors effectively double the counts for cellular RNAs because they capture the trend of the 20,000 human genes. However, based on separate absorbance measurements, we have reason to believe that cellular RNAs are being degraded at the same time as viral RNAs are being produced. If cellular RNA levels were indeed unchanged, the samples should show a 50% increase in total RNA abundance, whereas we observed that total RNA levels remained equal at best or even dropped slightly. Unfortunately, we did not use spike-in RNAs, so we cannot use these as a reference for normalization.
I am therefore highly interested in your suggestions concerning appropriate size factors for the normalization. Would it make sense to simply use a library-size correction as the size factors (i.e. divide each library's total number of counts by that of "uninfected R1")?
Any help is greatly appreciated!
Kind regards
edit: Additionally, since this behavior might mean that the majority of our genes are in fact down-regulated, one of the underlying assumptions for DESeq2 (most genes do not change) might be violated, so any recommendations on how to deal with such cases would be VERY helpful.
sample | # reads cellular RNA | # reads viral RNA |
---|---|---|
infected R1 | 33362929 | 31800331 |
infected R2 | 29730131 | 37140090 |
infected R3 | 34245143 | 28871640 |
uninfected R1 | 58173048 | 4516 |
uninfected R2 | 57644064 | 2920 |
uninfected R3 | 62098823 | 1557 |
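For reference, the library-size factors asked about above can be worked out directly from the table. The following is only a minimal Python sketch, not something from the original analysis: the variable names are made up for illustration, and the "cellular-only" ratio is a crude stand-in for, not a reproduction of, what DESeq2's median-of-ratios estimate does with gene-level counts.

```python
# Read counts copied from the table above: (cellular, viral) per sample.
counts = {
    "infected R1":   (33362929, 31800331),
    "infected R2":   (29730131, 37140090),
    "infected R3":   (34245143, 28871640),
    "uninfected R1": (58173048, 4516),
    "uninfected R2": (57644064, 2920),
    "uninfected R3": (62098823, 1557),
}

reference = "uninfected R1"  # reference library proposed in the question

# Total library size = cellular + viral reads.
totals = {s: c + v for s, (c, v) in counts.items()}

# Fraction of each library taken up by viral reads.
viral_fraction = {s: v / totals[s] for s, (c, v) in counts.items()}

# Proposed factors: total library size relative to the reference sample.
library_size_factors = {s: t / totals[reference] for s, t in totals.items()}

# For contrast (assumption: rough proxy only): cellular reads relative to the
# reference. Factors near 0.5 are what lead to cellular counts being
# roughly doubled after normalization.
cellular_only_factors = {
    s: c / counts[reference][0] for s, (c, v) in counts.items()
}

for s in counts:
    print(
        f"{s:14s} total={totals[s]:>9d} "
        f"viral={viral_fraction[s]:.3f} "
        f"lib_factor={library_size_factors[s]:.3f} "
        f"cellular_factor={cellular_only_factors[s]:.3f}"
    )
```

In DESeq2 itself, user-defined factors like these can be supplied with `sizeFactors(dds) <- ...` before running the analysis. Whether total library size is the right reference here is exactly the open question of this thread.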
Hi Michael, thanks for the quick response! I doubt that using the viral RNA for normalization will do any good, since we would basically just be looking at the other half of the sequencing reads compared to the standard normalization. As mentioned, the absorbance measurements after RNA extraction indicate that the total RNA concentration remains stable. Do you think it would make any sense to supply size factors based on total library size only?
I don’t have a good sense of what the right way to normalize this data is.