I would like to ask for advice on normalizing an RNA-seq dataset with extremely skewed libraries caused by a viral infection. Briefly, both human and viral RNAs were quantified and, as can be seen in the table below, reads from viral RNA (10 genes) make up ~50% of the entire library in the infected condition. When I run DESeq2 on this dataset, the estimated size factors effectively double the counts for cellular RNAs in the infected samples, because they follow the trend of the ~20,000 human genes. However, based on separate absorbance measurements, we have reason to believe that cellular RNAs are being degraded at the same time as viral RNAs are being produced. If cellular RNA levels were indeed unchanged, the samples should show a 50% increase of total RNA abundance, whereas we observed that total RNA levels remained equal at best or even dropped slightly. Unfortunately, we did not use spike-in RNAs, so we cannot use these as a reference for normalization.
I am therefore very interested in your suggestions for appropriate size factors. Would it make sense to simply use a library-size correction as the size factors (i.e. divide each sample's total number of counts by the total number of counts of "uninfected R1")?
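To make the two options concrete, here is a minimal sketch in plain Python (not DESeq2 itself) that compares median-of-ratios size factors with the library-size factors proposed above. The counts are made up for illustration and are not our data; the sample name `infected_R1` and the function names are hypothetical. The median-of-ratios function only mimics, as far as I understand it, how DESeq2 derives its default size factors.

```python
import math
from statistics import median

# Made-up counts: five human genes followed by one viral gene.
# Both libraries have the same depth (2000 reads); in the infected sample
# the viral gene takes half of the reads and every human gene is halved.
counts = {
    "uninfected_R1": [100, 200, 300, 400, 1000, 0],
    "infected_R1": [50, 100, 150, 200, 500, 1000],
}


def median_of_ratios(counts):
    """Per-sample median of the ratios to the per-gene geometric mean."""
    samples = list(counts)
    n_genes = len(counts[samples[0]])
    # Use only genes with non-zero counts in every sample, because the
    # geometric mean is zero otherwise.
    usable = [g for g in range(n_genes) if all(counts[s][g] > 0 for s in samples)]
    log_geo_mean = {
        g: sum(math.log(counts[s][g]) for s in samples) / len(samples)
        for g in usable
    }
    factors = {}
    for s in samples:
        log_ratios = [math.log(counts[s][g]) - log_geo_mean[g] for g in usable]
        factors[s] = math.exp(median(log_ratios))
    return factors


def library_size_factors(counts, reference="uninfected_R1"):
    """Each sample's total count divided by the total of the reference sample."""
    ref_total = sum(counts[reference])
    return {s: sum(c) / ref_total for s, c in counts.items()}


mor = median_of_ratios(counts)
lib = library_size_factors(counts)

for name, factors in (("median-of-ratios", mor), ("library size", lib)):
    # Normalized count of the first human gene in each sample.
    normalized = {s: counts[s][0] / factors[s] for s in counts}
    print(name, factors, normalized)
```

With these toy counts, the median-of-ratios factors differ two-fold between the samples, so the human genes look unchanged after normalization. The library-size factors are equal, so the human genes appear halved in the infected sample. If I understand the DESeq2 interface correctly, custom factors like these could be assigned with `sizeFactors(dds) <- ...` before calling `DESeq()`.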
Any help is greatly appreciated!
edit: Additionally, since this behavior might mean that the majority of our genes are in fact down-regulated, one of the underlying assumptions of DESeq2 (most genes do not change) might be violated. Any recommendations on how to deal with such cases would be VERY helpful.
|sample||# reads cellular RNA||# reads viral RNA|