Normalization / size factors of samples with high expression of few genes (viral infection)
René ▴ 30

Dear all,

I would like to ask for advice on normalizing an RNA-seq dataset with extremely skewed libraries caused by a viral infection. Briefly, both human and viral RNAs were quantified, and as can be seen in the table below, reads from viral RNA (= 10 genes) make up ~50% of the entire library in the infected condition. When running DESeq2 on this dataset, the estimated size factors effectively double the counts for cellular RNAs because they capture the trend of the 20,000 human genes. However, based on separate absorbance measurements, we have reason to believe that cellular RNAs are being degraded at the same time as viral RNAs are being produced. If cellular RNA levels were indeed unchanged, the samples should show a 50% increase in total RNA abundance, whereas we observed that total RNA levels remained equal at best or even dropped slightly. Unfortunately, we did not use spike-in RNAs, so we cannot use them as a reference for normalization.

I am therefore highly interested in your suggestions concerning appropriate size factors for the normalization. Would it make sense to simply use a library-size correction factor (i.e. divide by the total number of counts of "uninfected R1") as the size factors?
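To make the proposal concrete, here is a minimal sketch of the arithmetic in plain Python (not DESeq2 code). It assumes that "library size" means cellular plus viral reads per sample, and that each sample's total is divided by the total of "uninfected R1"; the variable names are only illustrative. The read counts are the ones from the table in this post.

```python
# (cellular reads, viral reads) per sample, copied from the table in this post.
reads = {
    "infected R1": (33362929, 31800331),
    "infected R2": (29730131, 37140090),
    "infected R3": (34245143, 28871640),
    "uninfected R1": (58173048, 4516),
    "uninfected R2": (57644064, 2920),
    "uninfected R3": (62098823, 1557),
}

# Assumption: library size = cellular + viral reads.
library_size = {sample: c + v for sample, (c, v) in reads.items()}

# Assumption: "uninfected R1" is the reference, so its factor is exactly 1.
reference = library_size["uninfected R1"]
libsize_factors = {sample: n / reference for sample, n in library_size.items()}

for sample, factor in libsize_factors.items():
    print(f"{sample}: {factor:.3f}")
```

In R, factors like these could then be supplied manually with `sizeFactors(dds) <- ...` instead of calling `estimateSizeFactors()`. Whether they are appropriate for this dataset is exactly the open question.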

Any help is greatly appreciated!

Kind regards

edit: Additionally, since this behavior might mean that the majority of our genes are in fact down-regulated, one of the underlying assumptions for DESeq2 (most genes do not change) might be violated, so any recommendations on how to deal with such cases would be VERY helpful.


sample           # reads cellular RNA    # reads viral RNA
infected R1      33362929                31800331
infected R2      29730131                37140090
infected R3      34245143                28871640
uninfected R1    58173048                4516
uninfected R2    57644064                2920
uninfected R3    62098823                1557
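The skew can be checked directly from the table. The following is a small Python sketch of that arithmetic only; it does not reproduce DESeq2's size factor estimation, and the helper `mean_cellular` is a hypothetical name introduced for illustration.

```python
# (cellular reads, viral reads) per sample, copied from the table above.
reads = {
    "infected R1": (33362929, 31800331),
    "infected R2": (29730131, 37140090),
    "infected R3": (34245143, 28871640),
    "uninfected R1": (58173048, 4516),
    "uninfected R2": (57644064, 2920),
    "uninfected R3": (62098823, 1557),
}

# Fraction of each library that is viral.
viral_fraction = {s: v / (c + v) for s, (c, v) in reads.items()}

def mean_cellular(condition):
    """Mean cellular read count over the replicates of one condition."""
    values = [c for s, (c, _) in reads.items() if s.startswith(condition)]
    return sum(values) / len(values)

# Cellular sequencing depth of infected relative to uninfected samples.
cellular_ratio = mean_cellular("infected") / mean_cellular("uninfected")

for sample, fraction in viral_fraction.items():
    print(f"{sample}: {fraction:.1%} viral")
print(f"infected / uninfected cellular reads: {cellular_ratio:.2f}")
```

The infected libraries are roughly 46% to 56% viral, and they carry about 0.55 times the cellular reads of the uninfected ones. Size factors that follow the human genes therefore scale cellular counts in the infected samples up by about 1/0.55, i.e. close to 2.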


deseq2 normalization sizefactors • 459 views

Without spike-ins, as you say, we're obviously a bit at a loss to figure out what's going on.

If you don't want to use the viral RNA for normalization, you can select only the non-viral genes and supply them to the controlGenes argument of estimateSizeFactors().
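As a rough illustration of what restricting to control genes does, here is a plain-Python sketch of median-of-ratios size factors computed on a chosen subset of genes. It is a simplified stand-in for the idea, not DESeq2's implementation; in DESeq2 itself, controlGenes is given as an index into the rows of the count matrix. The function name, gene names and toy counts are all made up for illustration.

```python
import math
from statistics import median

def median_ratio_size_factors(counts, control_genes=None):
    """counts maps gene -> list of counts, one entry per sample.
    Only genes in control_genes (default: all genes) are used."""
    genes = list(counts) if control_genes is None else control_genes
    n_samples = len(next(iter(counts.values())))
    log_ratios = [[] for _ in range(n_samples)]
    for gene in genes:
        row = counts[gene]
        if min(row) <= 0:
            continue  # the geometric mean is undefined with zeros, so skip
        log_geo_mean = sum(math.log(x) for x in row) / n_samples
        for j, x in enumerate(row):
            log_ratios[j].append(math.log(x) - log_geo_mean)
    # Size factor per sample: median ratio to the per-gene geometric mean.
    return [math.exp(median(r)) for r in log_ratios]

# Hypothetical toy counts for two samples: [uninfected, infected].
toy = {
    "human_a": [100, 50],
    "human_b": [200, 100],
    "human_c": [40, 20],
    "viral_a": [1, 1000],
    "viral_b": [2, 3000],
}
human_only = [g for g in toy if g.startswith("human")]

sf_all = median_ratio_size_factors(toy)
sf_human = median_ratio_size_factors(toy, control_genes=human_only)
print(sf_all, sf_human)
```

In this toy example both calls give the same factors, because the median is already determined by the more numerous human genes. With about 20,000 human genes against 10 viral ones, that is likely to hold for the real data as well.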

DESeq2 can normalize based on what you tell it, but when counts change globally it obviously can't infer the "true" technical scaling factor without a little user guidance.


Hi Michael, thanks for the quick response! I doubt that using the viral RNA for normalization will do any good, since we would basically just be looking at the other half of the sequencing reads compared to the standard normalization. As mentioned, the absorbance measurements after RNA extraction indicate that the total RNA concentration remains stable. Do you think it would make any sense to supply size factors based on total library size only?


I don’t have a sense for what’s the right way to normalize this data.

