Question

Normalization / size factors of samples with high expression of few genes (viral infection)

1

René ▴ 40

@rene-5748

Last seen 8.0 years ago

Netherlands

Dear all,

I would like to ask for advice concerning the normalization of an RNA-seq dataset that is characterized by extremely skewed libraries due to a viral infection. Briefly, both human and viral RNAs were quantified, and as can be seen in the table below, reads from viral RNA (= 10 genes) make up ~50% of the entire library in the infected condition. When running DESeq2 on this dataset, the estimated size factors effectively double the counts for cellular RNAs, because they capture the trend of the 20,000 human genes. However, based on the results of separate absorbance measurements, we have reason to believe that cellular RNAs are being degraded at the same time as viral RNAs are being produced. If cellular RNA levels were indeed unchanged, the samples should show a 50% increase of total RNA abundance, whereas we observed that total RNA levels remained equal at best or even dropped slightly. Unfortunately, we did not use spike-in RNAs, so we cannot use them as a reference for normalization.

I am therefore highly interested in your suggestions concerning appropriate size factors for the normalization. Would it make sense to simply use a correction factor for the library size as the size factors (i.e. divide by the total number of counts of "uninfected R1")?

Any help is greatly appreciated!

Kind regards

edit: Additionally, since this behavior might mean that the majority of our genes are in fact down-regulated, one of the underlying assumptions for DESeq2 (most genes do not change) might be violated, so any recommendations on how to deal with such cases would be VERY helpful.

sample	# reads cellular RNA	# reads viral RNA
infected R1	33362929	31800331
infected R2	29730131	37140090
infected R3	34245143	28871640
uninfected R1	58173048	4516
uninfected R2	57644064	2920
uninfected R3	62098823	1557
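
As a quick check of the "~50% viral" figure, the fractions can be recomputed from the table. This is a minimal sketch in plain Python (not DESeq2 code); the dictionary layout and the helper name `viral_fraction` are illustrative assumptions, and the counts are copied from the table above.

```python
# Read counts copied from the table above: (cellular, viral) per sample.
counts = {
    "infected R1": (33362929, 31800331),
    "infected R2": (29730131, 37140090),
    "infected R3": (34245143, 28871640),
    "uninfected R1": (58173048, 4516),
    "uninfected R2": (57644064, 2920),
    "uninfected R3": (62098823, 1557),
}


def viral_fraction(cellular, viral):
    """Share of all reads in one library that come from viral RNA."""
    return viral / (cellular + viral)


# Hypothetical helper output: one fraction per sample.
fractions = {
    sample: viral_fraction(cellular, viral)
    for sample, (cellular, viral) in counts.items()
}

for sample, frac in fractions.items():
    print(f"{sample}: {frac:.2%} of reads are viral")
```

The infected replicates come out between roughly 46% and 56% viral reads, while the uninfected ones stay far below 0.1%.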

deseq2 normalization sizefactors • 1.7k views

ADD COMMENT • link updated 8.0 years ago by Michael Love 43k • written 8.0 years ago by René ▴ 40

score 0 · Answer 1 · 2018-02-02

0

Michael Love 43k

@mikelove

Last seen 2 days ago

United States

Without spike-ins, as you say, we're obviously a bit at a loss to figure out what's going on.

If you want to not use the viral RNA for normalization, you can choose only non-viral genes and provide these to the controlGenes argument of estimateSizeFactors().
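
To illustrate what restricting the normalization to a gene subset means, here is a simplified sketch of a median-of-ratios size factor calculation in plain Python. It is not DESeq2's implementation: the function `size_factors`, its `control_genes` parameter, and the toy counts are all assumptions made up for this example.

```python
import math
from statistics import median


def size_factors(counts, control_genes=None):
    """Median-of-ratios size factors (simplified sketch).

    counts: dict mapping gene name -> list of counts, one per sample.
    control_genes: optional iterable of gene names to restrict to.
    """
    genes = list(counts) if control_genes is None else list(control_genes)
    # Keep only genes with non-zero counts in every sample,
    # so that the log geometric mean is finite.
    usable = [g for g in genes if all(c > 0 for c in counts[g])]
    n_samples = len(next(iter(counts.values())))
    log_geo_means = {
        g: sum(math.log(c) for c in counts[g]) / n_samples for g in usable
    }
    # Per sample: median over genes of log(count / geometric mean).
    return [
        math.exp(
            median(math.log(counts[g][j]) - log_geo_means[g] for g in usable)
        )
        for j in range(n_samples)
    ]


# Toy data (made up): columns are [uninfected, infected].
toy = {
    "host_a": [100, 50],
    "host_b": [200, 100],
    "host_c": [400, 200],
    "viral_1": [10, 4000],
}
host_only = [g for g in toy if g.startswith("host_")]
sf = size_factors(toy, control_genes=host_only)
print(sf)
```

Because the estimate is a median over genes, a handful of viral genes among many host genes would likely move it very little; restricting to control genes mainly makes the reference set explicit.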

DESeq2 can normalize based on what you tell it, but when there are global changes in counts it obviously can't infer the "true" technical scaling factor without a little user guidance.

ADD COMMENT • link 8.0 years ago Michael Love 43k

0

Hi Michael, thanks for the quick response! I doubt that using the viral RNA for normalization will do any good, since we are basically just looking at the other half of the sequencing reads compared to the standard normalization. As mentioned, the absorbance measurements after RNA extraction indicate that the total RNA concentration remains stable. Do you think it would make any sense to supply size factors based on total library size only?
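
For concreteness, size factors based on total library size could be computed as in the following plain Python sketch, using the totals (cellular + viral reads) from the table in the question and "uninfected R1" as the reference. The variable names are illustrative assumptions. In DESeq2 itself, user-supplied values like these can be assigned with `sizeFactors(dds) <- ...` instead of calling estimateSizeFactors(); whether that is appropriate for this dataset is exactly the open question.

```python
# Total reads per sample (cellular + viral), from the table in the question.
totals = {
    "infected R1": 33362929 + 31800331,
    "infected R2": 29730131 + 37140090,
    "infected R3": 34245143 + 28871640,
    "uninfected R1": 58173048 + 4516,
    "uninfected R2": 57644064 + 2920,
    "uninfected R3": 62098823 + 1557,
}

# Reference sample as proposed in the question (an arbitrary choice).
reference = "uninfected R1"

# Library-size factor: each sample's total divided by the reference total.
size_factors = {
    sample: total / totals[reference] for sample, total in totals.items()
}

for sample, factor in size_factors.items():
    print(f"{sample}: {factor:.3f}")
```

With these numbers all six factors fall between about 0.99 and 1.15, so the infected and uninfected libraries would be scaled almost identically.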