Question

Issues with normalizing viral-infected single cell data

Jenny Drnevich ★ 2.0k

@jenny-drnevich-2812

I've inherited some 10X single cell data of human cell lines infected with H1N1 virus and I am having trouble with the proper normalization. Proper normalization of viral-infected data has been a long-standing issue (I found a post of mine from 2006!) because viral infection greatly lowers host RNA production. We aligned to a combined host + virus reference to get counts for both host and the 8 viral genes.

The postdoc who started the analysis pretty much followed the pipeline in the OSCA book (emptyDrops(), filtering cells by min genes, filtering genes by min cells, removing doublets, normalizing with quickCluster() and computeSumFactors()), but the effect of the normalization on infected cells is very weird. There were 3 libraries: infected cells selected by flow cytometry, the unselected "bystander" cells, and mock-infected cells that also underwent a separate flow cytometry run to control for that process. A little less than half the infected cells had over 50% of their reads come from viral genes, and not surprisingly many fewer host genes were detected in the infected cells than in the other two libraries:

https://uofi.box.com/s/5lw4gydlf6p9mhwqk70qgofys8dzgfg4

If I look at the total raw counts coming from host genes, again the infected cells have lower counts, as I would expect:

https://uofi.box.com/s/jgqoinp783jdoralfz6ddh2cc3msao0u

However, the normalization drastically increased the normcount values (from logNormCounts(sce, log = FALSE)), so that the infected cells now have far more total host counts than the other two libraries:

https://uofi.box.com/s/l3c1zsud1e647qdqinx9rwzktos6dmqe

I don't think this is correct. I've seen this before: if all host genes are decreasing, those that are decreasing the least can appear to be "up-regulated" after normalization. Are there alternative normalization methods anyone could suggest? ERCC spike-ins were not used.
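A toy illustration of that effect, with made-up numbers rather than anything from this dataset: every host gene drops in the infected cell, but the gene that drops the least comes out looking induced once both cells are scaled by their host library size. The gene names and counts below are hypothetical.

```r
# Hypothetical host counts for one mock cell and one infected cell.
mock     <- c(geneA = 100, geneB = 100, geneC = 100, geneD = 100)
# Infection reduces every host gene, but geneA is reduced the least.
infected <- c(geneA = 80, geneB = 20, geneC = 20, geneD = 20)

# Library-size factors over host genes, scaled to a mean of 1.
lib <- c(mock = sum(mock), infected = sum(infected))
sf  <- lib / mean(lib)

norm_mock     <- mock / sf[["mock"]]
norm_infected <- infected / sf[["infected"]]

# Infected/mock ratio per gene, before and after normalization.
raw_fc  <- infected / mock
norm_fc <- norm_infected / norm_mock
print(round(rbind(raw_fc, norm_fc), 2))
```

On the raw scale geneA is down to 0.8 of its mock level, yet its normalized ratio is above 1, because the scaling assumes that total host RNA is the same in both cells.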

Thanks!

scan scater single cell normalization • 1.9k views

Updated 5.7 years ago by Aaron Lun ★ 29k • written 5.7 years ago by Jenny Drnevich ★ 2.0k

I was having issues with embedded figures, so I switched them to links on Box.

Reply written 5.7 years ago by Jenny Drnevich ★ 2.0k

score 0 · Answer 1 · 2020-03-26

Aaron Lun ★ 29k

@alun

computeSumFactors can struggle with such extreme cases of DE. The cleanest solution is probably to just remove the viral genes when computing the size factors, using subset.row= in computeSumFactors. This only affects the calculation of the size factors; the viral genes are retained for the rest of the analysis.
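A minimal sketch of the subset.row= approach on simulated data. The counts are random, and the "H1N1_" prefix used to flag viral rows is a hypothetical naming scheme; in a real analysis the flag would come from however the combined reference labels its viral genes.

```r
library(SingleCellExperiment)
library(scran)

set.seed(100)
n_cells <- 300

# Simulated counts, purely for illustration: 500 host genes and 8 viral genes.
host  <- matrix(rpois(500 * n_cells, lambda = 5), nrow = 500)
viral <- matrix(rpois(8 * n_cells, lambda = 200), nrow = 8)
counts <- rbind(host, viral)
rownames(counts) <- c(paste0("HOST", seq_len(500)), paste0("H1N1_", seq_len(8)))
colnames(counts) <- paste0("cell", seq_len(n_cells))
sce <- SingleCellExperiment(assays = list(counts = counts))

# Logical flag for host rows (assumes the hypothetical "H1N1_" prefix).
is_host <- !grepl("^H1N1_", rownames(sce))

# Size factors are estimated from host genes only; all rows stay in the object.
sce <- computeSumFactors(sce, subset.row = is_host, min.mean = 0.1)
summary(sizeFactors(sce))
dim(sce)
```

With real data one would normally also pass clusters= from quickCluster(), as in the OSCA workflow; it is left out here only to keep the toy example small.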

The brave can try using the scaling= option by running computeSumFactors twice. I don't use that much, so I'd be interested to see whether that improves matters; simply feed the size factors from the first run into scaling= for the second run. This probably won't be as good as ignoring the offending viral genes in the first place.

Answer written 5.7 years ago by Aaron Lun ★ 29k

Hi Aaron - thanks for your answer. I did try removing the viral genes before normalization, but it doesn't seem to make any difference - the normcounts for the infected cells still get greatly increased. It must be because the underlying gene distributions are so radically different. Would it be best to do no normalization in this case? Or normalize each library separately? I used that approach way back in 2006 for the Affymetrix data I had, because I needed to do something to get probe set data and that caused fewer radical shifts in expression values than normalizing together.

And the postdoc had previously done the double computeSumFactors like this:

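library(scran)        # quickCluster(), computeSumFactors()
library(BiocSingular) # IrlbaParam()

clusters <- quickCluster(sce, use.ranks=FALSE, BSPARAM=IrlbaParam())

## first pass: size factors from pooling within clusters
sce <- computeSumFactors(sce, min.mean=0.1, clusters=clusters)
## sizeFactors() is the accessor for the values stored in int_colData
sf <- sizeFactors(sce)
## second pass: feed the first-pass size factors into scaling=
sce <- computeSumFactors(sce, min.mean=0.1, scaling=sf)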

I ended up taking it out because I couldn't really find any documentation on what it was doing and it didn't really seem to change the final normcounts very much. Well, taking a look at it again, it does seem to not compress the normcounts for the mock and bystander cells as much:

https://uofi.box.com/s/3cxc2wy9r5b5tdkzhk7bs4t9e10i9a4w

I'm off to try Seurat's different normalizations, but I suspect they will have the same effect.

Reply written 5.7 years ago by Jenny Drnevich ★ 2.0k

It would be worth looking closer at those cells with greatly increased total counts after normalization. This suggests they have very small size factors because - even after removal of viral genes - those cells still have their transcriptomes dominated by a small number of genes. In that case, computeSumFactors will correctly remove the composition biases, leading to the effect observed here.

If you are confident that this is wrong, just use scater::computeLibraryFactors on the host genes.
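For reference, a library size factor is just each cell's total count rescaled so the factors average to 1. The base-R sketch below computes that on host rows only, using a made-up count matrix; the "H1N1_" row names are hypothetical. It is meant to show the arithmetic, not to stand in for the scater function.

```r
# Hypothetical toy counts: 6 host genes and 2 viral genes in 4 cells.
counts <- matrix(
  c(50, 40,  10,   5,
    30, 35,   8,   4,
    20, 25,   5,   3,
    60, 50,  12,   6,
    10, 15,   3,   1,
    30, 35,   2,   1,
     0,  0, 300, 500,
     0,  0, 200, 400),
  nrow = 8, byrow = TRUE,
  dimnames = list(
    c(paste0("HOST", 1:6), "H1N1_PB2", "H1N1_NP"),
    c("mock1", "mock2", "inf1", "inf2")
  )
)

# Flag host rows (assumes the hypothetical "H1N1_" prefix for viral genes).
is_host <- !grepl("^H1N1_", rownames(counts))

# Library size over host genes only, rescaled to a mean of 1.
host_lib <- colSums(counts[is_host, , drop = FALSE])
sf <- host_lib / mean(host_lib)

# Divide every row, viral genes included, by the per-cell factor.
norm <- sweep(counts, 2, sf, "/")
print(round(sf, 3))
print(colSums(norm[is_host, ]))
```

After this scaling every cell has the same total normalized host count, which is exactly the assumption being made: no composition bias among the host genes. With a SingleCellExperiment, factors computed this way could be stored with `sizeFactors(sce) <- sf` before calling logNormCounts().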

Reply written 5.7 years ago by Aaron Lun ★ 29k