I've inherited some 10X single cell data of human cell lines infected with H1N1 virus and I am having trouble with the proper normalization. Proper normalization of viral-infected data has been a long-standing issue (I found a post of mine from 2006!) because viral infection greatly lowers host RNA production. We aligned to a combined host + virus reference to get counts for both host and the 8 viral genes.
The postdoc who started the analysis pretty much followed the pipeline in the OSCA book (emptyDrops(), filtering cells by min genes, filtering genes by min cells, removing doublets, normalizing with quickCluster() and computeSumFactors()), but the effects of the normalization on infected cells are very weird. There were 3 libraries: infected cells selected by flow cytometry, the unselected "bystander" cells, and mock-infected cells that also underwent a separate flow cytometry run to control for that process. A little less than half the infected cells had over 50% of their reads come from viral genes, and not surprisingly many fewer host genes were detected in the infected cells than in the other two libraries:
https://uofi.box.com/s/5lw4gydlf6p9mhwqk70qgofys8dzgfg4
If I look at the total raw counts coming from host genes, the infected cells again have fewer counts, as I would expect:
https://uofi.box.com/s/jgqoinp783jdoralfz6ddh2cc3msao0u
However, the normalization drastically increased the normcounts values (from logNormCounts(sce, log = FALSE)) so that the infected cells now have way more total host counts:
https://uofi.box.com/s/l3c1zsud1e647qdqinx9rwzktos6dmqe
I don't think this is correct - I've seen this before, where if all host genes are decreasing, those that are decreasing the least can appear to be "up-regulated" after normalization. Are there alternative normalization methods anyone could suggest? ERCC spike-ins were not used.
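To make that concern concrete, here is a minimal toy sketch in base R. The counts are made up for illustration, and it uses plain library-size factors computed from host counts rather than the deconvolution from computeSumFactors(), so it only shows the general compositional effect, not what scran actually did to my data.

```r
# Toy host counts (made-up numbers): rows are genes, columns are cells.
# Every host gene is lower in the infected cell than in the mock cell.
counts <- cbind(
  mock     = c(geneA = 100, geneB = 100, geneC = 100, geneD = 100),
  infected = c(geneA = 50,  geneB = 20,  geneC = 20,  geneD = 10)
)

# Library-size factors from host counts, scaled to have mean 1 across cells.
# This is an assumed stand-in for the size factors, not the scran method.
lib_size <- colSums(counts)
size_factors <- lib_size / mean(lib_size)

# Divide each cell (column) by its size factor.
norm_counts <- sweep(counts, 2, size_factors, "/")

# Infected vs mock ratio per gene, before and after normalization.
raw_fc  <- counts[, "infected"] / counts[, "mock"]
norm_fc <- norm_counts[, "infected"] / norm_counts[, "mock"]
print(rbind(raw_fc, norm_fc))
```

In this toy case geneA drops by half in the raw counts, yet after normalization it looks twofold higher in the infected cell, simply because it decreased less than the other genes.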
Thanks!
Having issues with embedded figures - I switched them to links on box