Including infectivity information in RNA-Seq analysis
Ben ▴ 50


I have an RNA-Seq dataset of samples infected with different viruses (virus 1-3) and an uninfected control sample (control).

To some extent, the infectivity of my starting samples varies depending on the virus used to infect them. I would like to know if there is a way to incorporate the information about the infection level of the samples into the RNA-Seq analysis.

In short, I assume that samples infected with virus 1, which show low infectivity, will also show a weaker response in my RNA-Seq data, while samples infected with virus 3, which show high infectivity, will show a stronger response. I also assume that if virus 1 samples had the same level of infection as virus 3 samples, the two groups would look more similar. (An assumption that can easily be criticized, but I want to make it anyway.) I know the extent of the infection and could express it as, e.g., virus 1 = 1/10 of virus 3, and so on. Is there a way to incorporate this information into the RNA-Seq analysis, resulting in a dataset that models equally infected samples?


RNASeqData DESeq2 RNA-Seq RNASeq

You don't incorporate the infection level, because that's confounded with the virus type. In other words, if you fit a model with a factor whose levels are control, virus 1, virus 2, and virus 3, and you want to compare virus 1 to control (and likewise the other two virus types), and all of the virus 1 samples are expected to have the same infectivity, you cannot include infectivity in your model, because saying the group is virus 1 already specifies the infectivity as well. Put another way, consider the following table.

      Type Infectivity
1  Control        None
2  Control        None
3  Control        None
4  Virus 1         Low
5  Virus 1         Low
6  Virus 1         Low
7  Virus 2         Mid
8  Virus 2         Mid
9  Virus 2         Mid
10 Virus 3        High
11 Virus 3        High
12 Virus 3        High

The information conveyed by these two columns is identical from a statistical perspective, as there are simply four groups and the samples in those four groups don't change, so the information provided by the Type column is exactly the same as the Infectivity column.
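
One way to see the confounding concretely is to look at the rank of the design matrix. The following is only a minimal sketch in plain Python, not anything DESeq2 runs internally; the helper names (`matrix_rank`, `dummies`) and the numeric infectivity values are assumptions made up for illustration. In particular, the value for virus 2 is invented, since the thread only says virus 1 is about 1/10 of virus 3.

```python
from fractions import Fraction

def matrix_rank(rows):
    """Rank of a matrix via Gaussian elimination with exact fractions."""
    m = [[Fraction(x) for x in row] for row in rows]
    n_cols = len(m[0]) if m else 0
    rank = 0
    for col in range(n_cols):
        # Look for a row at or below position `rank` with a nonzero entry.
        pivot = next((r for r in range(rank, len(m)) if m[r][col] != 0), None)
        if pivot is None:
            continue
        m[rank], m[pivot] = m[pivot], m[rank]
        # Clear this column in every other row.
        for r in range(len(m)):
            if r != rank and m[r][col] != 0:
                factor = m[r][col] / m[rank][col]
                m[r] = [a - factor * b for a, b in zip(m[r], m[rank])]
        rank += 1
    return rank

def dummies(values, reference):
    """0/1 indicator columns for every level except the reference level."""
    levels = [v for v in dict.fromkeys(values) if v != reference]
    return [[1 if v == lev else 0 for lev in levels] for v in values]

# The twelve samples from the table above.
types = ["Control"] * 3 + ["Virus 1"] * 3 + ["Virus 2"] * 3 + ["Virus 3"] * 3
infectivity = ["None"] * 3 + ["Low"] * 3 + ["Mid"] * 3 + ["High"] * 3

type_cols = dummies(types, "Control")
inf_cols = dummies(infectivity, "None")

# Intercept + Type (4 columns) versus intercept + Type + Infectivity (7 columns).
design_type = [[1] + t for t in type_cols]
design_both = [[1] + t + i for t, i in zip(type_cols, inf_cols)]

# Hypothetical numeric infectivity, one value per group (virus 2 is invented).
dose = {"Control": Fraction(0), "Virus 1": Fraction(1, 10),
        "Virus 2": Fraction(1, 2), "Virus 3": Fraction(1)}
design_numeric = [[1] + t + [dose[v]] for t, v in zip(type_cols, types)]

rank_type = matrix_rank(design_type)
rank_both = matrix_rank(design_both)
rank_numeric = matrix_rank(design_numeric)

print("Type only:          rank", rank_type, "of", len(design_type[0]), "columns")
print("Type + Infectivity: rank", rank_both, "of", len(design_both[0]), "columns")
print("Type + numeric:     rank", rank_numeric, "of", len(design_numeric[0]), "columns")
```

In all three cases the rank stays at four, the number of groups. Adding infectivity, as a factor or as a number, adds columns but no new information, so the extra coefficients cannot be estimated. This is the kind of design that model-fitting tools typically reject as not full rank.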


Thanks James for your comment!

Yes, I realize that if infectivity were another factor in my model, it would be identical to the already present factor of the virus itself, so... not adding anything in the end.

What I think I am looking for is something like a numerical variable that normalizes my RNA-Seq dataset, something along the lines of normalizing count data based on infectivity. Or, alternatively, providing a numerical variable to the model that contains, e.g., scaled fold changes of infectivity? Or weighting low-infectivity samples differently from high-infectivity samples?

Again, I think there are assumptions here that are questionable, and my current feeling is that this is not possible at all, but this was a comment I received and I want to make sure I'm not missing something.

I hope this clarifies what I am looking for.



What you want to do probably doesn't make sense. I get the idea that you might want to identify genes that change in excess of the expected infectivity. So you want to zero out the role of infectivity to show differences that are due to other factors. Or maybe virus 1 will only infect 1% of the cells, and virus 3 will infect 10%, and you want to control for the level of infection. But it's probably much more complicated than that.
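
To put rough numbers on the 1% versus 10% scenario, here is a small back-of-the-envelope sketch. Everything in it is an assumption for illustration: the function name `bulk_fold_change`, the 8-fold per-cell induction, and above all the idea that uninfected cells stay exactly at baseline while every infected cell responds identically.

```python
import math

def bulk_fold_change(infected_fraction, per_cell_fold_change):
    """Fold change seen in a bulk sample when only some cells respond.

    Assumes uninfected cells stay at baseline (1.0) and all infected cells
    show the same fold change. Both are simplifications.
    """
    return (1 - infected_fraction) * 1.0 + infected_fraction * per_cell_fold_change

per_cell = 8.0  # hypothetical induction of one gene in an infected cell

for label, fraction in [("virus 1", 0.01), ("virus 3", 0.10)]:
    fc = bulk_fold_change(fraction, per_cell)
    print(f"{label}: bulk fold change {fc:.2f}, log2 fold change {math.log2(fc):.2f}")
```

Under these assumptions the same per-cell response shows up as a bulk fold change of about 1.07 for virus 1 and about 1.7 for virus 3. The excess over baseline scales with the infected fraction, but the log fold changes that a bulk analysis reports do not scale by a simple factor of ten, and the correction would differ for every gene depending on its unknown per-cell response.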

I recently did an analysis of cell culture that was infected with a virus, and they did scRNA-Seq, which I originally thought was the dumbest thing ever (I mean scRNA-Seq on fibroblast cell culture? Come on). But they had the same goal - to identify changes that occur in the infected cells - and they couldn't really get at that using bulk analysis because it's a variable mixture of infected and uninfected cells, and there is no way to accurately assess the level of infection. And even if you could, how do you adjust for it in a bulk analysis in a way that is defensible?

By using scRNA-Seq and a hybrid genome that included the virus genome, we could identify infected and uninfected cells and then do pseudobulk analyses on the different cell types.
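
The general shape of that approach can be sketched as follows. This is a toy illustration only, not the pipeline used in that analysis: the cell records, gene names, sample names, the `pseudobulk` helper, and the viral-count threshold are all hypothetical.

```python
from collections import defaultdict

# Hypothetical toy data: one record per cell, with the number of reads/UMIs
# that mapped to the viral part of the hybrid genome and a few host counts.
cells = [
    {"sample": "virus3_rep1", "viral_umis": 42, "counts": {"GENE_A": 5, "GENE_B": 0}},
    {"sample": "virus3_rep1", "viral_umis": 0,  "counts": {"GENE_A": 1, "GENE_B": 2}},
    {"sample": "virus3_rep1", "viral_umis": 17, "counts": {"GENE_A": 7, "GENE_B": 1}},
    {"sample": "virus1_rep1", "viral_umis": 0,  "counts": {"GENE_A": 0, "GENE_B": 3}},
    {"sample": "virus1_rep1", "viral_umis": 3,  "counts": {"GENE_A": 4, "GENE_B": 1}},
]

# Assumption: a cell with at least this many viral UMIs is called infected.
# A real analysis would need a more careful cutoff.
VIRAL_UMI_THRESHOLD = 1

def pseudobulk(cells, threshold=VIRAL_UMI_THRESHOLD):
    """Sum host gene counts over cells, grouped by (sample, infection status)."""
    totals = defaultdict(lambda: defaultdict(int))
    for cell in cells:
        status = "infected" if cell["viral_umis"] >= threshold else "uninfected"
        key = (cell["sample"], status)
        for gene, n in cell["counts"].items():
            totals[key][gene] += n
    # Convert nested defaultdicts to plain dicts for easier inspection.
    return {key: dict(genes) for key, genes in totals.items()}

profiles = pseudobulk(cells)
for key in sorted(profiles):
    print(key, profiles[key])
```

Each (sample, status) profile can then be treated like a bulk sample, so infected cells are compared with uninfected cells directly instead of comparing mixtures with unknown proportions.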

But even that analysis turned out to be way more complicated than it had any right to be. But then I'm just a dumb master's guy. Probably there are others here who might have good ideas?

