Question: DESeq2 multiple factors design for host-parasite interaction
0
3.3 years ago by
cs3jd0
cs3jd0 wrote:

Hi All,

I am working on RNAseq data of parasites from three populations (p1, p2, p3) during its interaction between two different hosts (h1 and h2). so finally I have 6 conditions (3 populations X 2 hosts) and each condition have 3 replicates. It has been observed that p1h1, p2h1, p3h1 and p2h2 have same phenotype (S) while p1h2 and p2h2 have another phenotype (R).

 parasite host phenotype p1 h1 S p1 h2 R p2 h1 S p2 h2 R p3 h1 S p3 h2 S

when it comes to DE design, I used two methods:

Method 1. subset read data and calculate DE genes for each factor (parasite, host) independently.

for example: to find DE genes during interaction with different hosts in parasite p1

dds <- DESeqDataSetFromMatrix(countData = ReadCount[which(parasite==p1),],
colData = condition,
design= ~ host )

dds <- DESeq(dds)

HostDiff.p1 <- results(dds, contrast=("host", "h1","h2")      # host effect on p1: 230 DE genes (padj < 0.01)

Method 2: using the whole data, calculate DE genes using multiple factor model including interaction term; then extract DE result for each factor (parasite, host)

dds <- DESeqDataSetFromMatrix(countData = ReadCount[which(parasite==p1),],
colData = condition,
design= ~ parasite + host + parasite:host )

dds <- DESeq(dds)

HostEffect <- results(dds, contrast=list("hosth1","hosth2"))             # main host effect: 1 DE

HostEffect.p1 <- results(cds, contrast=list("parasitep1.hosth1","parasitep1.hosth2"))  # interaction term on parasite p1: 62 DE

HostEffect2.p1 <- results(cds, contrast=list(c("hosth1","parasitep1.hosth1"),c("hosth2","parasitep1.hosth2"))  # host term on parasite p1: 117 DE

I found that number of differential expressed genes due to host difference are quite different based on two methods. HostEffect2.p1 and HostDiff.p1 shared only 35 contigs.

Then, here are my questions:

1. why were DE genes quite different based on two methods, although testing the same factor?

2. which method is more reliable?

3. What is difference on biological meaning  between 'interaction term' result (HostEffect.p1) and 'host term' result (HostEffect2.p1) ? According to the results help page, HostEffect.p1 means "host-parasite interaction effect for host in parasite p1" while HostEffect2.p1 is the sum of "main effect for host" and "interaction effect for host" in parasite p1. This explanation is not that clear to me. Since parasite factor is fixed in interaction term of p1, interaction difference relied only on host effect. while "main effect for host" is still the host effect but just a more general comparison among all parasites. Then if I want to know host effect on parasite p1, which one should I choose, HostEffect.p1 or HostEffect2.p1?

Thank you,

Chun

deseq2 • 791 views
modified 3.3 years ago by Michael Love23k • written 3.3 years ago by cs3jd0
Answer: DESeq2 multiple factors design for host-parasite interaction
1
3.3 years ago by
Michael Love23k
United States
Michael Love23k wrote:

1) The dispersion estimation relies on all the samples in the dds. So removing samples from the dds leads to different estimates of dispersion. See this other post for a description of how small changes to parameter estimates affects the number of differentially expressed genes (DEG):

A: Deseq2 results discrepancies because of versions of R

2) Generally, we recommend estimating dispersion using all of the samples, so not subsetting. One case where subsetting may give better results is if the inter-sample variability (e.g. the spread of points as you could see in a PCA plot -- see vignette) is vastly different across subsets.

3) The host effect for a given parasite is the interaction term plus the main effect term. You can perhaps take a look at the diagrams in the section on Interaction terms in the latest DESeq2 vignette. If it's still not clear, I'd suggest you consult a local statistician who can help interpret the meaning of terms here (which is not unique to DESeq2 but is the same as would result from a standard linear regression).

Hi Michael,

According to PCA plot as well as inter-sample correlation heatmap, libraries were basically clustered based on phenotype --- S and R as shown above. If I subset samples based on phenotype and then use design= ~ parasite + host + parasite:host on S group, it reported error as 'the model matrix is not full rank'. I understood that, it means, for some parasite (p1 and p2), I have only one host in the subseting samples and multiple-factor design cannot work in this case.  Does it mean that I have to use method 1 --- more fine subsetting read data and calculate DEGs for each factor (parasite, host) independently?

Thank you