DESeq2 multiple factors design for host-parasite interaction
1
0
Entering edit mode
cs3jd • 0
@cs3jd-9474
Last seen 8.4 years ago

Hi All,

I am working on RNAseq data of parasites from three populations (p1, p2, p3) during its interaction between two different hosts (h1 and h2). so finally I have 6 conditions (3 populations X 2 hosts) and each condition have 3 replicates. It has been observed that p1h1, p2h1, p3h1 and p2h2 have same phenotype (S) while p1h2 and p2h2 have another phenotype (R). 

parasite host phenotype
p1 h1 S
p1 h2 R
p2 h1 S
p2 h2 R
p3 h1 S
p3 h2 S

when it comes to DE design, I used two methods:

Method 1. subset read data and calculate DE genes for each factor (parasite, host) independently. 

for example: to find DE genes during interaction with different hosts in parasite p1

dds <- DESeqDataSetFromMatrix(countData = ReadCount[which(parasite==p1),], 
                                    colData = condition,
                                    design= ~ host )

dds <- DESeq(dds)

HostDiff.p1 <- results(dds, contrast=("host", "h1","h2")      # host effect on p1: 230 DE genes (padj < 0.01)

Method 2: using the whole data, calculate DE genes using multiple factor model including interaction term; then extract DE result for each factor (parasite, host)

dds <- DESeqDataSetFromMatrix(countData = ReadCount[which(parasite==p1),], 
                                    colData = condition,
                                    design= ~ parasite + host + parasite:host )

dds <- DESeq(dds)

HostEffect <- results(dds, contrast=list("hosth1","hosth2"))             # main host effect: 1 DE 

HostEffect.p1 <- results(cds, contrast=list("parasitep1.hosth1","parasitep1.hosth2"))  # interaction term on parasite p1: 62 DE 

HostEffect2.p1 <- results(cds, contrast=list(c("hosth1","parasitep1.hosth1"),c("hosth2","parasitep1.hosth2"))  # host term on parasite p1: 117 DE

I found that number of differential expressed genes due to host difference are quite different based on two methods. HostEffect2.p1 and HostDiff.p1 shared only 35 contigs. 

Then, here are my questions:

1. why were DE genes quite different based on two methods, although testing the same factor?

2. which method is more reliable?

3. What is difference on biological meaning  between 'interaction term' result (HostEffect.p1) and 'host term' result (HostEffect2.p1) ? According to the results help page, HostEffect.p1 means "host-parasite interaction effect for host in parasite p1" while HostEffect2.p1 is the sum of "main effect for host" and "interaction effect for host" in parasite p1. This explanation is not that clear to me. Since parasite factor is fixed in interaction term of p1, interaction difference relied only on host effect. while "main effect for host" is still the host effect but just a more general comparison among all parasites. Then if I want to know host effect on parasite p1, which one should I choose, HostEffect.p1 or HostEffect2.p1?

Thank you,

Chun

deseq2 • 1.3k views
ADD COMMENT
1
Entering edit mode
@mikelove
Last seen 15 hours ago
United States

Answers to your questions:

1) The dispersion estimation relies on all the samples in the dds. So removing samples from the dds leads to different estimates of dispersion. See this other post for a description of how small changes to parameter estimates affects the number of differentially expressed genes (DEG):

A: Deseq2 results discrepancies because of versions of R

2) Generally, we recommend estimating dispersion using all of the samples, so not subsetting. One case where subsetting may give better results is if the inter-sample variability (e.g. the spread of points as you could see in a PCA plot -- see vignette) is vastly different across subsets.

3) The host effect for a given parasite is the interaction term plus the main effect term. You can perhaps take a look at the diagrams in the section on Interaction terms in the latest DESeq2 vignette. If it's still not clear, I'd suggest you consult a local statistician who can help interpret the meaning of terms here (which is not unique to DESeq2 but is the same as would result from a standard linear regression).

 

ADD COMMENT
0
Entering edit mode

Hi Michael, 

Thank you for the clear answers. One additional question:

According to PCA plot as well as inter-sample correlation heatmap, libraries were basically clustered based on phenotype --- S and R as shown above. If I subset samples based on phenotype and then use design= ~ parasite + host + parasite:host on S group, it reported error as 'the model matrix is not full rank'. I understood that, it means, for some parasite (p1 and p2), I have only one host in the subseting samples and multiple-factor design cannot work in this case.  Does it mean that I have to use method 1 --- more fine subsetting read data and calculate DEGs for each factor (parasite, host) independently? 

Thank you

ADD REPLY
0
Entering edit mode

You can't use the interaction formula here because some of the parasites have only one host (and further there is an issue of no remaining degrees of freedom to fit such a model). When you write out that formula, R then tries to create a column for difference in the host effect for parasite p1, p2, p3, etc. Obviously there is no way to determine the differences in the host h2 vs h1 effect across p1, p2 and p3 because these samples don't exist in p1 and p2 for the S phenotype. I might suggest you first consult with a local statistician at your institute who can help translate your biological questions into hypotheses to test using R formula.

ADD REPLY

Login before adding your answer.

Traffic: 552 users visited in the last hour
Help About
FAQ
Access RSS
API
Stats

Use of this site constitutes acceptance of our User Agreement and Privacy Policy.

Powered by the version 2.3.6