Hi All,
I am working on RNAseq data of parasites from three populations (p1, p2, p3) during its interaction between two different hosts (h1 and h2). so finally I have 6 conditions (3 populations X 2 hosts) and each condition have 3 replicates. It has been observed that p1h1, p2h1, p3h1 and p2h2 have same phenotype (S) while p1h2 and p2h2 have another phenotype (R).
parasite | host | phenotype |
p1 | h1 | S |
p1 | h2 | R |
p2 | h1 | S |
p2 | h2 | R |
p3 | h1 | S |
p3 | h2 | S |
when it comes to DE design, I used two methods:
Method 1. subset read data and calculate DE genes for each factor (parasite, host) independently.
for example: to find DE genes during interaction with different hosts in parasite p1
dds <- DESeqDataSetFromMatrix(countData = ReadCount[which(parasite==p1),], colData = condition, design= ~ host ) dds <- DESeq(dds) HostDiff.p1 <- results(dds, contrast=("host", "h1","h2") # host effect on p1: 230 DE genes (padj < 0.01)
Method 2: using the whole data, calculate DE genes using multiple factor model including interaction term; then extract DE result for each factor (parasite, host)
dds <- DESeqDataSetFromMatrix(countData = ReadCount[which(parasite==p1),], colData = condition, design= ~ parasite + host + parasite:host ) dds <- DESeq(dds) HostEffect <- results(dds, contrast=list("hosth1","hosth2")) # main host effect: 1 DE HostEffect.p1 <- results(cds, contrast=list("parasitep1.hosth1","parasitep1.hosth2")) # interaction term on parasite p1: 62 DE HostEffect2.p1 <- results(cds, contrast=list(c("hosth1","parasitep1.hosth1"),c("hosth2","parasitep1.hosth2")) # host term on parasite p1: 117 DE
I found that number of differential expressed genes due to host difference are quite different based on two methods. HostEffect2.p1 and HostDiff.p1 shared only 35 contigs.
Then, here are my questions:
1. why were DE genes quite different based on two methods, although testing the same factor?
2. which method is more reliable?
3. What is difference on biological meaning between 'interaction term' result (HostEffect.p1) and 'host term' result (HostEffect2.p1) ? According to the results
help page, HostEffect.p1
means "host-parasite interaction effect for host in parasite p1" while HostEffect2.p1
is the sum of "main effect for host" and "interaction effect for host" in parasite p1. This explanation is not that clear to me. Since parasite factor is fixed in interaction term of p1, interaction difference relied only on host effect. while "main effect for host" is still the host effect but just a more general comparison among all parasites. Then if I want to know host effect on parasite p1, which one should I choose, HostEffect.p1
or HostEffect2.p1
?
Thank you,
Chun
Hi Michael,
Thank you for the clear answers. One additional question:
According to PCA plot as well as inter-sample correlation heatmap, libraries were basically clustered based on phenotype --- S and R as shown above. If I subset samples based on phenotype and then use
design= ~ parasite + host + parasite:host
on S group, it reported error as 'the model matrix is not full rank
'. I understood that, it means, for some parasite (p1 and p2), I have only one host in the subseting samples and multiple-factor design cannot work in this case. Does it mean that I have to use method 1 --- more fine subsetting read data and calculate DEGs for each factor (parasite, host) independently?Thank you
You can't use the interaction formula here because some of the parasites have only one host (and further there is an issue of no remaining degrees of freedom to fit such a model). When you write out that formula, R then tries to create a column for difference in the host effect for parasite p1, p2, p3, etc. Obviously there is no way to determine the differences in the host h2 vs h1 effect across p1, p2 and p3 because these samples don't exist in p1 and p2 for the S phenotype. I might suggest you first consult with a local statistician at your institute who can help translate your biological questions into hypotheses to test using R formula.