Question: DESeq2- Another Multi-factor Question
0
12 months ago by
wlorenz0
wlorenz0 wrote:

I have a data set that I discovered had a secondary component (sex of the individuals) after I had already processed it without that factor.

I went back and reran the analysis including the second factor (see META file below); however, the results are a bit perplexing and I'm trying to figure out if i) I'm approaching the multi-factor analysis correctly and ii) an explanation of the result: two expt'l lab populations of the same species show an opposite effect w.r.t. inclusion of the second factor.

There are 4 groups (2 control and 2 expt'l) and there is also an identically designed expt. run for a second population.Reads were originally mapped to a de novo transcriptome using the Trinity RSEM wrapper and an expected count matrix was constructed using genes.results files.

The Expt'l set up follows:

META

Sample Contrast Secondary

L_rep1 EXPT1 M

L_rep2 EXPT1 F

L_rep3 EXPT1 M

L_rep4 EXPT1 F

L_rep5 EXPT1 F

L_rep6 EXPT1 M

S_rep1 EXPT2 M

S_rep2 EXPT2 F

S_rep3 EXPT2 M

S_rep4 EXPT2 F

S_rep5 EXPT2 F

S_rep6 EXPT2 M

LC_rep1 CTRL1 M

LC_rep2 CTRL1 M

LC_rep3 CTRL1 M

LC_rep4 CTRL1 F

LC_rep5 CTRL1 F

LC_rep6 CTRL1 F

SC_rep1 CTRL2 M

SC_rep2 CTRL2 M

SC_rep3 CTRL2 M

SC_rep4 CTRL2 F

SC_rep5 CTRL2 F

SC_rep6 CTRL2 F

#First time run without 2nd factor

dds <- DESeqDataSetFromMatrix(countData=COUNTS, colData=META, design = ~Contrast)

nrow(dds)

#204293

dds1 <- dds[ rowSums(counts(dds)) > 100, ]

#Check rows again

nrow(dds1)

#42785

res1 <- results(dds1, contrast=c("Contrast","EXPT1", "CTRL1"))

res1 <- results(dds1, contrast=c("Contrast","EXPT2", "CTRL2"))

#Now repeat above analysis and include the 2nd factor

dds <- DESeqDataSetFromMatrix(countData=COUNTS, colData=META, design = ~Secondary + Contrast)

In one population the response to treatment (infection) is very robust while in the second population the response is strong, but to a lesser degree than the first.  Inclusion of the secondary factor increases the total # of DEGs identified in the first set, but decreases the # of DEGs in the second set as shown below:

Experimental_set Comparison #DEGS @ FDR 0.01

Population 1 GroupL 6507

Population 1 GroupL_2nd_factor 7055

Population 1 GroupS 8303

Population 1 GroupS_2nd_factor 8879

Population 2 GroupL 593

Population 2 GroupL_2nd_factor 539

Population 2 GroupS 2504

Population 2 GroupS_2nd_factor 2398

For example, in the first population there are ~400 to 500 more DEGs identified with second factor inclusion, but in the second populationthere are ~50 to 100 fewer DEGs with second factor inclusion.  A priori I didn't know what expect (or if I'm explaining correctly), but the second result seems to makes more sense to me, that is there were genes identified originally whose expression differences were due to/driven by the sex of the individuals rather than by the treatment and these genes were identified and eliminated with inclusion of the 2nd factor.  But what about the first set where the 2nd factor leads to identification of more genes than without it?  I just didn't expect the two to go in opposite directions...  Is this something that is commonly seen?

deseq2 • 220 views
modified 12 months ago by Michael Love22k • written 12 months ago by wlorenz0
0
12 months ago by
Michael Love22k
United States
Michael Love22k wrote:

It can definitely go either way: you mention a plausible way that the list can be reduced. It can also increase if adding the variable helps to reduce the unexplained variance in the counts.

Thanks, Michael.  One more question.  In the final dds1 object I created for the 2nd factor analyses above there is a result called Secondary_M_vs_F.  Can you describe exactly what this represents?

For example, if I output it as follows: res3 <- results(dds1, contrast=c("Secondary","M", "F")) and write to a table, in Pop1 (where 2nd factor leads to increase in DEGs), about half of the genes listed in M_vs_F intersect with total DEGs from Groups L + S.  In Pop2 (where 2nd factor leads to decrease in DEGs), there are far fewer genes in M_vs_F and only 1 of them intersects with Groups L+ S.  In all cases I am filtering DEGs by the same FDR as above, 0.01

When you write ~A + B, R builds a design matrix which encodes a set of coefficients for each sample. You will get a coefficient for each non-reference level of A and likewise for B, which represent the differences to the reference level. resultsNames() could be called coefficientNames() and maybe it would be more straightforward, because whether the coefficients are "results" or nuisance variables is up to interpretation of the analyst and the dataset at hand. So when you run ~A + B with DESeq2, some of the coefficients are associated with differences associated with levels of A and some with differences associated with levels of B. This is often written in papers as "the effect of B controlling for A", but the design matrix doesn't care if you switch the order of A and B, or which one is the result and which one is the nuisance variable.