Question

Isolate contrasts in DESeq2 while using a combination of categorical and continuous variables

0

Entering edit mode

tiansy • 0

@db656029

Last seen 2.1 years ago

Germany

Dear community members, I am currently using DESeq2 to analyze RNASeq data from plant roots that were colonized by a fungus. But I am facing some problems regarding contrast with a continuous variable. If you have any suggestions or comments it would be great if you can let me know. Thank you in advance!

The purpose of the project is to identify downstream targets of a few plant transcription factors (let’s call them gene A,B,C,D,E,F) that are essential for colonization of a fungus. As experimental set-up, wildtype (WT) plants and mutants of those transcription factors (let’s call them mutant_A, mutant_B, mutant_C, mutant_D, mutant_E, mutant_F) were treated without or with fungal inoculation (let’s call them either mock or inoculated). Then root samples were harvested and used for transcriptome analyses. Either 3 or 4 replicates were used for each genotype x treatment.

However, fungal colonization itself induces a big change of plant transcriptome, that means thousands of genes that are essential for colonization are induced in colonizaed root parts, while genes that are inhibiting colonization are suppressed. As mutants used in this research missing genes that are essential for colonization, I got a quantitative difference in fungal colonization level in my mutants compared to wildtype. This can be reflected by the proportion of fungal reads in each sample. Please see table below. In general, I have ~12-18% of fungal reads in my WT, but only ~6-9% in mutant F, ~2-4% in mutant E, ~0.2-0.5% in mutant C and D..etc.

Sample  % Fungal reads
WT_Mock1    0.00682819
WT_Mock2    0.008894
WT_Mock3    0.00818782
WT_Mock4    0.00665274
WT_Inoculated1  12.3860566
WT_Inoculated2  18.1908555
WT_Inoculated3  18.4706838
WT_Inoculated4  16.5686164
Mutant_A_Mock1  0.01233013
Mutant_A_Mock2  0.01424303
Mutant_A_Mock3  0.0073198
Mutant_A_Mock4  0.00603092
Mutant_A_Inoculated1    0.01972266
Mutant_A_Inoculated2    0.01211499
Mutant_A_Inoculated3    0.0165109
Mutant_A_Inoculated4    0.01508397
Mutant_B_Mock1  0.00791082
Mutant_B_Mock2  0.00667513
Mutant_B_Mock3  0.0069918
Mutant_B_Mock4  0.00672951
Mutant_B_Inoculated1    0.57389944
Mutant_B_Inoculated2    0.37813846
Mutant_B_Inoculated3    0.02073746
Mutant_B_Inoculated4    0.06448313
Mutant_C_Mock1  0.00470295
Mutant_C_Mock2  0.00802484
Mutant_C_Mock3  0.00217021
Mutant_C_Inoculated1    0.1567094
Mutant_C_Inoculated2    0.46984465
Mutant_C_Inoculated3    0.32510805
Mutant_D_Mock1  0.0035514
Mutant_D_Mock2  0.06521256
Mutant_D_Mock3  0.00407177
Mutant_D_Mock4  0.00368014
Mutant_D_Inoculated1    0.50073689
Mutant_D_Inoculated2    0.27329334
Mutant_D_Inoculated3    0.32021375
Mutant_D_Inoculated4    0.20501901
Mutant_E_Mock1  0.00529136
Mutant_E_Mock2  0.00808311
Mutant_E_Mock3  0.0069625
Mutant_E_Mock4  0.00199783
Mutant_E_Inoculated1    1.90555662
Mutant_E_Inoculated2    2.60344722
Mutant_E_Inoculated3    3.84246065
Mutant_E_Inoculated4    2.48234016
Mutant_F_Mock1  0.00801506
Mutant_F_Mock2  0.00715424
Mutant_F_Mock3  0.00829953
Mutant_F_Mock4  0.00894405
Mutant_F_Inoculated1    8.36066198
Mutant_F_Inoculated2    6.37187557
Mutant_F_Inoculated3    9.0976988
Mutant_F_Inoculated4    7.07773639

I am interested in the plant genes differentlly regulated in the mutants vs WT. However, due to the different colonization level, all colonization-associated genes are differientially expressed if I simply compare my samples by define them as either Inoculated or Mock. To (kind of) correct for the effect of different fungal colonization level on gene expression, I integrated % of fungal reads into my design formula, instead of using categorical variable (e.g. Inoculated and Mock).

My design is:

design= ~ Fungal_biomass + Mutation + Mutation:Fungal_biomass

Mutation: define the genotype, for example WT or Mutant_A. Fungal_biomass: percentage of fungal reads after scaling, I took care that they are numeric.

I got a first question here: After running the model, I got an error message here(please see below). I found that some people had same problem before. Then I followed their solutions. But either filter out low expressed genes or using large maxit such as 50000 didn’t solve this problem.

## 12099 rows did not converge in beta, labelled in mcols(object)$betaConv. Use larger maxit argument with nbinomWaldTest

I continued anyway to inspect my data. Then I tried to isolate some contrasts. I got my second question here: I am a bit confused with isolating contrast of a continuous variable. For example, by using code below, do I isolate fungal colonization-associated plant genes in my reference level (here WT) or in all samples?

res <- results(dds,name = "Fungal_biomass") 
summary(res$padj < 0.05 & res$log2FoldChange >1)
##    Mode   FALSE    TRUE 
## logical   22596     410

I thought I isolated colonization-induced genes in WT because it's the reference level, but 410 is much less than the number I expected. If I directly compare WT_Inoculated vs WT_Mock (set a categorical variable as either Inoculated or Mock), I can identify ~1400 genes using the same cut-off.

I tried another strategy to isolate contrast. In my metatable, I have a column named “Fungal_treatment” defining my samples are Inoculated or mock (although this column was not used in my design formula). But I managed to isolate contrast in this way:

WT_Inoculated <- colMeans(mod_mat[dds$Mutation == "WT" & dds$Fungal_treatment == "Inoculated",])
WT_mock <- colMeans(mod_mat[dds$Mutation == "WT" & dds$Fungal_treatment == "Mock",])
res_WT_Inoculated_vs_mock <- results(dds, contrast = WT_Inoculated - WT_mock)
summary(res_WT_Inoculated_vs_mock <- $padj < 0.05 & res_WT_Inoculated_vs_mock <- $log2FoldChange >1)
##    Mode   FALSE    TRUE 
## logical   21801    1205

In this way, I got a number that is expected (although slightly less than 1400). Here I have my third question: did I isolate the same contrast usign two methods? I thought so but the numbers are so different so I was confused.

Then the Last question: If I want to identify genes that were not induced by fungal colonization in mutant_F but were induced in WT while controling for different fungal colonization level in mutant F and WT (so those genes might be downstream target of gene F, but I want to rule out genes that were affected by the ~50% lower colonization level in mutant F compared to WT), shall I call results(dds,name = "Fungal_biomass.Mutation_Mutant_F" ), or results(dds,name = c(list (“Fungal_biomass”), list("Fungal_biomass.Mutation_Mutant_F" ))?

Below shows my resultnames:

resultsNames(dds)
##  [1] "Intercept"                               
##  [2] "Fungal_biomass"   
##  [3] "Mutation_Mutant_A_vs_WT"                        
##  [4] "Mutation_Mutant_B_vs_WT"        
##  [5] "Mutation_Mutant_C_vs_WT"           
##  [6] "Mutation_Mutant_D_vs_WT"           
##  [7] "Mutation_Mutant_E_vs_WT"           
##  [8] "Mutation_Mutant_F_vs_WT"           
## [9] "Fungal_biomass.Mutation_Mutant_A"   
## [10] "Fungal_biomass.Mutation_Mutant_B"  
## [11] "Fungal_biomass.Mutation_Mutant_C"  
## [12] "Fungal_biomass.Mutation_Mutant_D"  
## [13] "Fungal_biomass.Mutation_Mutant_E"  
## [14] "Fungal_biomass.Mutation_Mutant_F"

Any answers are appreciated. Thanks a lot!

Contrast design DESeq2 continuous • 735 views

ADD COMMENT • link updated 2.4 years ago by Michael Love 41k • written 2.4 years ago by tiansy • 0

score 0 · Answer 1 · 2021-12-06

I'd recommend collaborating with a local statistician or someone familiar with linear models in R, in order to plan the design and analysis. Unfortunately, I have to restrict my time on the support site to software related questions, but can't field statistical consultation questions here.

Regarding the non-converging rows, can you center and scale fungal biomass before running DESeq()? E.g.:

mu <- mean(dds$Fungal_biomass)
sigma <- sd(dds$Fungal_biomass)
dds$Fungal_biomass <- (dds$Fungal_biomass - mu) / sigma
# then DESeq()

this is usually printed as instructions when you create a dds object with a numeric covariate, unless you are using an older version of DESeq2.