Question

How to interpret coefficients of time course modeled data with DESeq2?

0

Entering edit mode

agsousa • 0

@agsousa-23429

Last seen 5.8 years ago

Hello,

I've some bulk-RNA-seq data and I want to perform differential gene expression analysis. This data came from a time course experiment where along a set of time points, wild-type and strain individuals were sampled.

I'm familiar with pairwise comparisons in differential gene expression analysis. However, since this data came from a longitudinal study, I think it would be interesting to search for genes differentially expressed over time instead across specific time points through pairwise comparisons of strain vs. WT (Population) at time 0, 4, 8, 12 hours (Time).

Therefore I was reading the section "9 Time course experiments" (from the vignette: http://master.bioconductor.org/packages/release/workflows/vignettes/rnaseqGene/inst/doc/rnaseqGene.html#time-course-experiments ) in order to understand what to do and how to interpret the results. I tried my own with some data and it just worked fine. The code that I've run appears below:

condition <- c("S_0H", "WT_0H", "S_4H", "WT_4H", "S_8H", "WT_8H", "S_12H", "WT_12H") # conditions

## Metadata 
#
metaData <- data.frame(
  "Condition" = c(
    rep(condition, each = 3) 
  ),
  row.names = colnames(cds_ord)
)

coldata <- data.frame("Condition" = metaData$Condition, 
                      "Time" = stringr::str_remove_all(metaData$Condition, "^S_|^WT_"), 
                      "Population" = gsub("[^SWT]", "", metaData$Condition) ) %>% 
  `rownames<-`(rownames(metaData)) # get metadata to use in the heatmap
coldata$Condition <- factor( coldata$Condition, 
                             levels = condition ) # order 'Condition' var
coldata$Time <- factor( coldata$Time, 
                        levels = c("0H", "4H", "8H", "12H") ) # order 'Time' var

### DESeq2 pipeline - time dependent model

## Metadata
#
metaData <- coldata
metaData$Population <- factor(metaData$Population, 
                              levels = c("WT", "S"))

## DESeq2 pipeline
#
set.seed(2020) # set seed
deseq <- DESeqDataSetFromMatrix(countData = cds_ord, 
                                colData = metaData, 
                                design = ~ Population + Time + Population:Time) # import data to DESeq2

deseq_time <- DESeq( object = deseq, test = "LRT", reduced = ~ Population + Time )
dge_time <- results( object = deseq_time, 
                     alpha = 0.05 )

## Format the DGE tbl
#
dge_time <- dge_time %>% 
    as.data.frame(.) %>% 
    mutate("Geneid" = rownames(.)) %>% # add 'Geneid' col
    select(c("Geneid", colnames(.)[colnames(.) != "Geneid"])) %>% # order cols
    arrange(padj)

## Save the DGE tbl
#
write_tsv(x = dge_time, 
          path = "../results/tables/dge_pop_time_exp.tsv")

comps <- resultsNames(deseq_time)[-1]
for ( ele in comps ) {
  data_tbl <- results( object = deseq_time, name = ele, test = "Wald" )
  data_tbl <- data_tbl %>%
    as.data.frame(.) %>%
    mutate("Geneid" = rownames(.)) %>% # add 'Geneid' col
    select(c("Geneid", colnames(.)[colnames(.) != "Geneid"])) %>% # order cols
    arrange(padj)
  write_tsv( x = data_tbl, path = paste0("../results/tables/", ele, "_pop_time_exp_dge.tsv") )
}

However I'm not sure how I should interpret the results. At this point I have the table dge_pop_time_exp.tsv that results from dge_time object and 7 tables where I performed the Wald tests for each coefficient (I did not include the interception): "Population_S_vs_WT", "Time_4H_vs_0H", "Time_8H_vs_0H", "Time_12H_vs_0H", "PopulationS.Time4H", "PopulationS.Time8H", "PopulationS.Time12H".

So, my questions are:

(1) What contains the dge_time/dge_pop_time_exp.tsv?

When I performed this I thought that included all the genes differentially expressed which at one or more time points after time 0 showed a population-specific effect. However when I print the dge_time appears:

`log2 fold change (MLE): PopulationS.Time12H

LRT p-value: '~ Population + Time + Population:Time' vs '~ Population + Time' `

Since it appeared the interaction I'm a bit confusing.

(2) Which comparisons were made in the other tables obtained with Wald tests?

From the vignette and this post ( https://www.biostars.org/p/188808/ ) I would say that tables "Time_4H_vs_0H", "Time_8H_vs_0H", "Time_12H_vs_0H" contain the genes differentially expressed between the time points highlighted, i.e., 4H vs. 0H, 8H vs. 0H, 12H vs. 0H, for the WT population. I would say that the table "Population_S_vs_WT contains the genes differentially expressed between populations, i.e., population S vs. pop. WT, at time 0H. The interaction is more difficult to interpret. From what I've read in the documentation and Biostar post, for instance in the case of "PopulationS.Time4H" it contains the genes differentially expressed across population S vs. WT, at time point 4H and also the "additional difference" between population S vs. WT, at time point 0H. Is that right?

Sorry for the long post.

Thank you in advance.

António

deseq2 RNA • 1.3k views

ADD COMMENT • link updated 5.8 years ago by Michael Love 43k • written 5.8 years ago by agsousa • 0

score 1 · Answer 1 · 2020-04-28

For interpretation or coefficients in a linear model with interactions, read over the interactions section of the vignette and if it’s still not clear you will have to find a statistical collaborator who can walk you through the interpretation. I’ve tried to make it clear in the vignette with diagrams but it’s important to make sure you’ve got it correct. I have to limit myself here out of time to questions about software usage, and questions about linear models are important but these are best answered by consulting with a local statistician.