Subsetting data versus contrasts [DESeq2]?
js101 • 0

What is the best way to compare a subset of data for differential expression analysis? The DESeq2 vignette says to use contrasts, but most people in the literature seem to skip contrasts and pre-subset their data instead.

Can someone explain the best way to do this? What are the differences between the methods? I understand that subsetting the data alters the dispersion estimates, but what about case #1 versus case #2 below? Don't they both consider all the samples? Does altering the dispersion estimates matter?

For example, let's say that I have extracted RNA from treated and control tissue at two time points, 24 hours and 48 hours. I am interested in which genes are differentially expressed between treated and control at 48 hours. The way I see it, there are three ways to do this, but I am not sure which one is correct.

  1. Use contrasts

This is the method recommended in the vignette. Here I have a separate column in the metadata for each factor: a treatment column containing treated or control, and a time_point column containing 48h or 24h.

dds <- DESeqDataSetFromMatrix(
  countData = counts,
  colData   = meta, # metadata contains a treatment column and a time_point column
  design    = ~ time_point + treatment + time_point:treatment
)
dds <- DESeq(dds) # fit the model; results() needs a fitted object

# treatment effect at 48h = main treatment effect + interaction term
res <- results(
  dds,
  contrast = list(
    c("treatment_treated_vs_control",
      "time_point48h.treatmenttreated")
  )
)
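
Why the contrast in #1 lists two coefficients can be checked with a minimal base R sketch (no DESeq2 needed). It assumes 24h and control are the reference levels; `toy_meta` and the other object names are hypothetical. The column names printed by `model.matrix()` differ from the names DESeq2 reports, which `resultsNames(dds)` lists for a fitted object.

```r
# Hypothetical toy metadata: 2 time points x 2 treatments, 2 replicates each
toy_meta <- expand.grid(
  replicate  = 1:2,
  treatment  = c("control", "treated"),
  time_point = c("24h", "48h")
)
# Assumption: control and 24h are the reference levels
toy_meta$treatment  <- factor(toy_meta$treatment,  levels = c("control", "treated"))
toy_meta$time_point <- factor(toy_meta$time_point, levels = c("24h", "48h"))

# Same design formula as in case #1
X <- model.matrix(~ time_point + treatment + time_point:treatment, data = toy_meta)

# Design rows for the two groups being compared at 48h
is_48h_treated <- toy_meta$time_point == "48h" & toy_meta$treatment == "treated"
is_48h_control <- toy_meta$time_point == "48h" & toy_meta$treatment == "control"

# Difference of the group rows = weights on each coefficient in the comparison
effect_48h <- colMeans(X[is_48h_treated, , drop = FALSE]) -
  colMeans(X[is_48h_control, , drop = FALSE])
print(effect_48h)
```

The non-zero weights fall on the treatment main effect and the interaction term, which is why both coefficients appear in the contrast list for the 48h comparison.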

  2. Subset the data using metadata

Here the metadata would contain a single column (groups) that combines treatment and time instead of two separate columns. For example, treated_24h, control_24h, treated_48h, control_48h.

dds <- DESeqDataSetFromMatrix(
  countData = counts,
  colData   = meta,
  design    = ~ groups
)
dds <- DESeq(dds) # fit the model on all samples

res <- results(dds, contrast = c("groups", "treated_48h", "control_48h"))
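
On whether #1 and #2 both use all the samples: as far as I can tell, the two designs are two parameterizations of the same four-group model. The sketch below illustrates this with an ordinary linear model in base R, as an analogy rather than DESeq2's negative binomial fit. The object names and the simulated response `y` are hypothetical.

```r
# Hypothetical toy metadata: 2 time points x 2 treatments, 3 replicates each
toy_meta <- expand.grid(
  replicate  = 1:3,
  treatment  = c("control", "treated"),
  time_point = c("24h", "48h")
)
toy_meta$treatment  <- factor(toy_meta$treatment,  levels = c("control", "treated"))
toy_meta$time_point <- factor(toy_meta$time_point, levels = c("24h", "48h"))
# Combined factor as in case #2 (levels sort alphabetically; control_24h is the reference)
toy_meta$groups <- factor(paste(toy_meta$treatment, toy_meta$time_point, sep = "_"))

set.seed(1)
toy_meta$y <- rnorm(nrow(toy_meta)) # simulated response, for illustration only

fit_interaction <- lm(y ~ time_point + treatment + time_point:treatment, data = toy_meta)
fit_groups      <- lm(y ~ groups, data = toy_meta)

# Both parameterizations give identical fitted values
same_fit <- isTRUE(all.equal(fitted(fit_interaction), fitted(fit_groups)))

# Treated vs control at 48h, computed each way
b_int <- coef(fit_interaction)
b_grp <- coef(fit_groups)
effect_case1 <- unname(b_int["treatmenttreated"] + b_int["time_point48h:treatmenttreated"])
effect_case2 <- unname(b_grp["groupstreated_48h"] - b_grp["groupscontrol_48h"])

print(c(same_fit = same_fit, case1 = effect_case1, case2 = effect_case2))
```

Under that reading, the choice between #1 and #2 is mostly about which contrast is easier to write, while #3 differs because it fits the model on fewer samples.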

  3. Pre-subset the data in R and then run DESeq2

Here I could use either metadata table, but I will use the one from #1.

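meta_48h   <- meta[meta$time_point == "48h", ] # R object names cannot start with a digit
counts_48h <- counts[, rownames(meta_48h)]     # assumes rownames(meta) match colnames(counts)

dds <- DESeqDataSetFromMatrix(
  countData = counts_48h,
  colData   = meta_48h,
  design    = ~ treatment
)
dds <- DESeq(dds) # fit the model on the 48h samples only

res <- results(dds, contrast = c("treatment", "treated", "control"))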

So which way is "correct"? What are the pros and cons?

Thank you!

ATpoint ★ 5.0k

This has been asked many times before; please search for relevant posts. In short: it is generally least tedious to keep the data together and use contrasts. A good argument for splitting is when the subset you are interested in differs strongly from the rest of the dataset in composition or in observed within-group variability.

See also the FAQ: https://bioconductor.org/packages/release/bioc/vignettes/DESeq2/inst/doc/DESeq2.html#if-i-have-multiple-groups-should-i-run-all-together-or-split-into-pairs-of-groups
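
To illustrate the within-group variability argument, here is a minimal base R sketch using a Gaussian linear model as an analogy. DESeq2 instead estimates per-gene dispersions under a negative binomial model, so this only shows the general idea. The simulated data, the noise levels, and all object names are assumptions made for illustration.

```r
set.seed(42)
n <- 6 # replicates per group (hypothetical)

toy <- data.frame(
  time_point = rep(c("24h", "48h"), each = 2 * n),
  treatment  = rep(rep(c("control", "treated"), each = n), times = 2)
)

# Assumption: the 24h samples are much noisier than the 48h samples
noise_sd <- ifelse(toy$time_point == "24h", 5, 0.5)
toy$y <- rnorm(nrow(toy), mean = 10, sd = noise_sd)

# One model for all samples: a single residual SD pooled over all four groups
fit_all <- lm(y ~ time_point * treatment, data = toy)
sd_all  <- summary(fit_all)$sigma

# Model fitted to the 48h subset only: residual SD from the 48h groups alone
fit_48h <- lm(y ~ treatment, data = subset(toy, time_point == "48h"))
sd_48h  <- summary(fit_48h)$sigma

print(c(pooled = sd_all, subset_48h = sd_48h))
```

In this toy setup the pooled estimate is inflated by the noisy 24h groups, so a 48h comparison made with the pooled model would be less sensitive than one made on the subset. When the groups have similar variability, pooling instead gives a more stable estimate from more samples.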
