Question

DESeq2: Multiple Comparisons with Different Conditions and Replicates

0

Entering edit mode

vanbelj ▴ 30

@vanbelj-21216

Last seen 9 weeks ago

United States

I want to compare publicly available RNA-Seq datasets. My end goal is to obtain a set of genes that are collectively up/down regulated across all datasets or in a subset of Treatments. Can I use a master sample table containing all datasets and use Interactions to direct desired comparisons, if the datasets contain a differing number of replicates and different treatments?

These datasets contain a differing number of biological replicates (2 to 4) and different treatment conditions (>5). About half of them have treatment time points, but the time points between datasets are different. I have quantified the Gene-Counts with Salmon.

I am not fluent in R, and so I originally intended to create a new sample table for each comparison that I wanted to make. However this quickly became unfeasible as I tried to manually rename each variable for the new comparison.

Next I tried creating a "master" sample table with the intention of subsetting the samples for comparison (a shortened version of the table is below). I intended to use Interactions, similar to what is describe for genotype comparisons in the Vignette. However, I either get the incorrect comparisons or the Model matrix not full rank error.

Essentially, for each dataset I want to compare the treated condition(s) to the control. Is this possible when the datasets have varying treatment conditions and a varying number of replicates? Or will I have to create independent sample tables for each comparison?

Thanks for any help!

#First I make a dds against all groups (grouped by biological replicates only)
ddsTxi <- DESeqDataSetFromTximport(txi,
                                   colData = samples,
                                   design = ~ Condition)

#Then I try to define Interactions
ddsTxi$group <- factor(paste0(ddsTxi$GSE, ddsTxi$Group))
design(ddsTxi) <- ~ group
ddsTxi <- DESeq(ddsTxi)
results(ddsTxi, contrast=c("group", "Control", "Treatment"))

This code results in comparison to other datasets, but doesn't actually compare Control to Treatment within a dataset (using my full sample table)

> resultsNames(ddsTxi)
 [1] "Intercept"                                    "group_GSE127851Treatment_vs_GSE127851Control"
 [3] "group_GSE135568Control_vs_GSE127851Control"   "group_GSE135568Treatment_vs_GSE127851Control"
 [5] "group_GSE140504Control_vs_GSE127851Control"   "group_GSE140504Treatment_vs_GSE127851Control"
 [7] "group_GSE148166Control_vs_GSE127851Control"   "group_GSE148166Treatment_vs_GSE127851Control"
 [9] "group_GSE152916Control_vs_GSE127851Control"   "group_GSE152916Treatment_vs_GSE127851Control"
[11] "group_GSE80512Control_vs_GSE127851Control"    "group_GSE80512Treatment_vs_GSE127851Control" 
[13] "group_GSE94495Control_vs_GSE127851Control"    "group_GSE94495Treatment_vs_GSE127851Control" 
[15] "group_GSE98352Control_vs_GSE127851Control"    "group_GSE98352Treatment_vs_GSE127851Control"

Example Sample Table

Name	GSE	SRA_Experiment	Treatment	Condition	Group
WT1_0min-SRR9929263_quant	GSE135568	SRR9929263	39C_0min	39C_0min_GSE135568	Control
WT1_20min-SRR9929265_quant	GSE135568	SRR9929265	39C_20min	39C_20min_GSE135568	Treatment
WT1_120min-SRR9929266_quant	GSE135568	SRR9929266	39C_120min	39C_120min_GSE135568	Treatment
WT2_0min-SRR9929264_quant	GSE135568	SRR9929264	39C_0min	39C_0min_GSE135568	Control
WT2_20min-SRR9929271_quant	GSE135568	SRR9929271	39C_20min	39C_20min_GSE135568	Treatment
WT2_120min-SRR9929272_quant	GSE135568	SRR9929272	39C_120min	39C_120min_GSE135568	Treatment
WT3_0min-SRR9929273_quant	GSE135568	SRR9929273	39C_0min	39C_0min_GSE135568	Control
WT3_20min-SRR9929274_quant	GSE135568	SRR9929274	39C_20min	39C_20min_GSE135568	Treatment
WT3_120min-SRR9929279_quant	GSE135568	SRR9929279	39C_120min	39C_120min_GSE135568	Treatment
WT4_0min-SRR9929282_quant	GSE135568	SRR9929282	39C_0min	39C_0min_GSE135568	Control
WT4_20min-SRR9929280_quant	GSE135568	SRR9929280	39C_20min	39C_20min_GSE135568	Treatment
WT4_120min-SRR9929281_quant	GSE135568	SRR9929281	39C_120min	39C_120min_GSE135568	Treatment
WT1_0min-SRR12454580_quant	GSE148166	SRR12454580	42C_0min	42C_0min_GSE148166	Control
WT1_16min-SRR12454584_quant	GSE148166	SRR12454584	42C_16min	42C_16min_GSE148166	Treatment
WT2_0min-SRR12454581_quant	GSE148166	SRR12454581	42C_0min	42C_0min_GSE148166	Control
WT2_16min-SRR12454585_quant	GSE148166	SRR12454585	42C_16min	42C_16min_GSE148166	Treatment
WT3_0min-SRR12449208_quant	GSE148166	SRR12449208	42C_0min	42C_0min_GSE148166	Control
WT3_16min-SRR12449209_quant	GSE148166	SRR12449209	42C_16min	42C_16min_GSE148166	Treatment

Interactions DESeq2 Design • 7.4k views

ADD COMMENT • link updated 3.3 years ago by Kevin Blighe ★ 3.9k • written 3.3 years ago by vanbelj ▴ 30

score 1 · Answer 1 · 2020-12-30

1

Entering edit mode

Kevin Blighe ★ 3.9k

@kevin

Last seen 11 days ago

Republic of Ireland

Just to be sure, you obtained raw data and then quantified read counts using Salmon? Batch effects will likely still exist between the datasets, I think. You may want to consider adding dataset to your design formula so that any effect of dataset is adjusted, i.e., using ~ Condition + dataset. There are likely greater / 'grander' issues pertaining to merging datasets in this way, which are as yet not interpretable to me from where I am sitting.

Generally, though, why not just perform pairwise comparisons, as I show here: https://www.biostars.org/p/325009/#325036

I see no immediate need for an interaction design formula.

Essentially, for each dataset I want to compare the treated condition(s) to the control.

If this is all that you want to do, then I would process each dataset independently and then meta-analyse the p-values.

Kevin

ADD COMMENT • link 3.3 years ago Kevin Blighe ★ 3.9k

0

Entering edit mode

Thank you for taking the time to respond.

Just to be sure, you obtained raw data and then quantified read counts using Salmon?

Yes, I downloaded raw data from NCBI's Gene Expression Omnibus and then quantified read counts using Salmon.

You may want to consider adding dataset to your design formula so that any effect of dataset is adjusted, i.e., using ~ Condition + dataset

The dataset "identifier" is the GSE column. Each dataset was performed by a different lab and has its own treatment and control group. However, the experimental conditions are reported as being the same (or similar). I want to identify genes that are up-regulated in all of the datasets.

It was my understanding that, when all samples are processed into a DESeqDataSet (dds) together, DESeq is capable of correcting for batch-to-batch variability (see footnote)? I am essentially treating each independent dataset as a batch, so was hoping that DESeq would correct for these differences. Further, processing the datasets together in one dds should prevent genes with low counts in one dataset from being in/excluded, if that same gene in another dataset has a count above the user-set threshold.

That is my rationale for processing together. Note that I intend only to make treatment vs. control comparisons within a dataset. The resulting "hit-lists", one from each dataset, will be compared to identify conserved hits, i.e. experiments performed by Lab 1, Lab 2, and Lab 3 all found <set of genes> to be up-regulated under these conditions.

I may have misunderstood these 'parameters' within DESeq, or I could be misapplying them. Any feedback is appreciated.

From the 2nd Note under Alternative Shrinkage Estimator section in DESeq2 Vignette:

"Note: If there is unwanted variation present in the data (e.g. batch effects) it is always recommend to correct for this, which can be accommodated in DESeq2 by including in the design any known batch variables..."

ADD REPLY • link 3.3 years ago vanbelj ▴ 30

1

Entering edit mode

I see. I still think that processing each study separately (separate DESeq2 analyses) would be better, followed by a meta-analysis of the p-values.

When we want to adjust for a batch effect, we would usually have a design formula of the form: ~ Condition + batch

I am not sure that the way in which you are running this experiment, whereby batch is 'absorbed' into the Condition variable, is correcting for batch in the way that you think.

You can try both ways and see what you get.

ADD REPLY • link 3.3 years ago Kevin Blighe ★ 3.9k