DESeq2: Multiple Comparisons with Different Conditions and Replicates
0
Entering edit mode
vanbelj • 10
@vanbelj-21216
Last seen 4 days ago

I want to compare publicly available RNA-Seq datasets. My end goal is to obtain a set of genes that are collectively up/down regulated across all datasets or in a subset of Treatments. Can I use a master sample table containing all datasets and use Interactions to direct desired comparisons, if the datasets contain a differing number of replicates and different treatments?

These datasets contain a differing number of biological replicates (2 to 4) and different treatment conditions (>5). About half of them have treatment time points, but the time points between datasets are different. I have quantified the Gene-Counts with Salmon.

I am not fluent in R, and so I originally intended to create a new sample table for each comparison that I wanted to make. However this quickly became unfeasible as I tried to manually rename each variable for the new comparison.

Next I tried creating a "master" sample table with the intention of subsetting the samples for comparison (a shortened version of the table is below). I intended to use Interactions, similar to what is describe for genotype comparisons in the Vignette. However, I either get the incorrect comparisons or the Model matrix not full rank error.

Essentially, for each dataset I want to compare the treated condition(s) to the control. Is this possible when the datasets have varying treatment conditions and a varying number of replicates? Or will I have to create independent sample tables for each comparison?

Thanks for any help!

#First I make a dds against all groups (grouped by biological replicates only)
ddsTxi <- DESeqDataSetFromTximport(txi,
colData = samples,
design = ~ Condition)

#Then I try to define Interactions
ddsTxi$group <- factor(paste0(ddsTxi$GSE, ddsTxi\$Group))
design(ddsTxi) <- ~ group
ddsTxi <- DESeq(ddsTxi)
results(ddsTxi, contrast=c("group", "Control", "Treatment"))


This code results in comparison to other datasets, but doesn't actually compare Control to Treatment within a dataset (using my full sample table)

> resultsNames(ddsTxi)
[1] "Intercept"                                    "group_GSE127851Treatment_vs_GSE127851Control"
[3] "group_GSE135568Control_vs_GSE127851Control"   "group_GSE135568Treatment_vs_GSE127851Control"
[5] "group_GSE140504Control_vs_GSE127851Control"   "group_GSE140504Treatment_vs_GSE127851Control"
[7] "group_GSE148166Control_vs_GSE127851Control"   "group_GSE148166Treatment_vs_GSE127851Control"
[9] "group_GSE152916Control_vs_GSE127851Control"   "group_GSE152916Treatment_vs_GSE127851Control"
[11] "group_GSE80512Control_vs_GSE127851Control"    "group_GSE80512Treatment_vs_GSE127851Control"
[13] "group_GSE94495Control_vs_GSE127851Control"    "group_GSE94495Treatment_vs_GSE127851Control"
[15] "group_GSE98352Control_vs_GSE127851Control"    "group_GSE98352Treatment_vs_GSE127851Control"


Example Sample Table

Name GSE SRA_Experiment Treatment Condition Group
WT1_0min-SRR9929263_quant GSE135568 SRR9929263 39C_0min 39C_0min_GSE135568 Control
WT1_20min-SRR9929265_quant GSE135568 SRR9929265 39C_20min 39C_20min_GSE135568 Treatment
WT1_120min-SRR9929266_quant GSE135568 SRR9929266 39C_120min 39C_120min_GSE135568 Treatment
WT2_0min-SRR9929264_quant GSE135568 SRR9929264 39C_0min 39C_0min_GSE135568 Control
WT2_20min-SRR9929271_quant GSE135568 SRR9929271 39C_20min 39C_20min_GSE135568 Treatment
WT2_120min-SRR9929272_quant GSE135568 SRR9929272 39C_120min 39C_120min_GSE135568 Treatment
WT3_0min-SRR9929273_quant GSE135568 SRR9929273 39C_0min 39C_0min_GSE135568 Control
WT3_20min-SRR9929274_quant GSE135568 SRR9929274 39C_20min 39C_20min_GSE135568 Treatment
WT3_120min-SRR9929279_quant GSE135568 SRR9929279 39C_120min 39C_120min_GSE135568 Treatment
WT4_0min-SRR9929282_quant GSE135568 SRR9929282 39C_0min 39C_0min_GSE135568 Control
WT4_20min-SRR9929280_quant GSE135568 SRR9929280 39C_20min 39C_20min_GSE135568 Treatment
WT4_120min-SRR9929281_quant GSE135568 SRR9929281 39C_120min 39C_120min_GSE135568 Treatment
WT1_0min-SRR12454580_quant GSE148166 SRR12454580 42C_0min 42C_0min_GSE148166 Control
WT1_16min-SRR12454584_quant GSE148166 SRR12454584 42C_16min 42C_16min_GSE148166 Treatment
WT2_0min-SRR12454581_quant GSE148166 SRR12454581 42C_0min 42C_0min_GSE148166 Control
WT2_16min-SRR12454585_quant GSE148166 SRR12454585 42C_16min 42C_16min_GSE148166 Treatment
WT3_0min-SRR12449208_quant GSE148166 SRR12449208 42C_0min 42C_0min_GSE148166 Control
WT3_16min-SRR12449209_quant GSE148166 SRR12449209 42C_16min 42C_16min_GSE148166 Treatment
Interactions DESeq2 Design • 106 views
1
Entering edit mode
@u16406
Last seen 8 hours ago
Republic of Ireland

Just to be sure, you obtained raw data and then quantified read counts using Salmon? Batch effects will likely still exist between the datasets, I think. You may want to consider adding dataset to your design formula so that any effect of dataset is adjusted, i.e., using ~ Condition + dataset. There are likely greater / 'grander' issues pertaining to merging datasets in this way, which are as yet not interpretable to me from where I am sitting.

Generally, though, why not just perform pairwise comparisons, as I show here: https://www.biostars.org/p/325009/#325036

I see no immediate need for an interaction design formula.

Essentially, for each dataset I want to compare the treated condition(s) to the control.

If this is all that you want to do, then I would process each dataset independently and then meta-analyse the p-values.

Kevin

0
Entering edit mode

Thank you for taking the time to respond.

Just to be sure, you obtained raw data and then quantified read counts using Salmon?

Yes, I downloaded raw data from NCBI's Gene Expression Omnibus and then quantified read counts using Salmon.

You may want to consider adding dataset to your design formula so that any effect of dataset is adjusted, i.e., using ~ Condition + dataset

The dataset "identifier" is the GSE column. Each dataset was performed by a different lab and has its own treatment and control group. However, the experimental conditions are reported as being the same (or similar). I want to identify genes that are up-regulated in all of the datasets.

It was my understanding that, when all samples are processed into a DESeqDataSet (dds) together, DESeq is capable of correcting for batch-to-batch variability (see footnote)? I am essentially treating each independent dataset as a batch, so was hoping that DESeq would correct for these differences. Further, processing the datasets together in one dds should prevent genes with low counts in one dataset from being in/excluded, if that same gene in another dataset has a count above the user-set threshold.

That is my rationale for processing together. Note that I intend only to make treatment vs. control comparisons within a dataset. The resulting "hit-lists", one from each dataset, will be compared to identify conserved hits, i.e. experiments performed by Lab 1, Lab 2, and Lab 3 all found <set of genes> to be up-regulated under these conditions.

I may have misunderstood these 'parameters' within DESeq, or I could be misapplying them. Any feedback is appreciated.

From the 2nd Note under Alternative Shrinkage Estimator section in DESeq2 Vignette:

"Note: If there is unwanted variation present in the data (e.g. batch effects) it is always recommend to correct for this, which can be accommodated in DESeq2 by including in the design any known batch variables..."

1
Entering edit mode

I see. I still think that processing each study separately (separate DESeq2 analyses) would be better, followed by a meta-analysis of the p-values.

When we want to adjust for a batch effect, we would usually have a design formula of the form: ~ Condition + batch

I am not sure that the way in which you are running this experiment, whereby batch is 'absorbed' into the Condition variable, is correcting for batch in the way that you think.

You can try both ways and see what you get.