I want to compare publicly available RNA-Seq datasets. My end goal is to obtain a set of genes that are collectively up/down regulated across all datasets or in a subset of Treatments. Can I use a master sample table containing all datasets and use Interactions to direct desired comparisons, if the datasets contain a differing number of replicates and different treatments?
These datasets contain a differing number of biological replicates (2 to 4) and different treatment conditions (>5). About half of them have treatment time points, but the time points between datasets are different. I have quantified the Gene-Counts with Salmon.
I am not fluent in R, and so I originally intended to create a new sample table for each comparison that I wanted to make. However this quickly became unfeasible as I tried to manually rename each variable for the new comparison.
Next I tried creating a "master" sample table with the intention of subsetting the samples for comparison (a shortened version of the table is below). I intended to use Interactions, similar to what is describe for genotype comparisons in the Vignette. However, I either get the incorrect comparisons or the Model matrix not full rank error.
Essentially, for each dataset I want to compare the treated condition(s) to the control. Is this possible when the datasets have varying treatment conditions and a varying number of replicates? Or will I have to create independent sample tables for each comparison?
Thanks for any help!
#First I make a dds against all groups (grouped by biological replicates only)
ddsTxi <- DESeqDataSetFromTximport(txi,
colData = samples,
design = ~ Condition)
#Then I try to define Interactions
ddsTxi$group <- factor(paste0(ddsTxi$GSE, ddsTxi$Group))
design(ddsTxi) <- ~ group
ddsTxi <- DESeq(ddsTxi)
results(ddsTxi, contrast=c("group", "Control", "Treatment"))
This code results in comparison to other datasets, but doesn't actually compare Control to Treatment within a dataset (using my full sample table)
> resultsNames(ddsTxi)
[1] "Intercept" "group_GSE127851Treatment_vs_GSE127851Control"
[3] "group_GSE135568Control_vs_GSE127851Control" "group_GSE135568Treatment_vs_GSE127851Control"
[5] "group_GSE140504Control_vs_GSE127851Control" "group_GSE140504Treatment_vs_GSE127851Control"
[7] "group_GSE148166Control_vs_GSE127851Control" "group_GSE148166Treatment_vs_GSE127851Control"
[9] "group_GSE152916Control_vs_GSE127851Control" "group_GSE152916Treatment_vs_GSE127851Control"
[11] "group_GSE80512Control_vs_GSE127851Control" "group_GSE80512Treatment_vs_GSE127851Control"
[13] "group_GSE94495Control_vs_GSE127851Control" "group_GSE94495Treatment_vs_GSE127851Control"
[15] "group_GSE98352Control_vs_GSE127851Control" "group_GSE98352Treatment_vs_GSE127851Control"
Example Sample Table
Name | GSE | SRA_Experiment | Treatment | Condition | Group |
---|---|---|---|---|---|
WT1_0min-SRR9929263_quant | GSE135568 | SRR9929263 | 39C_0min | 39C_0min_GSE135568 | Control |
WT1_20min-SRR9929265_quant | GSE135568 | SRR9929265 | 39C_20min | 39C_20min_GSE135568 | Treatment |
WT1_120min-SRR9929266_quant | GSE135568 | SRR9929266 | 39C_120min | 39C_120min_GSE135568 | Treatment |
WT2_0min-SRR9929264_quant | GSE135568 | SRR9929264 | 39C_0min | 39C_0min_GSE135568 | Control |
WT2_20min-SRR9929271_quant | GSE135568 | SRR9929271 | 39C_20min | 39C_20min_GSE135568 | Treatment |
WT2_120min-SRR9929272_quant | GSE135568 | SRR9929272 | 39C_120min | 39C_120min_GSE135568 | Treatment |
WT3_0min-SRR9929273_quant | GSE135568 | SRR9929273 | 39C_0min | 39C_0min_GSE135568 | Control |
WT3_20min-SRR9929274_quant | GSE135568 | SRR9929274 | 39C_20min | 39C_20min_GSE135568 | Treatment |
WT3_120min-SRR9929279_quant | GSE135568 | SRR9929279 | 39C_120min | 39C_120min_GSE135568 | Treatment |
WT4_0min-SRR9929282_quant | GSE135568 | SRR9929282 | 39C_0min | 39C_0min_GSE135568 | Control |
WT4_20min-SRR9929280_quant | GSE135568 | SRR9929280 | 39C_20min | 39C_20min_GSE135568 | Treatment |
WT4_120min-SRR9929281_quant | GSE135568 | SRR9929281 | 39C_120min | 39C_120min_GSE135568 | Treatment |
WT1_0min-SRR12454580_quant | GSE148166 | SRR12454580 | 42C_0min | 42C_0min_GSE148166 | Control |
WT1_16min-SRR12454584_quant | GSE148166 | SRR12454584 | 42C_16min | 42C_16min_GSE148166 | Treatment |
WT2_0min-SRR12454581_quant | GSE148166 | SRR12454581 | 42C_0min | 42C_0min_GSE148166 | Control |
WT2_16min-SRR12454585_quant | GSE148166 | SRR12454585 | 42C_16min | 42C_16min_GSE148166 | Treatment |
WT3_0min-SRR12449208_quant | GSE148166 | SRR12449208 | 42C_0min | 42C_0min_GSE148166 | Control |
WT3_16min-SRR12449209_quant | GSE148166 | SRR12449209 | 42C_16min | 42C_16min_GSE148166 | Treatment |
Thank you for taking the time to respond.
Yes, I downloaded raw data from NCBI's Gene Expression Omnibus and then quantified read counts using Salmon.
The dataset "identifier" is the GSE column. Each dataset was performed by a different lab and has its own treatment and control group. However, the experimental conditions are reported as being the same (or similar). I want to identify genes that are up-regulated in all of the datasets.
It was my understanding that, when all samples are processed into a DESeqDataSet (dds) together, DESeq is capable of correcting for batch-to-batch variability (see footnote)? I am essentially treating each independent dataset as a batch, so was hoping that DESeq would correct for these differences. Further, processing the datasets together in one dds should prevent genes with low counts in one dataset from being in/excluded, if that same gene in another dataset has a count above the user-set threshold.
That is my rationale for processing together. Note that I intend only to make treatment vs. control comparisons within a dataset. The resulting "hit-lists", one from each dataset, will be compared to identify conserved hits, i.e. experiments performed by Lab 1, Lab 2, and Lab 3 all found <set of genes> to be up-regulated under these conditions.
I may have misunderstood these 'parameters' within DESeq, or I could be misapplying them. Any feedback is appreciated.
From the 2nd Note under Alternative Shrinkage Estimator section in DESeq2 Vignette:
I see. I still think that processing each study separately (separate DESeq2 analyses) would be better, followed by a meta-analysis of the p-values.
When we want to adjust for a batch effect, we would usually have a design formula of the form:
~ Condition + batch
I am not sure that the way in which you are running this experiment, whereby batch is 'absorbed' into the
Condition
variable, is correcting for batch in the way that you think.You can try both ways and see what you get.