Hi,
I have paired-end RNA-seq data from multiple samples, counted with featureCounts. I am now planning to use DESeq2 and am trying to set up the design. I have gone through the post "DESeq2 comparison with multiple cell types under 2 conditions". However, I would like to confirm whether my design is correct.
Here is the sample coldata:
              tissue  condition
sample1_WA1   WA1     Wild
sample2_WA2   WA2     Wild
sample3_WA3   WA3     Wild
sample4_WB1   WB1     Wild
sample5_WB2   WB2     Wild
sample6_WB3   WB3     Wild
sample7_WC1   WC1     Wild
sample8_WC2   WC2     Wild
sample9_WC3   WC3     Wild
sample10_MA1  MA1     Mutant
sample11_MA2  MA2     Mutant
sample12_MA3  MA3     Mutant
sample13_MB1  MB1     Mutant
sample14_MB2  MB2     Mutant
sample15_MB3  MB3     Mutant
sample16_MC1  MC1     Mutant
sample17_MC2  MC2     Mutant
sample18_MC3  MC3     Mutant
sample19_WE1  WE1     Wild
sample20_WE2  WE2     Wild
sample21_WE3  WE3     Wild
sample22_WD1  WD1     Wild
sample23_WD2  WD2     Wild
sample24_WD3  WD3     Wild
where A,B,C,D,E are five tissue types and D and E are from wild condition only. 1,2 and 3 are biological replicates.
I want to perform:
(i) comparison of differentially expressed genes between all tissue types in wild
(ii) comparison of differentially expressed genes between all tissue types in mutant
(iii) comparison of differentially expressed genes for all tissue types between wild and mutant
(iv) comparison of differentially expressed genes between D and E
How should I set up the design with replicates? Is this correct:
dds <- DESeqDataSetFromMatrix(countData = cts, colData = coldata, design = ~ tissue)
dds <- DESeq(dds)
estimating size factors
estimating dispersions
gene-wise dispersion estimates
mean-dispersion relationship
final dispersion estimates
fitting model and testing
Warning message:
In checkForExperimentalReplicates(object, modelMatrix) :
  same number of samples and coefficients to fit,
  estimating dispersion by treating samples as replicates.
  read the ?DESeq section on 'Experiments without replicates'
Please guide Michael Love
Thanks!
I'm guessing that e.g. WA1 is replicate one of WA. To get your analysis started, you'll at least need to separate that out, using e.g. tidyr::extract(coldata, tissue, c("Tissue","Replicate"), "(..)([123])"). Then you'll probably need to clarify what exactly you mean by "between all tissue types": do you want a gene list for each pair of tissues, or a single one that lists genes that have at least one tissue that is different from the others, or comparisons against a common 'baseline' tissue? (iii) is even more ambiguous: I think you'll only be able to use the common tissues, but do you want three lists of tissue-specific mutant vs wt, or genes that have a consistent mutant vs wt effect size across all tissues, or ...
You should also make your replication structure clear. Is there a 'batch' effect in that WA1 is closely related to WB1 (e.g. from the same individual - possible), and MB1 (unlikely from same individual, but maybe from 'batch 1'). Currently there's not really enough detail in your question to allow us to answer it.
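To illustrate the point about separating tissue from replicate, here is a minimal sketch in base R (rather than tidyr) that rebuilds the coldata from the sample labels in the question. The column names `id`, `replicate` and `group`, and the "Wild_A"-style group labels, are assumptions made for illustration. It also shows why the original design triggered the warning: with one `tissue` level per sample there are as many coefficients as samples.

```r
# Labels as listed in the question: <condition letter><tissue letter><replicate>
ids <- c(paste0("W", rep(c("A", "B", "C"), each = 3), 1:3),
         paste0("M", rep(c("A", "B", "C"), each = 3), 1:3),
         paste0("W", rep(c("E", "D"), each = 3), 1:3))

coldata <- data.frame(id = ids,
                      row.names = paste0("sample", seq_along(ids), "_", ids))

# Split each label into its three parts
coldata$condition <- factor(ifelse(substr(ids, 1, 1) == "W", "Wild", "Mutant"),
                            levels = c("Wild", "Mutant"))
coldata$tissue    <- factor(substr(ids, 2, 2))
coldata$replicate <- factor(substr(ids, 3, 3))

# One combined factor per condition/tissue combination (hypothetical naming)
coldata$group <- factor(paste0(coldata$condition, "_", coldata$tissue))

# The original 'tissue' column had 24 distinct values for 24 samples;
# 'group' has 8 levels with 3 replicates each
nlevels(factor(coldata$id))
table(coldata$group)

# With DESeq2 installed and 'cts' as in the question, one possible design:
# dds <- DESeqDataSetFromMatrix(countData = cts, colData = coldata, design = ~ group)
```

Whether `~ group` is the right design still depends on the answers to the questions above, for example whether replicate 1 of each tissue comes from the same individual.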
@Gavin Kelly Yes, you are right: WA1 is replicate 1 of WA.
A,B,C,D,E are five different tissues, while 1, 2 and 3 indicate biological replicates.
Would you please elaborate on how I can rearrange coldata to make it more suitable for DESeqDataSetFromMatrix?
This is the first time I am using DESeq2, and I am confused by the design step.
(i) First of all, I want a gene list and a clustered heatmap across all 24 samples at padj < 0.05, with gene names in the heatmap
(ii) Gene list and clustered heatmap (padj < 0.05) between all tissue types in wild (including the replicate information in the heatmap): WA (1,2,3) vs. WB (1,2,3) vs. WC (1,2,3)
(iii) Gene list and clustered heatmap (padj < 0.05) between all tissue types in mutant (including the replicate information in the heatmap): MA (1,2,3) vs. MB (1,2,3) vs. MC (1,2,3)
(iv) Gene list and clustered heatmap (padj < 0.05) between all tissue types in wild and mutant (including the replicate information in the heatmap): WA (1,2,3) vs. WB (1,2,3) vs. WC (1,2,3) vs. MA (1,2,3) vs. MB (1,2,3) vs. MC (1,2,3)
(v) Gene list and clustered heatmap (padj < 0.05) between D and E tissue from wild (including the replicate information in the heatmap): WD (1,2,3) vs. WE (1,2,3)
Michael Love
# I tried the following for (1):
# It showed me the different combinations but the ones I want are not there:
# Then I did:
# Still "group2_B_vs_C" and "group2_D_vs_E" are absent.
When you use ‘group’ you’re not restricted to what is in resultsNames(), just use ‘contrast’ to compare the levels of ‘group’ that you want to compare.
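As a rough sketch of that advice, the example below uses simulated counts from `makeExampleDESeqDataSet()` and a hypothetical `group` factor with labels taken from the sample names in the question (WA, WB, ...). It is not the poster's data or code; it only shows that a pair of levels missing from `resultsNames()` can still be compared through `contrast`.

```r
library(DESeq2)

set.seed(1)
# Simulated data: 500 genes, 24 samples (stand-in for the real count matrix)
dds <- makeExampleDESeqDataSet(n = 500, m = 24)

# Hypothetical grouping: 8 condition/tissue combinations, 3 replicates each
dds$group <- factor(rep(c("WA", "WB", "WC", "MA", "MB", "MC", "WE", "WD"),
                        each = 3))
design(dds) <- ~ group
dds <- DESeq(dds)

# Only comparisons against the reference level are listed here
resultsNames(dds)

# Any two levels of 'group' can still be compared directly
res_BC <- results(dds, contrast = c("group", "WB", "WC"), alpha = 0.05)
res_DE <- results(dds, contrast = c("group", "WD", "WE"), alpha = 0.05)
summary(res_BC)
```

In `contrast = c("group", "WB", "WC")` the second element is the numerator and the third the denominator of the log2 fold change.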
Thanks Michael Love ! I did:
After subsetting and ordering by padj, I would like to annotate the list of genes with Ensembl and then draw a heatmap, so I wanted a data frame of normalized counts. But it gave me a data frame that also includes the other samples, which I don't want:
Please guide.
Thanks.
I’d recommend getting some help from someone with R experience, or just follow code from our vignette or workflow. Here you never subset the counts matrix. And there are safer ways to go at this.
Vignette says:
which is for whole data.
I am confused about how I can get a data frame of normalized counts for only the samples belonging to the groups mentioned in the contrast. :(
That is one example of 'select', picking the top 20 genes by the mean count.
If you want a heatmap of the top genes by adjusted p-value in a results table, you could use:
You can then use 'idx' to subset the rows of any matrix or results table produced by DESeq2, e.g. counts(dds), counts(dds, normalized=TRUE), res, assay(vsd), etc. As long as you haven't yet re-ordered or subsetted those matrices/tables, then you can index each of them with the same vector 'idx'. I wouldn't use merge() here, it could lead to some bugs.
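A minimal sketch of this indexing idea, on simulated data from `makeExampleDESeqDataSet()`; the names `top_res` and `top_norm` and the choice of 20 genes are assumptions for illustration, not the code from the original reply.

```r
library(DESeq2)

set.seed(1)
# Simulated two-condition data set: 500 genes, 6 samples
dds <- makeExampleDESeqDataSet(n = 500, m = 6, betaSD = 1)
dds <- DESeq(dds)
res <- results(dds)

# Row positions of the 20 genes with the smallest adjusted p-value
# (order() places NA values last by default)
idx <- head(order(res$padj), 20)

# The same index vector works on every object that is still in the original row order
top_res  <- res[idx, ]
top_norm <- counts(dds, normalized = TRUE)[idx, ]

head(top_norm)
```

Because `idx` holds row positions rather than gene names, it is only valid while `res` and the count matrix keep their original order, which is the caveat stated above.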
Thanks Michael Love. I think I have been unable to make my query clear. I will explain with an example.
But when I use contrast, e.g.:
I got 70 genes for Awild versus Cwild groups, which are 6 samples in total.
Therefore, my desired normalized data frame should have only 6 samples; however, it shows all samples for the 70 genes.
This is what I am unable to figure out. :(
I have to make many other pairwise comparisons, and I have been stuck on the first one for many days.
So you need to subset the columns. In R you subset a matrix by columns using a paradigm: [,idx]
It might be good to look up an R reference for basic object manipulation. I've found "Quick R" to be a good one. (Search Google for "Quick R".)
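For illustration, here is a small base R sketch of the [,idx] paradigm on a toy matrix. The group labels and the row index are made up; with a DESeq2 object the same pattern would apply to `counts(dds, normalized=TRUE)`, using the relevant column of `colData(dds)` to pick the samples.

```r
set.seed(1)
# Toy matrix: 10 genes x 9 samples, three hypothetical groups of 3 replicates
groups <- rep(c("Wild_A", "Wild_B", "Wild_C"), each = 3)
mat <- matrix(rpois(10 * 9, lambda = 50), nrow = 10,
              dimnames = list(paste0("gene", 1:10),
                              paste0(groups, "_", 1:3)))

idx  <- c(2, 5, 7)                           # rows of interest, e.g. top genes by padj
keep <- groups %in% c("Wild_A", "Wild_C")    # one TRUE/FALSE per column

# Rows are selected before the comma, columns after it
sub <- mat[idx, keep]
dim(sub)
colnames(sub)
```

The contrast only determines which groups are compared in the results table; the columns of the count matrix have to be selected separately, as done here with `keep`.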
Thanks Michael Love. I am aware of that and was just wondering if inside DESeq2 there is some automated way while using contrast.
Anyways, thank you so much for answering my queries. :)