Hi all,
I was wondering if anybody has experienced this issue or if someone can chime in on a very frustrating problem I am having. I think I have not specified my groups correctly given the count data and just the fact that with normalized counts I am getting completely different log fold changes for results as well as base mean results.
The issue I am having is as follows: I already have some data that suggests what rough fold-changes are and absolutely for sure what the counts are in samples versus others (i.e control vs tumour).
I have run the raw counts through DEseq2 to generate some differential expression analysis but the results table is a little confusing and I am worried that I am not specifying the contrast correctly. my code is as follows:
all(row.names(sampledataALL$SampleName)==colnames(countsALL)) [1] TRUE
dds<-DESeqDataSetFromMatrix(countData = countsALL, colData = SampledataALL, design = ~ Type.tumour_normal+Age+Genotype)
colData(dds)... All samples line up (colnames of counts with rownames of metadata
dds$groups <- factor(paste0(dds$Genotype, dds$Model, dds$Age)) design(dds) <- ~ groups dds <- DESeq(dds, parallel = TRUE) resultsNames(dds)
i.e: groupsNormalE16.5vs_knockoutE16.5
I then specify the specific contrast I would like to see results for: Res<-results(dds, contrast=c("groups","WTE16.5","knockoutE16.5"), independentFiltering = F)
I then get a results table of the (apparently correct contrast):
log2 fold change (MLE): groups NormalE16.5 vs knockoutE16.5 Wald test p-value: groups groups NormalE16.5 vs knockoutE16.5 DataFrame with 32075 rows and 6 columns
Now here is the problem: Looking at the expression counts for a gene of interest: ENSMUSG00000014704
plotCounts(dds, gene = "ENSMUSG00000014704", intgroup = c("Age", "Genotype"), normalized = TRUE, returnData = T)
normal16.5: 480.821761 E16.5
204.041659
343.403058
226.762363
415.469461
knockout:7.0, 163.53, 20.4 and 124+366... (written our quickly not pasted like for normal e16.5) However what is really really clear is the average counts of normal are much higher than for the knockout...
Now here is the issue... The results table shows the following for this gene:
ENSMUSG00000014704 64.34613 1.297509 0.9962282 1.30242 0.1927723 1
Firstly, the basemean is way lower than the mean of counts for this gene across the samples in the contrast of interest (should be around 230)... Is there a reason for this?. Secondly, the second column of 1.29 is the logfold change.. (ignoring the significance) please may I just confirm that this is 1.29 fold higher in the normal sample compared to knockout and not the other way around? If it 1.29 fold higher in normal then this makes sense! But just to be sure that it is alwayts with respect to the first contrast.. so for example, any log fold change means that much log fold change in normal compared to knockout.
Sorry for the long post but any clarification would be much appreciated!
Many thanks!
Thank you for clarifying! So just so I'm clear .. is this 1.2 fold increase in the first contrast though? So 1.2 fold increased in the normal sample vs knockout?
Secondly, when specifying a contrast if the same age group, would this not mitigate the effects of age specified in the contrast as these are of the same age?
Many thanks!
If you specify
contrast=c("condition","B","A")
, and obtain LFC = 1.2, this means a 2^1.2 fold increase in B compared to A, controlling for other covariates. The second element (here, B) is the numerator of the fold change and the third element (here, A) is the denominator. Our workflow has a longer description of the results table and how to interpret:https://bioconductor.org/packages/release/workflows/vignettes/rnaseqGene/inst/doc/rnaseqGene.html#building-the-results-table
I don't know how you've added age other than as a covariate in the design. Did you also put age in the genotype variable? What does
colData(dds)
look like?Thank you so much for that clarification and the link, that is perfectly clear!
With regards to your second question.. This is a developmental time course experiment across different genotypes. However as well simpyl doing a time series analyses, we would also like to carry out pairwise comparisons... so while we have normal data for e16.5 going to p30, we also have the same data for that time course but for knockout genotypes. But with the pairwise comparisons, I would also like to e16.5 normal, vs e16.5 knockout genotype as a pairwise comparison alone and with respect to the design, I am not sure it is that significant apart from the genotype variable? I created the $groups design just so that i could bind these variables together as otherwise the deseq runs would not allow me to carry out the right comparisons...
the colData look as follows: head(colData(dds))
Please note i have removed the samples names and just numbered them 1-5
Will this make much of a difference to the result?
the $group column concatenates the factors listed above so they can easily be accessed for pairwise comparisons...
You can perform the pairwise comparisons most easily with the group approach, yes.
Is there a remaining question then?
Thank you so so much for confirming and for clarifying this. No further questions! I can now analyse the data! Thank yo for your time :)