Question

Results table completely wrong for specified contrast of

0

Entering edit mode

A ▴ 40

@a-14337

Last seen 5 months ago

United Kingdom

Hi all,

I was wondering if anybody has experienced this issue or if someone can chime in on a very frustrating problem I am having. I think I have not specified my groups correctly given the count data and just the fact that with normalized counts I am getting completely different log fold changes for results as well as base mean results.

The issue I am having is as follows: I already have some data that suggests what rough fold-changes are and absolutely for sure what the counts are in samples versus others (i.e control vs tumour).

I have run the raw counts through DEseq2 to generate some differential expression analysis but the results table is a little confusing and I am worried that I am not specifying the contrast correctly. my code is as follows:

all(row.names(sampledataALL$SampleName)==colnames(countsALL)) [1] TRUE

dds<-DESeqDataSetFromMatrix(countData = countsALL, colData = SampledataALL, design = ~ Type.tumour_normal+Age+Genotype)

colData(dds)... All samples line up (colnames of counts with rownames of metadata

dds$groups <- factor(paste0(dds$Genotype, dds$Model, dds$Age)) design(dds) <- ~ groups dds <- DESeq(dds, parallel = TRUE) resultsNames(dds)

i.e: groupsNormalE16.5vs_knockoutE16.5

I then specify the specific contrast I would like to see results for: Res<-results(dds, contrast=c("groups","WTE16.5","knockoutE16.5"), independentFiltering = F)

I then get a results table of the (apparently correct contrast):

log2 fold change (MLE): groups NormalE16.5 vs knockoutE16.5 Wald test p-value: groups groups NormalE16.5 vs knockoutE16.5 DataFrame with 32075 rows and 6 columns

Now here is the problem: Looking at the expression counts for a gene of interest: ENSMUSG00000014704

plotCounts(dds, gene = "ENSMUSG00000014704", intgroup = c("Age", "Genotype"), normalized = TRUE, returnData = T)

normal16.5: 480.821761 E16.5
204.041659
343.403058
226.762363
415.469461 knockout:7.0, 163.53, 20.4 and 124+366... (written our quickly not pasted like for normal e16.5) However what is really really clear is the average counts of normal are much higher than for the knockout... Now here is the issue... The results table shows the following for this gene:

ENSMUSG00000014704 64.34613 1.297509 0.9962282 1.30242 0.1927723 1

Firstly, the basemean is way lower than the mean of counts for this gene across the samples in the contrast of interest (should be around 230)... Is there a reason for this?. Secondly, the second column of 1.29 is the logfold change.. (ignoring the significance) please may I just confirm that this is 1.29 fold higher in the normal sample compared to knockout and not the other way around? If it 1.29 fold higher in normal then this makes sense! But just to be sure that it is alwayts with respect to the first contrast.. so for example, any log fold change means that much log fold change in normal compared to knockout.

Sorry for the long post but any clarification would be much appreciated!

Many thanks!

DESeq2 logfoldchange counts • 1.4k views

ADD COMMENT • link updated 5.3 years ago by Michael Love 42k • written 5.3 years ago by A ▴ 40

Martin Morgan · Answer 1 · 2019-01-21

0

Entering edit mode

Michael Love 42k

@mikelove

Last seen 1 hour ago

United States

Base mean is the mean of normalized counts over all samples.

Secondly, your design is not simply ~genotype but has other variables. The LFC takes these into account, not just looking marginally at the counts over genotype.

ADD COMMENT • link 5.3 years ago Michael Love 42k

0

Entering edit mode

Thank you for clarifying! So just so I'm clear .. is this 1.2 fold increase in the first contrast though? So 1.2 fold increased in the normal sample vs knockout?

Secondly, when specifying a contrast if the same age group, would this not mitigate the effects of age specified in the contrast as these are of the same age?

Many thanks!

ADD REPLY • link 5.3 years ago A ▴ 40

0

Entering edit mode

If you specify contrast=c("condition","B","A"), and obtain LFC = 1.2, this means a 2^1.2 fold increase in B compared to A, controlling for other covariates. The second element (here, B) is the numerator of the fold change and the third element (here, A) is the denominator. Our workflow has a longer description of the results table and how to interpret:

https://bioconductor.org/packages/release/workflows/vignettes/rnaseqGene/inst/doc/rnaseqGene.html#building-the-results-table

I don't know how you've added age other than as a covariate in the design. Did you also put age in the genotype variable? What does colData(dds) look like?

ADD REPLY • link 5.3 years ago Michael Love 42k

0

Entering edit mode

Thank you so much for that clarification and the link, that is perfectly clear!

With regards to your second question.. This is a developmental time course experiment across different genotypes. However as well simpyl doing a time series analyses, we would also like to carry out pairwise comparisons... so while we have normal data for e16.5 going to p30, we also have the same data for that time course but for knockout genotypes. But with the pairwise comparisons, I would also like to e16.5 normal, vs e16.5 knockout genotype as a pairwise comparison alone and with respect to the design, I am not sure it is that significant apart from the genotype variable? I created the $groups design just so that i could bind these variables together as otherwise the deseq runs would not allow me to carry out the right comparisons...

the colData look as follows: head(colData(dds))

DataFrame with 6 rows and 8 columns
         SampleName            SampleType
           <factor>              <factor>
1   1 WT_x_E16d5
2  2 WT_x_E16d5
3  3WT_x_E16d5
4   3 WT_x_E16d5
5  4WT_x_E16d5
6       5      WT_x_P5
         Genotype Type.tumour_normal
         <factor>           <factor>
      WT         Cerebellum
       WT         Cerebellum
       WT         Cerebellum
     WT         Cerebellum
       WT         Cerebellum
         WT         Cerebellum
              Age   Replicate      Model

Please note i have removed the samples names and just numbered them 1-5

Will this make much of a difference to the result?

the $group column concatenates the factors listed above so they can easily be accessed for pairwise comparisons...

ADD REPLY • link updated 5.3 years ago by Martin Morgan 25k • written 5.3 years ago by A ▴ 40

0

Entering edit mode

You can perform the pairwise comparisons most easily with the group approach, yes.

Is there a remaining question then?

ADD REPLY • link 5.3 years ago Michael Love 42k

0

Entering edit mode

Thank you so so much for confirming and for clarifying this. No further questions! I can now analyse the data! Thank yo for your time :)

ADD REPLY • link 5.3 years ago A ▴ 40