Hi all,
First post here :-)
I'm analyzing Differentially Expressed Genes (DEG) in 2 conditions (H = Healthy vs T = Tumor) using DEseq2 (version DESeq2_1.22.2
).
When checking/visualising significant DEG with the plotCounts
function, I came across one gene with a weird behavior.
The gene, called RLOC_00033176
, has a log2FoldChange >5 ( i.e. more expressed in T condition versus H):
>resMF["RLOC_00033176",]
log2 fold change (MLE): condition T vs H
Wald test p-value: condition T vs H
DataFrame with 1 row and 6 columns
baseMean log2FoldChange lfcSE stat pvalue padj
<numeric> <numeric> <numeric> <numeric> <numeric> <numeric>
RLOC_00033176 1882.05768131145 5.85706382402349 1.20253394435727 4.87060165869492 1.1125894014862e-06 0.000277571873094918
However, looking at the counts using plotCounts
, I'm observing the opposite trend:
> d <- plotCounts(ddsMF, gene="RLOC_00033176", intgroup=c("condition"), returnData=TRUE)
> d
count condition
H01_GRET 6770.361876 H
H01_LABR 4832.668360 H
H02_GRET 6726.804182 H
H02_PODL 2876.324159 H
H03_GRET 7046.313634 H
H03_PODL 6282.789856 H
H04_PODL 12212.188122 H
H06_GRET 6721.764971 H
H06_PODL 6898.072519 H
H07_PODL 10049.535318 H
H08_PODL 6071.940173 H
H09_PODL 3283.395077 H
H12_PODL 8665.185250 H
T01_GRET 556.160217 T
T01_LABR 3.557992 T
T01_PODL 2.817237 T
T02_GRET 6.123579 T
T02_LABR 0.500000 T
T02_PODL 1957.434186 T
T03_GRET 1625.660425 T
T03_PODL 1893.038074 T
T04_GRET 349.213210 T
T04_PODL 14.740731 T
T05_LABR 2.141938 T
T05_PODL 2.564385 T
T06_GRET 2.218089 T
T06_LABR 3.884637 T
T06_PODL 0.500000 T
T07_GRET 1.739414 T
T07_PODL 1.494927 T
T08_GRET 1.656400 T
T08_LABR 1.386739 T
T08_PODL 8.495810 T
T09_GRET 4.069766 T
T09_LABR 1.417592 T
T09_PODL 1.479236 T
T10_LABR 419.258909 T
T11_GRET 2.374560 T
T11_LABR 182.475189 T
T12_GRET 666.680985 T
T12_LABR 3.949453 T
T12_PODL 380.173398 T
T13_GRET 9.495421 T
T13_LABR 0.500000 T
T14_GRET 467.288773 T
T15_GRET 1.578646 T
T15_LABR 0.500000 T
T16_GRET 851.942543 T
T17_GRET 22.565463 T
T17_LABR 1.738759 T
T18_LABR 2.339252 T
T19_GRET 0.500000 T
I was wondering how this could be the case?
Thanks
Thomas
PS : Note that for the others significant DEG I've checked, everything seems ok e.g:
> resMF["RLOC_00008433",]
log2 fold change (MLE): condition T vs H
Wald test p-value: condition T vs H
DataFrame with 1 row and 6 columns
baseMean log2FoldChange lfcSE stat pvalue padj
<numeric> <numeric> <numeric> <numeric> <numeric> <numeric>
RLOC_00008433 21.793008121385 3.70792814522459 0.772645934994616 4.79900039239892 1.59459482173036e-06 0.000372157855974811
> d <- plotCounts(ddsMF gene="RLOC_00008433", intgroup=c("condition"), returnData=TRUE)
> d
count condition
H01_GRET 5.331474 H
H01_LABR 4.581223 H
H02_GRET 7.376440 H
H02_PODL 3.912766 H
H03_GRET 9.221608 H
H03_PODL 4.643991 H
H04_PODL 1.750685 H
H06_GRET 7.247129 H
H06_PODL 1.507386 H
H07_PODL 3.933220 H
H08_PODL 1.567037 H
H09_PODL 1.554576 H
H12_PODL 4.967254 H
T01_GRET 19.571021 T
T01_LABR 31.079923 T
T01_PODL 44.527503 T
T02_GRET 20.584212 T
T02_LABR 15.330119 T
T02_PODL 106.535597 T
T03_GRET 28.993733 T
T03_PODL 21.933047 T
T04_GRET 24.038142 T
T04_PODL 13.949579 T
T05_LABR 30.875846 T
T05_PODL 17.015079 T
T06_GRET 54.619789 T
T06_LABR 47.884916 T
T06_PODL 19.325809 T
T07_GRET 35.203591 T
T07_PODL 12.439128 T
T08_GRET 24.784397 T
T08_LABR 43.950211 T
T08_PODL 2.276847 T
T09_GRET 37.090099 T
T09_LABR 21.604619 T
T09_PODL 3.437709 T
T10_LABR 17.522720 T
T11_GRET 34.242083 T
T11_LABR 41.127019 T
T12_GRET 30.820331 T
T12_LABR 13.435447 T
T12_PODL 12.212290 T
T13_GRET 38.480664 T
T13_LABR 76.910348 T
T14_GRET 12.169719 T
T15_GRET 37.173954 T
T15_LABR 16.559524 T
T16_GRET 10.503894 T
T17_GRET 12.452126 T
T17_LABR 27.752692 T
T18_LABR 23.490647 T
T19_GRET 19.737259 T
Hi Michael,
Thank you for your reply and sorry for the mistake, I did the wrong copy/paste... I've edited the example with the correct gene
RLOC_00033176
.What is assays(dds)[[“mu”]] for this gene
Here it is:
From the estimated means I wonder: are you using a simple design or are you also controlling for covariates other than condition?
Since these are canine samples affected by melanomas, I'm using a multifactor design where I controlled for 3 covariates:
Does it seem reasonable with you?
Here is the results of coef:
Can you post all of your code? This will help us get to the answer faster. Generally you should always post your design also, because this helps the support site answer-ers know what is going on.
Sure, Michael! Thanks for your help!
Here is the whole code including design:
And the coldata_final:
I've found the problem. You cannot include
Keratinocytes
in this model as it stands. It is perfectly separable bycondition
, and although the software doesn't output an error here, it would if you had binarized the former because it would have recognized that their are perfectly confounded.Take a look also at boxplot of
Keratinocytes
overcondition
. You need to either dropKeratinocytes
from the model, or if you think it's critical to control forKeratinocytes
while testing for condition effects, you need to design a new experiment where it is possible to control for this and estimate the T vs H difference.Thank you Michael for the great help! I'm afraid it will be difficult to design a new experiment at this point, thus, I might just see only 2 alternatives:
1/ Factorize the Keratinocytes covariate in n classes (n>2) with
cut
(e.g. n=3):Unfortunately, this leads to a very high number of DEG with:
sum(resMF$padj < 0.05, na.rm=TRUE)
= 9490 as compared to the 2850 in the previous analysis...2/ Finding a way to exclude aberrant DEG(s) such as the one previously mentioned.
Given your expertise, would you recommend any of these alternatives?
Thanks again, I really appreciate.
My position is that you cannot include this covariate in the model no matter how you cut or transform it. It is essentially perfectly confounded with condition. It will give misleading results to include the covariate along with condition.
Also what do you get for coef(dds) for this gene?