DEseq2 : Difference between results and plotCounts
1
0
Entering edit mode
@thomasderrien-20244
Last seen 4.0 years ago

Hi all,

First post here :-)

I'm analyzing Differentially Expressed Genes (DEG) in 2 conditions (H = Healthy vs T = Tumor) using DEseq2 (version DESeq2_1.22.2).

When checking/visualising significant DEG with the plotCounts function, I came across one gene with a weird behavior.

The gene, called RLOC_00033176, has a log2FoldChange >5 ( i.e. more expressed in T condition versus H):

>resMF["RLOC_00033176",]
log2 fold change (MLE): condition T vs H
Wald test p-value: condition T vs H
DataFrame with 1 row and 6 columns
                      baseMean   log2FoldChange            lfcSE             stat              pvalue                 padj
                     <numeric>        <numeric>        <numeric>        <numeric>           <numeric>            <numeric>
RLOC_00033176 1882.05768131145 5.85706382402349 1.20253394435727 4.87060165869492 1.1125894014862e-06 0.000277571873094918

However, looking at the counts using plotCounts, I'm observing the opposite trend:

> d <-  plotCounts(ddsMF, gene="RLOC_00033176", intgroup=c("condition"), returnData=TRUE)
> d
                count condition
H01_GRET  6770.361876         H
H01_LABR  4832.668360         H
H02_GRET  6726.804182         H
H02_PODL  2876.324159         H
H03_GRET  7046.313634         H
H03_PODL  6282.789856         H
H04_PODL 12212.188122         H
H06_GRET  6721.764971         H
H06_PODL  6898.072519         H
H07_PODL 10049.535318         H
H08_PODL  6071.940173         H
H09_PODL  3283.395077         H
H12_PODL  8665.185250         H
T01_GRET   556.160217         T
T01_LABR     3.557992         T
T01_PODL     2.817237         T
T02_GRET     6.123579         T
T02_LABR     0.500000         T
T02_PODL  1957.434186         T
T03_GRET  1625.660425         T
T03_PODL  1893.038074         T
T04_GRET   349.213210         T
T04_PODL    14.740731         T
T05_LABR     2.141938         T
T05_PODL     2.564385         T
T06_GRET     2.218089         T
T06_LABR     3.884637         T
T06_PODL     0.500000         T
T07_GRET     1.739414         T
T07_PODL     1.494927         T
T08_GRET     1.656400         T
T08_LABR     1.386739         T
T08_PODL     8.495810         T
T09_GRET     4.069766         T
T09_LABR     1.417592         T
T09_PODL     1.479236         T
T10_LABR   419.258909         T
T11_GRET     2.374560         T
T11_LABR   182.475189         T
T12_GRET   666.680985         T
T12_LABR     3.949453         T
T12_PODL   380.173398         T
T13_GRET     9.495421         T
T13_LABR     0.500000         T
T14_GRET   467.288773         T
T15_GRET     1.578646         T
T15_LABR     0.500000         T
T16_GRET   851.942543         T
T17_GRET    22.565463         T
T17_LABR     1.738759         T
T18_LABR     2.339252         T
T19_GRET     0.500000         T

I was wondering how this could be the case?

Thanks

Thomas

PS : Note that for the others significant DEG I've checked, everything seems ok e.g:

> resMF["RLOC_00008433",]
log2 fold change (MLE): condition T vs H
Wald test p-value: condition T vs H
DataFrame with 1 row and 6 columns
                     baseMean   log2FoldChange             lfcSE             stat               pvalue                 padj
                    <numeric>        <numeric>         <numeric>        <numeric>            <numeric>            <numeric>
RLOC_00008433 21.793008121385 3.70792814522459 0.772645934994616 4.79900039239892 1.59459482173036e-06 0.000372157855974811
> d <-  plotCounts(ddsMF gene="RLOC_00008433", intgroup=c("condition"), returnData=TRUE)
> d
              count condition
H01_GRET   5.331474         H
H01_LABR   4.581223         H
H02_GRET   7.376440         H
H02_PODL   3.912766         H
H03_GRET   9.221608         H
H03_PODL   4.643991         H
H04_PODL   1.750685         H
H06_GRET   7.247129         H
H06_PODL   1.507386         H
H07_PODL   3.933220         H
H08_PODL   1.567037         H
H09_PODL   1.554576         H
H12_PODL   4.967254         H
T01_GRET  19.571021         T
T01_LABR  31.079923         T
T01_PODL  44.527503         T
T02_GRET  20.584212         T
T02_LABR  15.330119         T
T02_PODL 106.535597         T
T03_GRET  28.993733         T
T03_PODL  21.933047         T
T04_GRET  24.038142         T
T04_PODL  13.949579         T
T05_LABR  30.875846         T
T05_PODL  17.015079         T
T06_GRET  54.619789         T
T06_LABR  47.884916         T
T06_PODL  19.325809         T
T07_GRET  35.203591         T
T07_PODL  12.439128         T
T08_GRET  24.784397         T
T08_LABR  43.950211         T
T08_PODL   2.276847         T
T09_GRET  37.090099         T
T09_LABR  21.604619         T
T09_PODL   3.437709         T
T10_LABR  17.522720         T
T11_GRET  34.242083         T
T11_LABR  41.127019         T
T12_GRET  30.820331         T
T12_LABR  13.435447         T
T12_PODL  12.212290         T
T13_GRET  38.480664         T
T13_LABR  76.910348         T
T14_GRET  12.169719         T
T15_GRET  37.173954         T
T15_LABR  16.559524         T
T16_GRET  10.503894         T
T17_GRET  12.452126         T
T17_LABR  27.752692         T
T18_LABR  23.490647         T
T19_GRET  19.737259         T
deseq2 • 1.4k views
ADD COMMENT
1
Entering edit mode
@mikelove
Last seen 2 hours ago
United States

In your plotCounts() call it looks like you’ve got a different gene ID.

ADD COMMENT
0
Entering edit mode

Hi Michael,

Thank you for your reply and sorry for the mistake, I did the wrong copy/paste... I've edited the example with the correct gene RLOC_00033176.

ADD REPLY
0
Entering edit mode

What is assays(dds)[[“mu”]] for this gene

ADD REPLY
0
Entering edit mode

Here it is:

>  assays(ddsMF)[["mu"]]["RLOC_00033176",]
    H01_GRET     H01_LABR     H02_GRET     H02_PODL     H03_GRET     H03_PODL     H04_PODL     H06_GRET     H06_PODL     H07_PODL     H08_PODL     H09_PODL     H12_PODL     T01_GRET     T01_LABR     T01_PODL     T02_GRET     T02_LABR     T02_PODL     T03_GRET
 6508.486406  5320.294565  4950.829734  1322.219262 18644.100440 19182.061843  9865.395984  8863.600860  7638.339895  9712.159697 31190.873648  1495.360587  5127.154393   464.941596     1.810031    22.975882     6.522102     1.317781 12662.706380   839.837760
    T03_PODL     T04_GRET     T04_PODL     T05_LABR     T05_PODL     T06_GRET     T06_LABR     T06_PODL     T07_GRET     T07_PODL     T08_GRET     T08_LABR     T08_PODL     T09_GRET     T09_LABR     T09_PODL     T10_LABR     T11_GRET     T11_LABR     T12_GRET
 7000.753236    90.153555     3.666837     1.610363     4.602280     3.257818     6.652827     9.706972     5.404377     5.209136    12.756633     2.220584     5.402220     3.756431    11.482776     4.409466   362.995193     4.092847    50.963230  2625.616910
    T12_LABR     T12_PODL     T13_GRET     T13_LABR     T14_GRET     T15_GRET     T15_LABR     T16_GRET     T17_GRET     T17_LABR     T18_LABR     T19_GRET
    2.929922   277.027655    13.981462     2.477789   209.307263    10.552216     4.633799  1249.067901    28.271561     1.400615     2.016944     4.567548
ADD REPLY
0
Entering edit mode

From the estimated means I wonder: are you using a simple design or are you also controlling for covariates other than condition?

ADD REPLY
0
Entering edit mode

Since these are canine samples affected by melanomas, I'm using a multifactor design where I controlled for 3 covariates:

  • gender/sex : factorial with M/F/NotDefined ,
  • breed of the dog : factorial with LABR/GRET/PODL,
  • keratinocyte : continuous variable which estimates the % of this cell type. Although you recommended to cut into factors, I manually observed better results on known DEG using continuous values for this covariates.

Does it seem reasonable with you?

Here is the results of coef:

  >  coef(ddsMF)["RLOC_00033176",]
               Intercept          sexe_M_vs_F sexe_NotDefined_vs_F        Keratinocytes      breed_LABR_vs_GRET   breed_PODL_vs_GRET     condition_T_vs_H
           -5.8963299271         0.4609023690         0.4326127566         0.0041769811           -0.7794070091         0.3181122419         5.8570638240
ADD REPLY
0
Entering edit mode

Can you post all of your code? This will help us get to the answer faster. Generally you should always post your design also, because this helps the support site answer-ers know what is going on.

ADD REPLY
0
Entering edit mode

Sure, Michael! Thanks for your help!

Here is the whole code including design:

ddsMF<- DESeqDataSetFromMatrix(countData = dat.m, colData   = coldata_final,
+ design    = ~  sexe + Keratinocytes + breed + condition)
converting counts to integer mode
> ddsMF <- DESeq(ddsMF)
estimating size factors
estimating dispersions
gene-wise dispersion estimates
mean-dispersion relationship
final dispersion estimates
fitting model and testing
> resMF <- results(ddsMF)
> summary(resMF)

out of 29662 with nonzero total read count
adjusted p-value < 0.1
LFC > 0 (up)       : 1981, 6.7%
LFC < 0 (down)     : 2258, 7.6%
outliers [1]       : 0, 0%
low counts [2]     : 7389, 25%
(mean count < 1)
[1] see 'cooksCutoff' argument of ?results
[2] see 'independentFiltering' argument of ?results

> sum(resMF$padj < 0.05, na.rm=TRUE)
[1] 2850

And the coldata_final:

>coldata_final
         condition breed       sexe Keratinocytes
H01_GRET         H  GRET          M     4305.3576
H01_LABR         H  LABR          M     4393.1654
H02_GRET         H  GRET          F     4217.8689
H02_PODL         H  PODL          M     3602.7101
H03_GRET         H  GRET          M     4618.1448
H03_PODL         H  PODL          F     4838.2476
H04_PODL         H  PODL NotDefined     4471.6863
H06_GRET         H  GRET          F     4576.4286
H06_PODL         H  PODL          M     4221.5814
H07_PODL         H  PODL          M     4428.8287
H08_PODL         H  PODL NotDefined     4691.1405
H09_PODL         H  PODL          F     3826.2077
H12_PODL         H  PODL          F     4243.1659
T01_GRET         T  GRET          M     1942.6621
T01_LABR         T  LABR          M      240.5874
T01_PODL         T  PODL          M      926.4567
T02_GRET         T  GRET          F      546.2896
T02_LABR         T  LABR          M      111.4108
T02_PODL         T  PODL          M     3080.7421
T03_GRET         T  GRET          M     2177.2615
T03_PODL         T  PODL          F     2993.3280
T04_GRET         T  GRET          M     1364.2352
T04_PODL         T  PODL NotDefined      159.6799
T05_LABR         T  LABR          F      261.4427
T05_PODL         T  PODL          M      328.4182
T06_GRET         T  GRET          F      326.3872
T06_LABR         T  LABR          M      760.6346
T06_PODL         T  PODL          M      587.5078
T07_GRET         T  GRET NotDefined      530.1149
T07_PODL         T  PODL          M      358.2173
T08_GRET         T  GRET NotDefined      806.9621
T08_LABR         T  LABR NotDefined      283.4559
T08_PODL         T  PODL NotDefined      341.2837
T09_GRET         T  GRET NotDefined      295.3749
T09_LABR         T  LABR NotDefined      870.8780
T09_PODL         T  PODL          F      412.8279
T10_LABR         T  LABR NotDefined     2047.5735
T11_GRET         T  GRET NotDefined      333.1180
T11_LABR         T  LABR NotDefined     1365.1223
T12_GRET         T  GRET NotDefined     2552.8283
T12_LABR         T  LABR NotDefined      387.2007
T12_PODL         T  PODL          F     1845.2103
T13_GRET         T  GRET NotDefined      736.6193
T13_LABR         T  LABR NotDefined      434.5246
T14_GRET         T  GRET NotDefined     1685.3634
T15_GRET         T  GRET NotDefined      711.8128
T15_LABR         T  LABR NotDefined      545.2992
T16_GRET         T  GRET NotDefined     2379.2058
T17_GRET         T  GRET NotDefined     1001.1593
T17_LABR         T  LABR NotDefined      252.3405
T18_LABR         T  LABR NotDefined      285.7015
T19_GRET         T  GRET NotDefined      393.6673
ADD REPLY
0
Entering edit mode

I've found the problem. You cannot include Keratinocytes in this model as it stands. It is perfectly separable by condition, and although the software doesn't output an error here, it would if you had binarized the former because it would have recognized that their are perfectly confounded.

> table(x$Keratinocytes < 3300, x$condition)

         H  T
  FALSE 13  0
  TRUE   0 39

Take a look also at boxplot of Keratinocytes over condition. You need to either drop Keratinocytes from the model, or if you think it's critical to control for Keratinocytes while testing for condition effects, you need to design a new experiment where it is possible to control for this and estimate the T vs H difference.

ADD REPLY
0
Entering edit mode

Thank you Michael for the great help! I'm afraid it will be difficult to design a new experiment at this point, thus, I might just see only 2 alternatives:

1/ Factorize the Keratinocytes covariate in n classes (n>2) with cut (e.g. n=3):

coldata_final$kerQuant            <- cut(coldata_final$Keratinocytes, breaks=quantile(coldata_final$Keratinocytes, probs=seq(0,1, length  = 3), na.rm=TRUE),include.lowest=TRUE)

Unfortunately, this leads to a very high number of DEG with: sum(resMF$padj < 0.05, na.rm=TRUE) = 9490 as compared to the 2850 in the previous analysis...

2/ Finding a way to exclude aberrant DEG(s) such as the one previously mentioned.

Given your expertise, would you recommend any of these alternatives?

Thanks again, I really appreciate.

ADD REPLY
0
Entering edit mode

My position is that you cannot include this covariate in the model no matter how you cut or transform it. It is essentially perfectly confounded with condition. It will give misleading results to include the covariate along with condition.

ADD REPLY
0
Entering edit mode

Also what do you get for coef(dds) for this gene?

ADD REPLY

Login before adding your answer.

Traffic: 941 users visited in the last hour
Help About
FAQ
Access RSS
API
Stats

Use of this site constitutes acceptance of our User Agreement and Privacy Policy.

Powered by the version 2.3.6