Entering edit mode
Dear Francesco,
I think that you may be mis-judging the nature of a linear model with
multiple factors and covariates.
Limma computes log-fold-changes by fitting a linear model and
extracting
the estimated coefficients. The interpretation of each coefficient is
that it represents the change in the reponse (log-expression in this
case)
for each unit change in the covariate, assuming that all other
covariates
in the model can be held constant.
For example, the coefficient for alpha in your hypothetical model is
the
change in the log-expression of the gene that would occur if alpha was
changed (from 0 to 1) but all other factors in the model were held
constant.
This is not the same as fitting a simple model with just alpha and no
other factors. The coefficient that you would get from the simple
model
with just alpha will not be the same as the coefficient you would get
for
alpha from the full model. In the simple model, the other factors are
not
held constant. It would easily be that the full model could give a
negative coefficient but the simple model gives a positive
coefficient.
This occurs when there are other factors in the model that are
positively
correlated with alpha but have the opposite effect on expression.
When you make a simple summary of log-expression values for wild-type
KRAS
vs. mutated, you are looking at the marginal effect of KRAS when all
other
factors are allowed to vary as well. In other words, you are
examining
the simple model with just KRAS as a factor. There is no reason that
this
should be the same as the logFC for KRAS when all other factors have
been
adjusted for. The simple effect could easily be driven by factors
other
than KRAS.
You might find it helpful to consult a textbook on multiple regression
that discusses the subtleties of unbalanced anova or correlated
covariates.
Best wishes
Gordon
> Date: Mon, 14 Apr 2014 06:22:45 -0700 (PDT)
> From: "Francesco Gatto [guest]" <guest at="" bioconductor.org="">
> To: bioconductor at r-project.org, gatto at chalmers.se
> Subject: [BioC] Interpretation of fold.changes for covariates in a
> linear model using voom
>
> I have implemented a linear model to fit RNAseq read counts for
~1000
> samples using ~200 covariates plus the intercept. The pipeline
follows
> the example in the user manual, i.e. v=voom(y,design) ->
> fit=lmFit(v,design) -> fit2=eBayes(fit). Then I am using
decideTests()
> to assess the effect of each covariate on gene expression. However,
the
> output is less complete than topTable(), so I used this function to
> obtain values on fold-changes and individual p-values. At this
stage,
> values become difficult to interpret. If the LM is:
>
> gene_i,j = intercept_i + alfa_i*Factor1_j + beta_i*Factor2_j + (~200
other factors)
>
> and, to simplify, let's say covariates are binary. Then what do the
> fold-changes for topTable(fit2,coef="alfa") mean?
> My interpretation was the up/down-regulation due to the presence of
> covariate alfa, but if I boxplot or summarize the expression values
for
> the two groups (alfa = 1 vs alfa = 0) I get quite contradictory
results.
> For example this is the most DE gene when the sample has a mutation
in
> KRAS (the covariate, 0 if wild-type, 1 if mutated):
>
>> topTable(fit2,coef="metadataKRAS",number=1)
>
> hgnc_symbol logFC AveExpr t P.Value adj.P.Val
B
> GOLGA6A -1.314213 -5.9000494 -7.375175 3.800569e-13 7.204358e-09
18.80961
>
> The summary of gene expression values for wild-type KRAS vs. mutated
is:
>
>> by(E["GOLGA6A",],design[,"metadataKRAS"],summary)
>
> design[, "metadataKRAS"]: 0
> Min. 1st Qu. Median Mean 3rd Qu. Max.
> -7.882 -6.865 -6.461 -6.021 -5.770 1.548
> ----------------------------------------------------------------
> design[, "metadataKRAS"]: 1
> Min. 1st Qu. Median Mean 3rd Qu. Max.
> -7.657 -6.247 -5.487 -5.239 -4.582 -1.311
>
> This is an up-regulation, and of magnitude way lower than the one
> indicated by topTable. Is this because I am misusing the function
> topTable? I cannot wrap my head around this. Thanks for your help!
>
> -- output of sessionInfo():
>
> R version 3.0.2 (2013-09-25)
> Platform: x86_64-apple-darwin10.8.0 (64-bit)
>
> locale:
> [1] sv_SE.UTF-8/sv_SE.UTF-8/sv_SE.UTF-8/C/sv_SE.UTF-8/sv_SE.UTF-8
>
> attached base packages:
> [1] stats graphics grDevices utils datasets methods base
>
> other attached packages:
> [1] edgeR_3.4.2 limma_3.18.9
>
> loaded via a namespace (and not attached):
> [1] tools_3.0.2
______________________________________________________________________
The information in this email is confidential and
intend...{{dropped:4}}