Search
Question: Linear regression model in RNA seq
1
19 months ago by
Karl Fees10
Karl Fees10 wrote:

• I am working quite a while now on RNAseq data. Brief study design: We have human tissue biopsies. The tissues come from 4 different groups, which can be categorized due to their BMI. Hence, we have group 0, 1, 2, 3 for this tissue. I run different pipelines to find differential expressed genes among the groups. However, I now have an idea in mind which I want to test and looking for help. Let's assume it makes a lot of biological sense to find genes which follow a linear regression from group 0 -> group 3. Hence, I want to perform a linear regression for every gene to get the slope and p-value of each gene (accordingly the expression from gene increases or decreases constantly from group 0 to group 3). I know that I will e.g. not detect genes which might regulated in only one group but this not the aim of this analysis. Can somebody out there give me hint what would be the best way to do this? I was thinking of using the voom/limma approach. However, is there also another/better way or is the best approach to do so? Can I implant this question in DESeq2? Could I run my own glm on e.g. DESeq2 normalized samples?
• On my search I stumbled over a quite similar previous post (Question: DESEQ2 linear model of dose) where they advised to out the vector of interest into the sample information while building the DESeq data set. So I tried this:

sample_group <- sample
sample_group$groupe <-c(0,0,0,0,0,0,0,1,1,1,1,1,1,1,1,1,2,2,2,2,2,2,2,2,2,2,2,2,2,2,3,3,3,3,3,3) (sample_group$groupe <- factor(sample_group\$groupe , levels = c(0,1,2,3), labels = c(0,1,2,3)))
cds <- DESeqDataSetFromMatrix(counts_3, sample_group, design = ~ groupe )

cd_DESeq <- DESeq(cds)

results(cd_DESeq)

> results(cd_DESeq)
log2 fold change (MAP): groupe 3 vs 0
Wald test p-value: groupe 3 vs 0
DataFrame with 12889 rows and 6 columns
baseMean log2FoldChange      lfcSE        stat     pvalue      padj
<numeric>      <numeric>  <numeric>   <numeric>  <numeric> <numeric>
7SK        59.7498     0.25838015 0.13543938   1.9077181 0.05642766 0.3133326
A2M     13990.2247    -0.07047253 0.15876606  -0.4438765 0.65713185 0.8821618
A2M-AS1   189.2881     0.21297458 0.13393174   1.5901725 0.11179593 0.4326408

In this previous post, they then claimed that the log2FoldChange is the regression coefficient. However, I can still form different contrasts say

results(cd_DESeq, contrast = c("groupe", "3", "1"))

> results(cd_DESeq, contrast = c("groupe", "3", "1"))
log2 fold change (MAP): groupe 3 vs 1
Wald test p-value: groupe 3 vs 1
DataFrame with 12889 rows and 6 columns
baseMean log2FoldChange      lfcSE       stat     pvalue      padj
<numeric>      <numeric>  <numeric>  <numeric>  <numeric> <numeric>
7SK        59.7498     0.30781862 0.13015504  2.3650151 0.01802934 0.1655048
A2M     13990.2247     0.02960835 0.15492933  0.1911087 0.84844040 0.9526261
A2M-AS1   189.2881     0.31484488 0.12915912  2.4376513 0.01478303 0.1516203

So, this gives different log2FoldChanges. I would have expected that these changes would stay the same independent of the contrast. Or gives me the first table with 'Wald test p-value: groupe 3 vs 0' already the needed information I want? Hence, I get the regression coefficient from group 0 -> group 3 and in 'Wald test p-value: groupe 3 vs 1' I would then get the regression coefficient from group 1 -> 3 (thus excluding group 0 from the liner regression?)?

Many thanks for any help!

> sessionInfo()
R version 3.3.2 (2016-10-31)
Platform: x86_64-apple-darwin13.4.0 (64-bit)
Running under: macOS Sierra 10.12.3

locale:
[1] de_DE.UTF-8/de_DE.UTF-8/de_DE.UTF-8/C/de_DE.UTF-8/de_DE.UTF-8

attached base packages:
[1] parallel  stats4    stats     graphics  grDevices utils     datasets  methods   base

other attached packages:
[1] DESeq2_1.14.1              SummarizedExperiment_1.4.0 Biobase_2.34.0             GenomicRanges_1.26.3       GenomeInfoDb_1.10.3        IRanges_2.8.1              S4Vectors_0.12.1           BiocGenerics_0.20.0

loaded via a namespace (and not attached):
[1] Rcpp_0.12.9          RColorBrewer_1.1-2   plyr_1.8.4           XVector_0.14.0       bitops_1.0-6         tools_3.3.2          zlibbioc_1.20.0      digest_0.6.12        rpart_4.1-10         base64_2.0
[11] memoise_1.0.0        RSQLite_1.1-2        annotate_1.52.1      tibble_1.2           gtable_0.2.0         htmlTable_1.9        lattice_0.20-34      Matrix_1.2-8         DBI_0.5-1            gridExtra_2.2.1
[21] genefilter_1.56.0    stringr_1.2.0        knitr_1.15.1         cluster_2.0.5        locfit_1.5-9.1       grid_3.3.2           nnet_7.3-12          data.table_1.10.4    AnnotationDbi_1.36.2 XML_3.98-1.5
[31] survival_2.40-1      BiocParallel_1.8.1   foreign_0.8-67       latticeExtra_0.6-28  Formula_1.2-1        geneplotter_1.52.0   ggplot2_2.2.1        magrittr_1.5         htmltools_0.3.5      Hmisc_4.0-2
[41] scales_0.4.1         splines_3.3.2        assertthat_0.1       xtable_1.8-2         colorspace_1.3-2     stringi_1.1.2        acepack_1.4.1        RCurl_1.95-4.8       lazyeval_0.2.0       openssl_0.9.6
[51] munsell_0.4.3

modified 19 months ago • written 19 months ago by Karl Fees10

Many thanks for quick and absolutely helpful response.

Sorry for my first noobish question, but the contrasts you modeled can then be added within the limma pipeline right? I am not familiar with limma yet, but with your first model I should get for every gene a regression coefficient and p.value telling me whether I see an linear increased/decreased expression?

Sure, I am not very confident how (biological) meaningful the results will be, but I have a second analysis in mind for which I would like to have these data (and thus I just want to try it out).

However, I really like your second approach! I will definitely try this as well (and again this contrast can then be included in limma?).

For DESeq2, I understood it as you also explained it, that's why I was confused. Most likely I understood the initial question wrongly.

0
19 months ago by
Aaron Lun21k
Cambridge, United Kingdom
Aaron Lun21k wrote:

So you want to treat the BMI category as a covariate. This can be done with:

bmi <- c(0,0,0,0,0,0,0,1,1,1,1,1,1,1,1,1,2,2,2,2,2,2,2,2,2,2,2,2,2,2,3,3,3,3,3,3)
design <- model.matrix(~bmi)

But is this biologically sensible? Do you really expect that (log-)expression will increase in a straight line with your BMI categories? There is no reason to me that, for categories defined by arbitrary thresholds, e.g., a gene would increase by 2-fold in category 1 vs 0, by another 2-fold in category 2 vs 1, and so on. You have plenty of replicates so it is better to fit a one-way layout using the BMI category as a grouping factor:

fbmi <- factor(bmi)
design2 <- model.matrix(~0 + fbmi)
colnames(design2) <- levels(fbmi)

You can then do an ANOVA-like analysis to identify genes with DE between any groups:

con <- makeContrasts(fbmi1 - fbmi0, fbmi2 - fbmi1, fbmi3 - fbmi2, levels=design)

... and then look for significant genes where the signs of the log-fold changes are all the same across the three comparisons. This will identify genes that are monotonic increasing with respect to higher BMI categories, which is more flexible than trying to model expression on a straight line. Other approaches include getting the original BMI values themselves and fitting a spline to those.

As for your DESeq2 question, you're specifying groupe as a factor, so obviously you'll get different results when comparing different groups. If you were treating it as a covariate, you wouldn't even be able to compare different groups.

Many thanks for quick and absolutely helpful response.

Sorry for my first noobish question, but the contrasts you modeled can then be added within the limma pipeline right? I am not familiar with limma yet, but with your first model I should get for every gene a regression coefficient and p.value telling me whether I see an linear increased/decreased expression?

Sure, I am not very confident how (biological) meaningful the results will be, but I have a second analysis in mind for which I would like to have these data (and thus I just want to try it out).

However, I really like your second approach! I will definitely try this as well (and again this contrast can then be included in limma?).

For DESeq2, I understood it as you also explained it, that's why I was confused. Most likely I understood the initial question wrongly.

If you use the first model, the log-fold change will represent the gradient of log-expression with respect to the BMI category. What this actually means is not clear - the BMI category is unitless and not strictly quantitative, compared to other covariates like time (in which case the log-fold change would represent the rate of expression change over time). The p-value comes from testing the null hypothesis that the gradient is equal to zero.

Presumably, I do not get it right. BMI was also just an example and in essence we just have 4 groups in which the subjects were put. For these 4 groups we would like to perform a linear regression to find genes which progressively increase from one group to another. I am not sure whether this is covered by the design you mentioned before? Would it be absolutely inappropriate to use e.g. the lm() (or glm()) function on DESeq2 normalized counts (or other normalizations) for this purpose?

The problem is that you need to say how the genes increase in expression. If you use the first model, you're saying that the log-fold change from group 0 to 1 is the same as the log-fold change from group 1 to 2. I find it hard to believe that this would be true for many genes. While the direction of change may be the same, there is no reason that the actual log-fold change should be the same. This results in misspecification of the model and inflated variances.

Now, some misspecification might be okay if the categories actually meant something. But they don't, at least not in a quantitative sense. If you were using the BMI itself as a covariate, then you could at least say that the regression coefficient represents the log-fold change per unit of BMI. Here, it's the log-fold change per "unit of category", which doesn't make any sense, e.g., if you have categories 0 and 1, what does a BMI category of 0.5 mean?

This is even more pertinent if the cutoffs for the BMI categories are arbitrary. Let's say we have a person whose BMI is such that he/she just gets assigned into group 1 instead of group 0. Let's say we have another person whose BMI is such that he/she just gets assigned into group 0 instead of group 1. So, despite the fact that these two people have minuscule differences in their BMIs, the model will give them the same log-fold change as other pairs of people in groups 1 vs 0 with big differences in their BMIs. This renders any quantitative interpretation of the category meaningless.

Alright then. Thanks a lot for your patient and help! I think I got your point and this makes a lot of sense. We have to rethink the idea. However, the BMI example might had been confusing. We have 4 groups categorized to athletes, lean (no exercise), obese, and type 2 diabetics. We wanted to check whether there are genes which increase from athletes to type 2 diabetics or vice versa. We thought that there might have been some interesting genes. Thanks again!