Question: Simultaneously adding two different covariates in the model gives error in DESeq2.
0
gravatar for fizer
2.7 years ago by
fizer30
US
fizer30 wrote:

Hello,

I have rnaseq data generated from brain tissues of 100 individuals, half of them were affected by a disease. All the samples were mix of two different brain regions, obtained from 4 different providers. I am trying to run DESeq2 to identify 
differentially expressed genes using the model like: ~age + sex + tissue + PMI + condition, where PMI is post-mortem interval. My colData looks like this:

                 condition age    sex    tissue    PMI     RIN    brain_bank
    A016-93        disease  79   M FCBA89  48.00 6.400000        LND
    A032-95        disease  71   F FCBA89  65.00 6.550000        LND
    A046-94        disease  70   F FCBA89 100.00 5.400000        LND
    AN00216        disease  63   M FCBA89  26.16 6.150000        KBC
    AN02738        disease  85   F FCBA89  12.80 6.200000        KBC
    SD001-07       disease  37   F     IF  46.00 6.200000        MII
    .
    .
    .
    AN03019        control  47   F FCBA89  31.85 5.300000        KBC
    SD004-10       control  50   M     IF  63.00 5.400000        MII
    A046-91       control  41   M     IF  49.00 5.450000        LND

Here condition, sex, tissue, brain_bank are as factors and age, RIN, PMI are numeric variables. The RIN values are averaged values because all the samples were sequenced in duplicates. 

I want to include brain_bank as a covariate in the model but when I this: ~age + sex + tissue + brain_bank + PMI + condition, it gives me the following error:

    the design formula contains a numeric variable with integer values,
     specifying a model with increasing fold change for higher values.
     did you mean for this to be a factor? if so, first convert
     this variable to a factor using the factor() function
    Show Traceback

    Rerun with Debug
    Error in checkFullRank(modelMatrix) : 
     the model matrix is not full rank, so the model cannot be fit as specified.
     One or more variables or interaction terms in the design formula are linear
     combinations of the others and must be removed.

I have seen this error several times but being novice in statistics I am never able to understand this error. If anyone can suggest a good reading material about it, would be great. 


Anyways following are my questions:

1) Is there some way to incorporate brain_bank as a covariate in the analysis or I should simply go for the first model?

2) Should I use mean RIN values in the model to account for the effect? Any other suggestions?

3) Not sure if this should asked as a separate question. I also want to run DEXseq in order to find diff. exon usage between cases and controls. So can I use the following model:

    ~sample + exon + exon:condition + age:exon + sex:exon + tissue:exon + PMI:exon  ## Full formula 
    ~sample + exon + age:exon + sex:exon + tissue:exon + PMI:exon   ## Reduced formula

 

 


Thanks in advance for helping.

ADD COMMENTlink modified 2.7 years ago by Michael Love25k • written 2.7 years ago by fizer30
1

"The model matrix is not full rank" means that brain_bank is not linearly independent from your other variables, so there is no way for the fitting to estimate the brain_bank effect if you also want to include all your other variables - it can be tricky to see why exactly this is, but from the excerpt of the data you provide, it looks like MII is only providing 'IF' tissues, so if this holds for the rest of the colData, then there is no way of knowing if the correlation is due to that bank or to the tissue.  You'll have to omit the brain_bank variable: this means you're bundling up the brain_bank variability into the noise, but you may be susceptible to the error of claiming genes being differential due to tissue type, when they are actually partially due to systematic differences between MII and other brain_banks (unlikely, in my experience - you have multiple banks contributing IF samples, so the main concern is the slight increase in residual noise due to not correcting for bank, but that could well be negligible).

Similarly, inclusion of other terms such as RIN is part of the craft of model-building.  Excluding it means that you might miss biologically relevant genes whose expression have a linear dependency on RIN (due to the increased residual noise); including the term means you might miss a gene with a spurious correlation with RIN but a real correlation with biology.  It's hard to give generic advice, but if there's strong correlation between technical variables such as RIN and biological variables, then you need to think much harder about what's going on.  If not, then inclusion/exclusion of the technical variables should make little practical difference (but if it does, then again, you'll need to think hard, and perhaps examine the behaviour of a few key genes that depend heavily on the choice of model).

 

ADD REPLYlink written 2.7 years ago by Gavin Kelly560
Answer: Simultaneously adding two different covariates in the model gives error in DESeq
1
gravatar for Michael Love
2.7 years ago by
Michael Love25k
United States
Michael Love25k wrote:

For the DEXSeq question, I'd recommend a new post tagged with dexseq, just because there is a lot to parse here.

I can't tell from the information you've included which are the covariates that are colinear with the others in your model. If you're not aware of the problem of colinearity, you can read the section of the vignette that talks about this (if you are using the current version of DESeq2 it will point you to this section in the error). I'd also recommend you consult with a local statistician about which models you can fit and which not (there is nothing unique about DESeq2 in this context, you can just suppose that you are performing a simple linear model with gene expression data from a single gene).

The message about numeric covariates is not an error, just a message. We want users to be careful about using numeric covariates, and to make sure they know what they are doing. When you put in a numeric covariates in a model (+ RIN, or + PMI) you are making a strong assumption that the gene expression has constant multiplicative increases or decreases along units of that numeric covariate. Read the FAQ in the DESeq2 vignette about numeric covariates. So saturation or U-shaped trends are not possible to fit. In the vignette I suggest that a safer procedure is to bin the numeric covariate into meaningful intervals, which will then be fit it as a factor. This allows more flexibility. I don't have lots of experience with using + RIN in the model, but my instinct would be to just remove samples which have very low RIN, and not using RIN in the model. 

ADD COMMENTlink written 2.7 years ago by Michael Love25k

I met with the same problems, trying to model RIN and PMI into my model. It seems that some people would regress these factors (using Combat for example) before doing a DEG analysis. But there are also paper mention that the best way is to model all factors into the design matrix and do DEG analysis directly(In this case, you need to factorize the RIN and PMI values, how to ?). Do you have any update for this kind of questions? Michael?

Thanks&Regards,

Raymond

ADD REPLYlink written 11 months ago by Raymond0

You should add covariates to the design. They can be continuous. This is the standard for DESeq2, providing original counts but using combination of offsets and estimated coefficients to control for technical factors. I don’t have any empirically motivated position about adding RIN. I don’t do this myself but I know some groups do.

ADD REPLYlink written 11 months ago by Michael Love25k
Please log in to add an answer.

Help
Access

Use of this site constitutes acceptance of our User Agreement and Privacy Policy.
Powered by Biostar version 16.09
Traffic: 265 users visited in the last hour