Differential expression analysis: Continuous variables
Entering edit mode
Radek ▴ 80
Last seen 2.8 years ago


I am facing a tricky problem of experimental design that I would like to share with you. 

We are working with several groups of samples, each of them containing sorted cells of the same cell type. In groups A and B, we have a certain percentage of the cells which have been modified in different ways (=continuous value). Group C is a control group without any modification.

The goal of this experiment is to assess the impact of the modification on the cells by comparing A vs C and B vs C while controlling for the number of modified cells in each group.

In addition, we don't know if this modification has a linear impact on the cells: i.e. a continuous value of 1% could have 10 times less impact than a continuous value of 10% which could itself have 1000 times less impact than a 20% of cells.

The design table:

sample    group    continuous_value(%)

sample_a    A    35
sample_b    A    10
sample_c    B    1
sample_d    B    4
sample_e    C     0
sample_f    C    0

I have already tried two different approaches with DESeq2 to work on this dataset:

a. Use the continuous value as a numeric value in the design (with or without log2 transformation). I obtained some differentially expressed genes but have no idea how to say that it was the right thing to do. 

b. Transform the value into small bins. Unfortunately this did not work ("Error in DESeqDataSet(se, design = design, ignoreRank) : the model matrix is not full rank, so the model cannot be fit as specified.one or more variables or interaction terms in the design formula are linear combinations of the others and must be removed"). As you can see my controls are always at 0, my group B is quite low and the % in the group A is alway higher than the rest. Moreover, we currently don't have enough biological evidences to say: from this % to this one the cells can be fitted in the same box.

Since I don't understand enough all the mechanics behind linear model I would like to have some advices on this design. How can I compare my groups while controlling for this variable number of cells? (assuming or not that the % has a linear impact). 

Up to know I have used DESeq2 but I could also use other packages if they are more suitable for this kind of messy design.


Sorry for this long post,

Thanks in advance for your answers!


DESeq EdgeR limma • 5.7k views
Entering edit mode

You should be clear on whether that is your entire design or just a sample of six rows, because it's going to make a big difference exactly how many samples you have, as well as how many discrete values of the continuous covariate. Also, you should show the analysis code that your tried that produced your error, and in particular show the design your tried to use. Lastly, you should mention what hypotheses you wish to test. For instance, do you want to test for/are you expecting differences in the relationship between the continuous variable and expression in groups A & B?

Entering edit mode

A. The design showed here is only a sample of the full design. In reality there are four groups (A to C are modified and D is the Control). I wanted to keep that as simple as possible (A[n=4], B[n=5], C[n=3], D[n=3]). 

Samples    Group    Continous Variable    Bins
sample_1    A    6.63    E
sample_2    D    0    A
sample_3    B    3.66    D
sample_4    C    1.27    C
sample_5    D    0    A
sample_6    D    0    A
sample_7    B    1.74    C
sample_8    B    0.14    B
sample_9    B    0.58    B
sample_10    A    13.34    F
sample_11    C    3.23    D
sample_12    C    0.52    B
sample_13    A    7.49    E
sample_14    B    0.06    B
sample_15    A    41    F

B. Here is my code:

# The colData is the table above. I have tried more than 20 different combinations for the bins.

colData <- read.table("~/../colData.txt", sep="\t", header=T)
colData <- colData[,-3]
design <- ~ Bins + Group

# Table containing the number of reads calculated with FeatureCount
Matrix = Matrix
dds <- DESeqDataSetFromMatrix(Matrix,colData,design)

"Error in DESeqDataSet(se, design = design, ignoreRank) : 
  the model matrix is not full rank, so the model cannot be fit as specified.
  one or more variables or interaction terms in the design formula
  are linear combinations of the others and must be removed"

dds <- DESeq(dds)



C. I hope my answer is what you were expecting: 

I would like to test for differences between the modifications A and the B and all other combinations (AvsD, BvsC,...). I am not currently interested to know the impact of this continuous covariate. I just want to use it to normalise my number of reads the best way possible. 

I am expecting that 1% in group A will be equivalent to 1% in group B (this is an assumption). However, I am expecting that from 1 to 10% I won't have a 10 times relationship but something more complicated that I can't currently assess. 

Entering edit mode

The full-rank error message refers to the fact that your groups are confounded by your bins. Group D has the same samples as bin A, while group A is comprised of all samples from bins E and F. This means that any DE between groups cannot be distinguished from an uninteresting bin-related effect (i.e., bin and group coefficients are redundant, such that design does not have full column rank). In essence, the problem is the same as that mentioned in my answer below.

Entering edit mode

Yes I understand why I got this error but I am not sure to understand what it implicates. Could you correct me if I am wrong?: This clearly means that the bin solution cannot apply to this design since I won't be able to come up with bins that are different enough from the groups to be relevant. in the design But does this mean also that transformation of the numeric values as described by Aaron Lun won't be applicable?

Entering edit mode

A confounded experimental design is when two variables are identical and therefore redundant. If a bin is the same as a group, then it is impossible to tell whether any effect on those samples is due to the bin or the group, and this impossibility results in the rank-deficiency error. Aaron's solution using natural splines retains the continuous nature of the variable, which means that it won't be confounded with the discrete groups.

Entering edit mode
Aaron Lun ★ 27k
Last seen 27 minutes ago
The city by the bay

Hmm. The main problem with your set-up is that your ABC groupings are correlated with the continuous variable. In particular, group A has the highest variable values, while group B has lower values and group C has values of zero. Now, if you observe an increasing trend in expression from A to B to C, there's no way to know whether that's due to genuine DE between groups or whether it's being driven by the continuous variable (especially if you don't know how expression responds to changes in the continuous variable, i.e., the shape of the trend).

As a result, just putting in the continuous variable as a coefficient in the design matrix will probably be liberal. Significant deviations from the fitted line will be considered to represent DE between groups, but they may well be due to an uninteresting non-linear response in expression with respect to the continuous variable. If you want to be more conservative, you could do something like this:

group <- rep(c("A", "B", "C"), each=2)
cont.var <- c(35, 10, 4, 1, 0, 0)
X <- splines::ns(cont.var, df=2)
design <- model.matrix(~0 + group + X)

This fits a cubic spline to the expression values against changes in the continuous variables, which is a bit more flexible than the linear trend you would get from using cont.var directly in design. As a result of this flexibility, both the full and null models (when comparing groups) will fit equally well. This will reduce significance and avoid detecting spurious differences between groups that are driven by cont.var, at the cost of losing power to detect genuine DE.

Note that the choice of df=2 is simply because there's not enough points for more spline coefficients. Increasing df would saturate the model and preclude dispersion estimation. Ryan's suggestion is relevant here; if you have more samples, you should show them, because a more flexible setting of df may be appropriate.

You could also log-transform cont.var prior to using ns; this might provide better performance by making the spacing between points more similar, such that the outlier at 35 doesn't dominate the fit. Make sure you add a prior to avoid undefined values at zero, though.

Entering edit mode

Thanks, indeed I have more samples than those 6. I wanted to be as clear as possible in this long post and I lost some information. They have been added to Ryan's answer. 

I am afraid this is the first time I am seeing the spline function. I will have a look to what it means. Could you please try to explain me how you can choose the best value for the df parameter? Knowing that I have more samples.


Entering edit mode

If you read page 49 of the limma user's guide, it'll tell you that anything in the range of 3 - 5 is generally reasonable. This should provide enough flexibility to fit a smooth curve to the expression values against the covariate, yet not enough to result in overfitting and saturation of the model. In your case, you could probably go with df=5, with plenty of residual d.f. in the design left for dispersion estimation. It's difficult to be more quantitative as you'd have to evaluate the fitted spline for each gene separately (e.g., with cross-validation), which is a bother for several thousand genes.

Entering edit mode
Radek ▴ 80
Last seen 2.8 years ago

Thanks a lot to all of you! I will give a try to what you advised.

Entering edit mode
Radek ▴ 80
Last seen 2.8 years ago


I finished to test your advices. Unfortunately the MDS plots I have seen are really not great and I could not detect any gene differentially expressed after the applying some contrasts. Could you tell me if the code provided makes sens to you?

Moreover, I was a bit surprise by the results of the spline function who added not one but five columns to the design table (df=5). Could you tell me if this is expected? 


# Load the data (column information in colData and matrix of count in matrix)
colData <- read.table(..., sep="\t", header=T, row.names=1)
Matrix <- read.table(..., sep="\t", header=T, row.names=1)
Matrix <- DGEList(Matrix)

# Design matrix
group <- factor(colData$group)
cont.var <- splines::ns(as.vector(colData$cont.var), df=5)
design <- model.matrix(~0+group+cont.var)
colnames(design) <- c("A", "B", "C", "D", "df1", "df2", "df3", "df4", "df5")

  A B C D df1 df2 df3 df4 df5


0 0 0 1






2 1 0 0 0 0 0 0 0 0
3 0 0 1 0






4 0 1 0 0






# Normalization of the reads with TMM
​Matrix <- calcNormFactors(Matrix, method="TMM")

# Remove the entries with not enough information
isexpr <- rowSums(cpm(Matrix) > 10) >= 2
Matrix <- Matrix[isexpr,]

# Voom 
Matrix <- voom(Matrix, design, plot=TRUE)
par(mar=c(5.1, 4.1, 4.1, 8.1), xpd=TRUE)
plotMDS(Matrix, xlim=c(-2.5,2.5), col= c(1,3,2,4,3,3,2,2,2,1,4,4,1,2,1), pch=c(16,18,17,19,18,18,17,17,17,16,19,19,16,17,16), cex=1.2, xlab="Leading logFC dim 1", ylab= "Leading logFC dim 2")
legend("topright", legend=c("D", "C", "A", "B"), pch=c(16,17,18,19), cex=1.2, title= "Groups", col= c(1,2,3,4), inset=c(-0.25,-0,2,0))

# Fit the GLM model, the contrasts and get the results
fit <- lmFit(Matrix,design)
contrast.matrix <- makeContrasts(A-B, A-C, ... , levels = design)
fit <- contrasts.fit(fit, contrast.matrix)
fit.de <- eBayes(fit)
topTable(fit.de, adjust="BH",number=50, coef=1) 


Thanks in advance

Entering edit mode

Yes, the ns function should generate one column for each df you request. Each column represents a degree of freedom. Based on that MDS plot, I would expect to find at least a few significantly differentially expressed genes, since there are clearly some non-overlapping groups. What is your threshold of significance?

Entering edit mode

Those are my top DE genes for the comparison that really should show DE genes (Cells without any modification vs cells that I could say are "highly modified"). It does not make any biological sens that we don't find DE genes....

                        logFC   AveExpr         t      P.Value adj.P.Val         B
ENSXXXXXXXXXXXX -1.4390457  8.197525 -4.714812 0.0006753215 0.9996464 -4.231704
ENSXXXXXXXXXXXX  0.8422625  7.662963  3.824693 0.0029373108 0.9996464 -4.295244
ENSXXXXXXXXXXXX -2.4209089  7.120786 -3.734853 0.0034254871 0.9996464 -4.313392
ENSXXXXXXXXXXXX  1.8776982  5.896281  3.955525 0.0023518266 0.9996464 -4.331954
ENSXXXXXXXXXXXX -0.8859719  6.549924 -3.509476 0.0050557930 0.9996464 -4.339051

And results for the comparison group D vs C 

contrast.matrix <- makeContrasts(C-D, levels = design)

                        logFC  AveExpr         t      P.Value adj.P.Val           B
ENSXXXXXXXXXXXX  2.7806248 7.120786  5.151043 0.0003417341 0.5228458  0.05779204
ENSXXXXXXXXXXXX  1.4355024 4.651184  5.418768 0.0002280097 0.5228458 -0.23711058
ENSXXXXXXXXXXXX -1.6945085 4.920961 -5.141404 0.0003468148 0.5228458 -0.27847172
ENSXXXXXXXXXXXX  2.8111010 3.878004  5.256179 0.0002911711 0.5228458 -0.65622061
ENSXXXXXXXXXXXX  1.5989977 4.247807  5.093886 0.0003730616 0.5228458 -0.65657149

I find DE genes based on the p-val (<0.05) but not after adjustment. 




Entering edit mode

Have a look at this link and see if any of the issues apply to your data: http://varianceexplained.org/statistics/interpreting-pvalue-histogram/

Entering edit mode

Thanks for the link.

Here are the plots: (First one is the test contrast C-D and second is the one I expect to contain a lot of DE genes).

If I understand well, this kind of distribution would mean a statistical test not appropriate. Could you guide me to some tests I could run to delimit the problem? 

Thanks for your patience and your help.


Entering edit mode

You should use a larger bin size for your histograms. I generally use somewhere between 20 and 100 bins (bin width between 0.01 and 0.05). Regardless, these p-value plots are clearly depleted of small p-values relative to a uniform distribution, which does indeed indicate an issue with the test. One possibility is that that there is some unexplained source of variation that is not accounted for by your model. Or perhaps your continuous variable is too highly correlated with your groups and is absorbing all the real inter-group differences, leaving the real contrasts with nothing. I would try running the model with only the group variable, and I would try running sva on the data. I've seen good results in the past with using sva to resolve over-conservative p-value distributions.

Entering edit mode

Thanks for this suggestion.

I tried to run the sva package but without success. 

Do you have any idea how to solve it? Or should I open a new support post?


# load the expression dataframe and the phenotype informations
Matrix <- read.table(paste(Matrix_link), sep='\t', header=T, row.names=1)
colData <- read.table(paste(colData_link), sep="\t", header=T, row.names=1)
# Create the two models.matrix objects
mod <- model.matrix(~as.factor(group), data=colData)
mod0 <- model.matrix(~1, data=colData)
# Subset against the genes with lowest counts
isexpr <- rowSums(cpm(Matrix) > 10) >= 2
Matrix <- Matrix[isexpr,]
edata <- as.matrix(Matrix)
# Apply the sva function
n.sv = num.sv(edata,mod,method="leek")

svobj = sva(edata,mod,mod0,n.sv=n.sv)

Number of significant surrogate variables is:  11 
Iteration (out of 5 ):
Erreur dans density.default(x, adjust = adj) : 'x' contains missing values
De plus : Message d'avis :
In pf(fstats, df1 = (df1 - df0), df2 = (n - df1)) : production de NaN


I thought it could be a problem with my matrix of count and I tried to solve the problem by removing the rows containing some 0 but it did not change anything.


Entering edit mode

It's not clear, but it looks like you might have run sva on the raw counts. You're supposed to run it on the logCPM, which you can get from the cpm function.

Entering edit mode

Lack of detection is, perhaps, not unexpected; see my comments above about loss of power with your experimental design, given that your group-wise differences are moderately confounded by the continuous variables. Nonetheless, the p-value distribution should still be roughly uniform under the null hypothesis (and also under the alternative, due to the issues discussed, so you mightn't see a spike at low p-values). One cause of the skewed distribution would be a batch effect, so you could try Ryan's approach to try to resolve this. However, I don't know how reliable the final results would be, given how many confounding (and hidden) factors are floating around in the data.


Login before adding your answer.

Traffic: 254 users visited in the last hour
Help About
Access RSS

Use of this site constitutes acceptance of our User Agreement and Privacy Policy.

Powered by the version 2.3.6