Question

RNAseq paired-sample differantial expression analysis with many samples

0

Entering edit mode

Luca • 0

@lucapiacentini-9597

Last seen 6 months ago

Italy

Hello, I have to perform a differential expression analysis (RNAseq) for a condition of interest (condition) for 160 paired samples (subjects). I have followed the classic design of ~subjects+condition. However, using both edgeR and DESeq2 I get the error message "the model matrix is not full rank" so some coefficients cannot be estimated and therefore I cannot get a result. Is there any way to overcome this problem? Thank you in advance!

DESeq2 edgeR • 2.0k views

ADD COMMENT • link 15 months ago • updated 14 months ago Luca • 0

0

Entering edit mode

It is a common error that you may find the reason why by googling it. Condition and subjects variables are nested. You cannot perform such linear model with your study design. If you show your study design, maybe we could help you to choose the best statistical model to apply.

ADD REPLY • link 15 months ago Basti ▴ 780

0

Entering edit mode

OK thank you! Below is an example of the experimental design, including a multi-level factor ("condition", 5 levels: A, B, C, D and E) and the variable "time" expressed as "before" and "after". Is it possible to define the "after-before" contrast for any combination of "condition" (e.g. after-before in AA couples, or "after-before" in CB couples, etc.) for the variable "condition" considering that it is paired data (160 paired samples)? I hope I have made the problem sufficiently clear. Thanks again.

Design:

sample_ID   time condition
1  sample_01 before         A
2  sample_01  after         A
3  sample_02 before         B
4  sample_02  after         C
5  sample_03 before         B
6  sample_03  after         B
7  sample_04 before         C
8  sample_04  after         C
9  sample_05 before         C
10 sample_05  after         C
11 sample_06 before         D
12 sample_06  after         D
13 sample_07 before         A
14 sample_07  after         A
15 sample_08 before         B
16 sample_08  after         C
17 sample_09 before         D
18 sample_09  after         C
19 sample_10 before         E
20 sample_10  after         D
21 sample_11 before         E
22 sample_11  after         D
...

ADD REPLY • link 15 months ago Luca • 0

score 0 · Answer 1 · 2024-01-08

0

Entering edit mode

Gordon Smyth 52k

@gordon-smyth

Last seen 10 hours ago

WEHI, Melbourne, Australia

The reason why you have an error is that you have attempted to treat a between-patient factor (condition) as if it was a within-patient factor. You could have used ~patient+time but not ~patient+condition. The correct analysis approach is outlined in the edgeR User's Guide in the Section on between and within subject factors, although the example in the User's Guide is simpler than your data.

It is certainly possible to define the "after-before" contrast separately for each combination of before-after conditions. Before I show you how to do that, I would appreciate a bit more explanation about your experimental design. What does sample-id correspond to? To patient? To couple? If have never before seen a paired experimental design where the before and after samples for the same patient have different conditions. How is that possible? How many condition combinations are there (AA, BB, BC etc) and how many instances of each?

Give the large number of RNA-seq samples, I recommend that you analyse this experiment in limma rather than in edgeR or DESeq2. limma will not only be much faster, it would also give you more flexibility for things like quality weights, which are often required for human samples. The new edgeR v4 QL pipeline is also a possibility but will still not be as fast as limma.

ADD COMMENT • link 15 months ago • updated 14 months ago Gordon Smyth 52k

0

Entering edit mode

Dear Gordon Smith, thank you first of all for your reply. I have designed models for paired data before but here the design is a bit more complicated due to the variables involved and the large number of samples. For the latter reason I was already planning to rewrite the code for limma-voom because both DESeq2 and edgeR, even with a simpler design, require extremely long computational times. But still the design of the model poses some problems for me. However, I also understand that the presented experimental scheme is only a simplification of the actual dataset and therefore I will try to present it again as far as possible in a more understandable (I mean biologically understandable) way below by adding the requested clarifications. Now, "specimen" represents each individual sequenced sample; "patient" (corresponding to sample_ID in the previous example) indicates the pairing (same patient evaluated at baseline and at follow up); "time" is the period when the specimen was collected, i.e. at baseline or at follow up; and "disease_grade" is the different grades of a disease from better to worse (0 to 4). Some patients may have the same disease grade at both baseline and follow-up, while others may improve or worsen over time. So, I want to test the gene expression differences in patients who improve, worsen and do not change their disease grade. Obviously, I am not interested in all possible combinations of changes in disease grade at follow-up compared to baseline, which is effectively 25 combinations, but more realistically I need to test 8 different combinations, which corresponds to my real data. For example, I might have 18 patients (i.e. 36 paired samples) who change their disease grade from grade_1 (baseline) to grade_2 (follow_up), 17 patients (i.e. 34 paired samples) who change their disease grade from grade_2 (baseline) to grade_3 (follow_up), and 19 patients (38 paired samples) who improve their disease grade by having grade_3 at baseline and grade_2 at follow_up... and so on. I hope I have not bored you too much with this experimental problem, but I would really appreciate any feedback on this issue. Thanks.

      specimen    patient      time disease_grade
1   specimen_1 patient_01  baseline       grade_0
2   specimen_2 patient_01 follow_up       grade_0
3   specimen_3 patient_02  baseline       grade_1
4   specimen_4 patient_02 follow_up       grade_2
5   specimen_5 patient_03  baseline       grade_1
6   specimen_6 patient_03 follow_up       grade_1
7   specimen_7 patient_04  baseline       grade_2
8   specimen_8 patient_04 follow_up       grade_2
9   specimen_9 patient_05  baseline       grade_2
10 specimen_10 patient_05 follow_up       grade_2
11 specimen_11 patient_06  baseline       grade_3
12 specimen_12 patient_06 follow_up       grade_3
13 specimen_13 patient_07  baseline       grade_0
14 specimen_14 patient_07 follow_up       grade_0
15 specimen_15 patient_08  baseline       grade_1
16 specimen_16 patient_08 follow_up       grade_2
17 specimen_17 patient_09  baseline       grade_3
18 specimen_18 patient_09 follow_up       grade_2
19 specimen_19 patient_10  baseline       grade_4
20 specimen_20 patient_10 follow_up       grade_3
21 specimen_21 patient_11  baseline       grade_4
22 specimen_22 patient_11 follow_up       grade_3
...

ADD REPLY • link 15 months ago Luca • 0

3

Entering edit mode

First, you need to determine all the grade combinations, which can be done by:

nsamples <- length(disease_grade)
GradeComb <- as.character(disease_grade)
dim(GradeComb) <- c(2,nsamples/2)
GradeComb <- apply(GradeComb,2,function(x) paste(x,collapse="."))
GradeComb <- factor(rep(GradeComb,each=2))

Then make the design matrix:

design.patients <- model.matrix(~0+patient)
colnames(design.patients) <- levels(patient)
design.grades <- model.matrix(~0+GradeComb)
colnames(design.grades) <- levels(GradeComb)
design <- cbind(design.patients, design.grades * (time=="follow_up"))

The non-patient coefficients of the design matrix will then correspond to before-after effects for each of the grade combinations. The code may look strange, but the design matrix is exactly the same as if you formed the standard paired comparison design matrix for each grade combination separately and then merged the columns into one matrix.

The DE analysis can be done quickly in limma. In limma, the minimal pipeline would be (assuming y to be an unfiltered DGEList object):

keep <- filterByExpr(y, group=GradeComb)
yfilt <- y[keep,, keep.lib.sizes=FALSE]
yfilt <- normLibSizes(yfilt)
fit <- voomLmFit(yfilt, design)
fit <- eBayes(fit, robust=TRUE)

Then

topTable(fit, coef="grade_0.grade.0")

finds genes are the DE at the followup for patients with grade 0 at specimens at both times.

Using edgeR v4 (the latest version that was part of the Bioconductor release last October), the minimum pipeline would be

keep <- filterByExpr(y, group=GradeComb)
yfilt <- y[keep,, keep.lib.sizes=FALSE]
yfilt <- normLibSizes(yfilt)
qfit <- glmQLFit(yfilt, design, legacy=FALSE)
qtest <- glmQLFTest(qfit, coef="grade_0.grade_0")
topTags(qtest)

Note that you do not need to run estimateDisp in the edgeR v4 QL pipeline. Setting legacy=TRUE would return to the edgeR v3 pipeline which does require estimateDisp, and would be much slower.

ADD REPLY • link 15 months ago • updated 14 months ago Gordon Smyth 52k

0

Entering edit mode

Dear Gordon Smith,

thank you very much for your comprehensive and detailed reply.

I think I have figured out the "trick" used here, and thanks to your suggestions everything now seems to work correctly. The only thing I would like to report to you as a user is that, unlike limma-voom, even with the latest version of edgeR, the function for glmQLFit still runs quite slowly for a large sample size. However, this is not a problem for me at the moment, I have received valuable hints on how to answer my initial question. I will now work around this model to see if I need to include other variables, e.g. known or unknown confounders, analyse the results and find the biological sense.

Thanks again!

ADD REPLY • link 15 months ago Luca • 0

1

Entering edit mode

To try a scenario similar to your data, I simulated a dataset with 160 samples and 12000 genes and a design matrix with 91 columns. The edgeR v4 code shown above took a little less than 8 minutes on my laptop. The glmQLFit step took 6.5 minutes.

ADD REPLY • link 14 months ago Gordon Smyth 52k

0

Entering edit mode

Thanks Gordon for the clarification.

In my case I had to work on a dataset with >26000 genes for 160 samples and a design with 169 columns and I was working on a common esa core CPU (2.40 GHz, 16 GB RAM) laptop. This is probably the reason why glmQLFit took longer for me.

Anyway, thanks for the new tip.

ADD REPLY • link 14 months ago Luca • 0

0

Entering edit mode

Sorry, I was wrong to say that edgeR v4 would be very fast on your dataset. That would have been true had your experiment been a oneway layout. For an experiment like yours with a complex design matrix, the glmQLFit step will take somewhat longer but should still be doable.

ADD REPLY • link 14 months ago Gordon Smyth 52k