a complicate scenario for Design Matrix
1
0
Entering edit mode
@kainblue100-10580
Last seen 8.0 years ago

I am using edgeR for DEG analysis, the following is a description of my data:

11 patients, taken two types of tissue from each patient, each type of tissue taken both affected region and unaffected region. For example, for patient01, tissue1_affected, tissue1_unaffected, tissue2_affected, tissue2_unaffected, four samples are taken from this patient

3 other patients, taken only one type of tissue from each of these three patients, both affected region and unaffected region. For example, for patient12, tissue1_affected, tissue1_unaffected

3 normal people, taken two types of tissue from each normal, people. For example, for normal01, tissue1_unaffected,  tissue2_unaffected, two samples are taken from this person

I am interested in comparing all the unaffected tissue1 with all the affected tissues1 across different subjects including both from the same subject or different subjects scenario

The same for tissue2

How to design one matrix for this? And how will the contrast matrix look like?

If use each person as a block, and combine the tissue type and affected status as treatment, I am thinking the following design matrix:

block=c(001,002,..........)

treatment=c(tissue1_affected,tissue1_unaffected,tissue2_affected,tissue2_unaffected,........)

design=model.matrix(~0+treatment+block)

The design matrix column names (coefficients) will be look like this:

tissue1_affected tissue1_unaffected tissue2_affected tissue2_unaffected 001 002 ..........

Then I realize in the design matrix I am missing one coefficient which stand for one block person

If I want to get the DEG from comparing affected tissue1 with unaffected tissue1, should I just give the contrast matrix as

for tissue1:  contrast=c(-1,1,0,0.........) (the rest are all 0s)

for tissue2:  contrast=c(0,0,-1,1,0.........) (the rest are all 0s)

 

Thanks 

Thanks a lot

 

 

 

 

edgeR design matrix • 839 views
ADD COMMENT
0
Entering edit mode

For starters, put the edgeR tag in your post, otherwise we won't get notified about your question. And if you want to add information, either edit your post or add a comment, rather than making a new answer.

ADD REPLY
2
Entering edit mode
Aaron Lun ★ 28k
@alun
Last seen just now
The city by the bay

There's likely to be a patient-specific effect, which means you'll have to block on the patient. This means that you won't be able to do comparisons of affected/unaffected samples between different subjects in edgeR. (If you want that, you'll have to look at duplicateCorrelation in limma after using voom.) However, you can still compare within subjects, using the 14 patients that have both affected and unaffected samples for tissue 1. Let's set up some demonstration code, ignoring the healthy patients without affected samples:

patient <- factor(c(rep(1:11, each=4), rep(12:14, each=2)))
tissue <- factor(c(rep(rep(1:2, each=2), 11), rep(1, 6)))
affected <- factor(rep(c(TRUE, FALSE), 25))

Now there's a couple of ways to proceed, but the most flexible would be to do something like this:

patient.tissue <- paste0("P", patient, ".T", tissue)
design <- model.matrix(~ 0 + patient.tissue + tissue:affected)
design <- design[,-grep("FALSE", colnames(design))] # to get to full rank

This basically defines a blocking factor for each patient/tissue combination, i.e., each tissue in each patient is allowed to have its own "baseline" expression level. If we look at colnames(design), this is what the first 25 terms are:

 [1] "patient.tissueP10.T1" "patient.tissueP10.T2" "patient.tissueP11.T1"
 [4] "patient.tissueP11.T2" "patient.tissueP12.T1" "patient.tissueP13.T1"
 [7] "patient.tissueP14.T1" "patient.tissueP1.T1"  "patient.tissueP1.T2"
[10] "patient.tissueP2.T1"  "patient.tissueP2.T2"  "patient.tissueP3.T1"
[13] "patient.tissueP3.T2"  "patient.tissueP4.T1"  "patient.tissueP4.T2"
[16] "patient.tissueP5.T1"  "patient.tissueP5.T2"  "patient.tissueP6.T1"
[19] "patient.tissueP6.T2"  "patient.tissueP7.T1"  "patient.tissueP7.T2"
[22] "patient.tissueP8.T1"  "patient.tissueP8.T2"  "patient.tissueP9.T1"
[25] "patient.tissueP9.T2"  "tissue1:affectedTRUE" "tissue2:affectedTRUE"

The last two terms are the things you're interested in; these represent the DE between affected and unaffected for tissue 1 or tissue 2. If you drop either, you can test for affected/unaffected DE in the corresponding tissue. Or, you can set a contrast vector to compare them, then you can find "differential DE" between tissues, i.e., genes with tissue-specific responses upon the tissue being "affected".

There's a couple of points that warrant some more explanation:

  • We're not using the healthy individuals at all in the above design. This is because, after blocking on the individuals in edgeR, we can't compare directly between samples from different individuals. Thus, if you want to look for differences between affected and unaffected samples, the healthy individuals won't be much help, because they don't have affected samples with which to compare to within the same individual. If you want to compare between tissues, then the healthy individuals will become more useful.
  • Another parametrization of the design matrix is ~ 0 + patient + tissue + tissue:affected. This will allow for direct comparisons between tissues, but it also assumes that the tissue 1/tissue 2 log-fold change is consistent between patients. If you're not interested in directly comparing expression between tissues, then stick with my first design to allow the model to be more flexible. (Note that the "differential DE" compares DE between tissues, not direct expression.)
ADD COMMENT
0
Entering edit mode

Thank you so much. This is really helpful. Actually I also did the analysis in voom, it is easier to understand. But edgeR gives me so many confusion. So basically for edgeR the normal subjects are not helpful, should I just using all the paired samples from patients? Or at least the normal subjects are used in the biological variation estimation step?

Thanks

ADD REPLY
0
Entering edit mode

The parametrization of the design matrix is more or less the same between edgeR and voom, so if you understand how it works in voom, then you should understand how it works in edgeR. Regardless of which method you use, if you block on the subject in the design matrix, then you will not be able to compare between subjects. As such, for your DE comparison between affected and unaffected tissues, the normal subjects will not be helpful.

In fact, if you're using my first design, then the normal subjects won't even contribute to the estimation of the biological variance, because each sample ends up with its own tissue:patient term. In the second design, the normal subjects will contribute to variance estimation, but that comes at the cost of reduced model flexibility. Given that you've got plenty of residual degrees of freedom in both designs, I wouldn't change the design just to scrounge up a few more residual d.f. for variance estimation. The choice should be driven by whether you want to compare between tissues or not.

Finally, the main difference in experimental parametrization between edgeR and voom is that, in the latter, you can instead choose to block on the subject in duplicateCorrelation. This will allow you to compare directly between subjects, but this convenience doesn't come for free. There are some assumptions involved (e.g., homogeneous subject effects, common correlation across genes), so you should decide whether you need it before you use it.

ADD REPLY
0
Entering edit mode

It is really a brilliant discussion! It is my honor to know you in this forum. Thank you for all the detail explanation which inspire me to another level of the RNA-seq DEG analysis. I do believe the design part is the key component and it is rely on a deep understanding of the statistical model.  

ADD REPLY

Login before adding your answer.

Traffic: 472 users visited in the last hour
Help About
FAQ
Access RSS
API
Stats

Use of this site constitutes acceptance of our User Agreement and Privacy Policy.

Powered by the version 2.3.6