Question

a complicate scenario for Design Matrix

0

Entering edit mode

kainblue100 • 0

@kainblue100-10580

Last seen 8.0 years ago

I am using edgeR for DEG analysis, the following is a description of my data:

11 patients, taken two types of tissue from each patient, each type of tissue taken both affected region and unaffected region. For example, for patient01, tissue1_affected, tissue1_unaffected, tissue2_affected, tissue2_unaffected, four samples are taken from this patient

3 other patients, taken only one type of tissue from each of these three patients, both affected region and unaffected region. For example, for patient12, tissue1_affected, tissue1_unaffected

3 normal people, taken two types of tissue from each normal, people. For example, for normal01, tissue1_unaffected, tissue2_unaffected, two samples are taken from this person

I am interested in comparing all the unaffected tissue1 with all the affected tissues1 across different subjects including both from the same subject or different subjects scenario

The same for tissue2

How to design one matrix for this? And how will the contrast matrix look like?

If use each person as a block, and combine the tissue type and affected status as treatment, I am thinking the following design matrix:

block=c(001,002,..........)

treatment=c(tissue1_affected,tissue1_unaffected,tissue2_affected,tissue2_unaffected,........)

design=model.matrix(~0+treatment+block)

The design matrix column names (coefficients) will be look like this:

tissue1_affected tissue1_unaffected tissue2_affected tissue2_unaffected 001 002 ..........

Then I realize in the design matrix I am missing one coefficient which stand for one block person

If I want to get the DEG from comparing affected tissue1 with unaffected tissue1, should I just give the contrast matrix as

for tissue1: contrast=c(-1,1,0,0.........) (the rest are all 0s)

for tissue2: contrast=c(0,0,-1,1,0.........) (the rest are all 0s)

Thanks

Thanks a lot

edgeR design matrix • 839 views

ADD COMMENT • link updated 8.0 years ago by Aaron Lun ★ 28k • written 8.0 years ago by kainblue100 • 0

0

Entering edit mode

For starters, put the edgeR tag in your post, otherwise we won't get notified about your question. And if you want to add information, either edit your post or add a comment, rather than making a new answer.

ADD REPLY • link 8.0 years ago Aaron Lun ★ 28k

score 2 · Accepted Answer · 2016-04-28

There's likely to be a patient-specific effect, which means you'll have to block on the patient. This means that you won't be able to do comparisons of affected/unaffected samples between different subjects in edgeR. (If you want that, you'll have to look at duplicateCorrelation in limma after using voom.) However, you can still compare within subjects, using the 14 patients that have both affected and unaffected samples for tissue 1. Let's set up some demonstration code, ignoring the healthy patients without affected samples:

patient <- factor(c(rep(1:11, each=4), rep(12:14, each=2)))
tissue <- factor(c(rep(rep(1:2, each=2), 11), rep(1, 6)))
affected <- factor(rep(c(TRUE, FALSE), 25))

Now there's a couple of ways to proceed, but the most flexible would be to do something like this:

patient.tissue <- paste0("P", patient, ".T", tissue)
design <- model.matrix(~ 0 + patient.tissue + tissue:affected)
design <- design[,-grep("FALSE", colnames(design))] # to get to full rank

This basically defines a blocking factor for each patient/tissue combination, i.e., each tissue in each patient is allowed to have its own "baseline" expression level. If we look at colnames(design), this is what the first 25 terms are:

 [1] "patient.tissueP10.T1" "patient.tissueP10.T2" "patient.tissueP11.T1"
 [4] "patient.tissueP11.T2" "patient.tissueP12.T1" "patient.tissueP13.T1"
 [7] "patient.tissueP14.T1" "patient.tissueP1.T1"  "patient.tissueP1.T2"
[10] "patient.tissueP2.T1"  "patient.tissueP2.T2"  "patient.tissueP3.T1"
[13] "patient.tissueP3.T2"  "patient.tissueP4.T1"  "patient.tissueP4.T2"
[16] "patient.tissueP5.T1"  "patient.tissueP5.T2"  "patient.tissueP6.T1"
[19] "patient.tissueP6.T2"  "patient.tissueP7.T1"  "patient.tissueP7.T2"
[22] "patient.tissueP8.T1"  "patient.tissueP8.T2"  "patient.tissueP9.T1"
[25] "patient.tissueP9.T2"  "tissue1:affectedTRUE" "tissue2:affectedTRUE"

The last two terms are the things you're interested in; these represent the DE between affected and unaffected for tissue 1 or tissue 2. If you drop either, you can test for affected/unaffected DE in the corresponding tissue. Or, you can set a contrast vector to compare them, then you can find "differential DE" between tissues, i.e., genes with tissue-specific responses upon the tissue being "affected".

There's a couple of points that warrant some more explanation:

We're not using the healthy individuals at all in the above design. This is because, after blocking on the individuals in edgeR, we can't compare directly between samples from different individuals. Thus, if you want to look for differences between affected and unaffected samples, the healthy individuals won't be much help, because they don't have affected samples with which to compare to within the same individual. If you want to compare between tissues, then the healthy individuals will become more useful.
Another parametrization of the design matrix is ~ 0 + patient + tissue + tissue:affected. This will allow for direct comparisons between tissues, but it also assumes that the tissue 1/tissue 2 log-fold change is consistent between patients. If you're not interested in directly comparing expression between tissues, then stick with my first design to allow the model to be more flexible. (Note that the "differential DE" compares DE between tissues, not direct expression.)