Question

paired and unpaired test at the same time with voom limma

0

Entering edit mode

dagsbio • 0

@dagsbio-23131

Last seen 2.9 years ago

Spain

Dear all,

I am trying to do a differential expression analysis for which I have 200 tumor samples and 170 of their paired normal tissues, from the same patient.

We intend to take them all (200) into the same analysis (same paper) and I am wondering how can I analyze them together in a statistically acceptable way without losing the 30 tumor samples without paired normal tissue

The metadata then would look like this (imagine with this example that I have 20 tumor samples samples instead of 200: 17 patients will be paired, 3 of them would be unpaired):

   sample patient   type
1       1       1 Normal
2       2       2 Normal
3       3       3 Normal
4       4       4 Normal
5       5       5 Normal
6       6       6 Normal
7       7       7 Normal
8       8       8 Normal
9       9       9 Normal
10     10      10 Normal
11     11      11 Normal
12     12      12 Normal
13     13      13 Normal
14     14      14 Normal
15     15      15 Normal
16     16      16 Normal
17     17      17 Normal
18     18       1  Tumor
19     19       2  Tumor
20     20       3  Tumor
21     21       4  Tumor
22     22       5  Tumor
23     23       6  Tumor
24     24       7  Tumor
25     25       8  Tumor
26     26       9  Tumor
27     27      10  Tumor
28     28      11  Tumor
29     29      12  Tumor
30     30      13  Tumor
31     31      14  Tumor
32     32      15  Tumor
33     33      16  Tumor
34     34      17  Tumor
35     35      18  Tumor
36     36      19  Tumor
37     37      20  Tumor

Can i put them all together in the same design? Is this looking right to you?

As I want to do the unpaired + paired (when available) tests at the same time, can I then define the model.matrix like this?:

design <- model.matrix(~ patient + type)

After that, the main idea is to compare the resulting log2foldchanges in order to find subgroups of patients with distinct patterns of expression, does it sound acceptable?

Thank you in advance,

David.

limma voom rna-seq paired test unpaired test • 2.8k views

ADD COMMENT • link updated 5.7 years ago by Gordon Smyth 53k • written 5.7 years ago by dagsbio • 0

0

Entering edit mode

Sorry I have just updated the question as I noticed I wasn't using R markdown guidelines.

Hope anyone could help now.

Cheers!

ADD REPLY • link 5.7 years ago dagsbio • 0

0

Entering edit mode

Sorry I have just updated the question as I noticed I wasn't using R markdown guidelines.

Hope anyone could help now.

Cheers!

ADD REPLY • link 5.7 years ago dagsbio • 0

score 2 · Answer 1 · 2020-03-25

One way to handle this is to remove Patient from the design matrix and instead use duplicateCorrelation to estimate the correlation between repeated samples from the same patient. This approach takes advantage of the pairing when it is available but also allows you to use the unpaired samples.

We used this approach for example in:

Greenawalt, D. M., Duong, C., Smyth, G. K., Ciavarella, M. L., Thompson, N. J., Tiang, T., Murray, W. K., Thomas, R. J S., and Phillips, W. A. (2007). Gene expression profiling of esophageal cancer: comparative analysis of Barrett's esophagus, adenocarcinoma and squamous cell carcinoma. International Journal of Cancer 120, 1914-1921.

where we had matched normal and esophageal samples together with some unmatched samples, i.e., the same situation you have.

If instead you include Patient in the design matrix, then the unpaired samples will not contribute to the analysis, as James has explained.

score 1 · Answer 2 · 2020-03-25

You need both observations from each pair for those samples to be worthwhile. You can see this from a simple example:

> df <- data.frame(subject = factor(rep(1:4, c(2,2,1,1))), type = factor(rep(c("Normal","Tumor"), times = 3)))
> df
  subject   type
1       1 Normal
2       1  Tumor
3       2 Normal
4       2  Tumor
5       3 Normal
6       4  Tumor
> model.matrix(~subject + type, df)
  (Intercept) subject2 subject3 subject4 typeTumor
1           1        0        0        0         0
2           1        0        0        0         1
3           1        1        0        0         0
4           1        1        0        0         1
5           1        0        1        0         0
6           1        0        0        1         1

The design matrix just outlines the coefficients you will solve for, for each subject. And each row corresponds to each subject. So you can figure out what each coefficient is algebraically. For example, the first row only has one 1, so only one coefficient. And that sample is Subject 1, normal tissue. So that's what the coefficient is.

Row 2 is Subject 1 tumor sample, and we already know what the first coefficient is, so we can solve for the other

S1T = S1N + X => X = S1T - S1N

Row 3 is

S2N = S1N + X => X = S2N - S1N

so the subject 2 coefficient is just the difference between subject 2 normal and subject 1 normal.

Row 4 is

S2T = S1N + (S2N - S1N) + X => X = S2T - S2N

so the typeTumor coefficient is the difference between the tumor and normal for subject 2.

Row 5 is

S3N = S1N + X => X = S3N - S1N

But there's no tumor sample, so all we do is compute the difference between the normals and that's it - you don't gain any insight into tumor vs normal for this sample!

Row 6 is

S4T = S1N + X + Y => X + Y = S4T - S1N

But without the normal sample from subject 4, we can't solve for both X and Y, and the results are uninformative. We know that X in this situation is supposed to be S4N - S1N, but we don't have the sample to estimate that difference. So in this simple example, only data from subjects 1 and 2 can be used to compute the mean difference between tumor and normal samples.