paired and unpaired test at the same time with voom limma
2
0
Entering edit mode
dagsbio • 0
@dagsbio-23131
Last seen 9 months ago
Spain

Dear all,

I am trying to do a differential expression analysis for which I have 200 tumor samples and 170 of their paired normal tissues, from the same patient.

We intend to take them all (200) into the same analysis (same paper) and I am wondering how can I analyze them together in a statistically acceptable way without losing the 30 tumor samples without paired normal tissue

The metadata then would look like this (imagine with this example that I have 20 tumor samples samples instead of 200: 17 patients will be paired, 3 of them would be unpaired):

   sample patient   type
1       1       1 Normal
2       2       2 Normal
3       3       3 Normal
4       4       4 Normal
5       5       5 Normal
6       6       6 Normal
7       7       7 Normal
8       8       8 Normal
9       9       9 Normal
10     10      10 Normal
11     11      11 Normal
12     12      12 Normal
13     13      13 Normal
14     14      14 Normal
15     15      15 Normal
16     16      16 Normal
17     17      17 Normal
18     18       1  Tumor
19     19       2  Tumor
20     20       3  Tumor
21     21       4  Tumor
22     22       5  Tumor
23     23       6  Tumor
24     24       7  Tumor
25     25       8  Tumor
26     26       9  Tumor
27     27      10  Tumor
28     28      11  Tumor
29     29      12  Tumor
30     30      13  Tumor
31     31      14  Tumor
32     32      15  Tumor
33     33      16  Tumor
34     34      17  Tumor
35     35      18  Tumor
36     36      19  Tumor
37     37      20  Tumor



Can i put them all together in the same design? Is this looking right to you?

As I want to do the unpaired + paired (when available) tests at the same time, can I then define the model.matrix like this?:

design <- model.matrix(~ patient + type)

After that, the main idea is to compare the resulting log2foldchanges in order to find subgroups of patients with distinct patterns of expression, does it sound acceptable?

Thank you in advance,

David.

limma voom rna-seq paired test unpaired test • 552 views
0
Entering edit mode

Sorry I have just updated the question as I noticed I wasn't using R markdown guidelines.

Hope anyone could help now.

Cheers!

0
Entering edit mode

Sorry I have just updated the question as I noticed I wasn't using R markdown guidelines.

Hope anyone could help now.

Cheers!

2
Entering edit mode
@gordon-smyth
Last seen 11 minutes ago
WEHI, Melbourne, Australia

One way to handle this is to remove Patient from the design matrix and instead use duplicateCorrelation to estimate the correlation between repeated samples from the same patient. This approach takes advantage of the pairing when it is available but also allows you to use the unpaired samples.

We used this approach for example in:

Greenawalt, D. M., Duong, C., Smyth, G. K., Ciavarella, M. L., Thompson, N. J., Tiang, T., Murray, W. K., Thomas, R. J S., and Phillips, W. A. (2007). Gene expression profiling of esophageal cancer: comparative analysis of Barrett's esophagus, adenocarcinoma and squamous cell carcinoma. International Journal of Cancer 120, 1914-1921.

where we had matched normal and esophageal samples together with some unmatched samples, i.e., the same situation you have.

If instead you include Patient in the design matrix, then the unpaired samples will not contribute to the analysis, as James has explained.

0
Entering edit mode

Fantastic, thank you!

1
Entering edit mode
@james-w-macdonald-5106
Last seen 1 day ago
United States

You need both observations from each pair for those samples to be worthwhile. You can see this from a simple example:

> df <- data.frame(subject = factor(rep(1:4, c(2,2,1,1))), type = factor(rep(c("Normal","Tumor"), times = 3)))
> df
subject   type
1       1 Normal
2       1  Tumor
3       2 Normal
4       2  Tumor
5       3 Normal
6       4  Tumor
> model.matrix(~subject + type, df)
(Intercept) subject2 subject3 subject4 typeTumor
1           1        0        0        0         0
2           1        0        0        0         1
3           1        1        0        0         0
4           1        1        0        0         1
5           1        0        1        0         0
6           1        0        0        1         1



The design matrix just outlines the coefficients you will solve for, for each subject. And each row corresponds to each subject. So you can figure out what each coefficient is algebraically. For example, the first row only has one 1, so only one coefficient. And that sample is Subject 1, normal tissue. So that's what the coefficient is.

Row 2 is Subject 1 tumor sample, and we already know what the first coefficient is, so we can solve for the other

S1T = S1N + X => X = S1T - S1N


Row 3 is

S2N = S1N + X => X = S2N - S1N


so the subject 2 coefficient is just the difference between subject 2 normal and subject 1 normal.

Row 4 is

S2T = S1N + (S2N - S1N) + X => X = S2T - S2N


so the typeTumor coefficient is the difference between the tumor and normal for subject 2.

Row 5 is

S3N = S1N + X => X = S3N - S1N


But there's no tumor sample, so all we do is compute the difference between the normals and that's it - you don't gain any insight into tumor vs normal for this sample!

Row 6 is

S4T = S1N + X + Y => X + Y = S4T - S1N


But without the normal sample from subject 4, we can't solve for both X and Y, and the results are uninformative. We know that X in this situation is supposed to be S4N - S1N, but we don't have the sample to estimate that difference. So in this simple example, only data from subjects 1 and 2 can be used to compute the mean difference between tumor and normal samples.

0
Entering edit mode

Dear James,

Thank you for your very detailed explanation. The problem seems much clear now analyzing it row by row.

I performed a PCA and I realized most normal samples are grouping together, then I pick a number of samples taken from the "core" of where most normal samples are grouped here :

Could I use these random normal samples from other patients as normal pairs for my 30 tumors with their missing partner? Or better to just build a unpaired separate model for those 30 samples taking 30 random normal samples from other patients?

If so, would be acceptable to compare the resulting log2foldchanges of the two analyses together? Which is actually the main idea: to somehow be able to compare all tumors after extracting their "normal" part Is there maybe a better way to do this in your opinion? (i dont have technical replicates for each patient)

Sorry for the long comment, but I can't figure a way to do this on my own.

Thank you again!

PD: not sure if you can see the PCA plot, is here: https://ibb.co/nsCxzg8