edgeR in a situation with unbalanced design: to include only subjects with matching time points or not?
2
0
Entering edit mode
@cafelumiere12-7513
Last seen 6.1 years ago
United States

Edited the original question to make it a little clearer:

I have a gene expression data set of 165 samples and different subject/ Time-point at which the sample was collected ( after a certain dose of drug and days), source of sample, whether it was CD33/34 enriched or not. 

A subset of this actual data looks like this:

Sample Subject TimePoint Source Sample.Type_Mod Group
31 1002-1011 Baseline BM CD33_34pos Baseline.BM.CD33_34pos
32 1001-1001 Dose1Day1 PBMC CD33_34pos Dose1Day1.PBMC.CD33_34pos
33 1001-1006 Dose1Day1 PBMC CD33_34pos Dose1Day1.PBMC.CD33_34pos
34 1001-1012 Dose1Day1 PBMC CD33_34pos Dose1Day1.PBMC.CD33_34pos
35 1002-1002 Dose1Day1 PBMC CD33_34pos Dose1Day1.PBMC.CD33_34pos
63 1002-1014 Dose1Day2 PBMC CD33_34pos Dose1Day2.PBMC.CD33_34pos
64 1001-1016 Dose1Day8 PBMC CD33_34pos Dose1Day8.PBMC.CD33_34pos
65 1002-1010 Dose1Day8 PBMC CD33_34pos Dose1Day8.PBMC.CD33_34pos
67 1002-1013 Dose1Day8 PBMC CD33_34pos Dose1Day8.PBMC.CD33_34pos

I concatenated time point, sample source and sample type as the grouping factor:

design <- model.matrix(~0+TimePoint+Source+Sample.Type)

My question is that not all subjects have the same matching time point collected or even same sample source. For example, I wanted to look at the differences of CD33/34+ PBMC samples between samples collected after Dose1Day8 and after Dose1Day1. There are 20 CD33/34+ PBMC samples from Dose1Day1, only 7 of those from Dose1Day8.

Furthermore, only 5 subjects are present in both time points.

Do I only look at those 5 subjects (making the comparison balanced and 10 samples total)? If that's the case then I'll have to filter the samples first instead of fitting all the data to one glm model?

Or, do I look at all the available samples within that group and make the contrast looks something like : 

Group = factor(sampleInfoAll$Group)
design <- model.matrix(~0+Group)
my.contrasts = makeContrasts(Dose1Day8.PBMC.CD33_34pos_vs_Dose1Day1.PBMC.CD33_34pos=Dose1Day8.PBMC.CD33_34pos-Dose1Day1.PBMC.CD33_34pos,levels=design)

 

edgeR differential gene expression • 1.7k views
ADD COMMENT
0
Entering edit mode

Can I check, have you shown the whole dataset or is the table above just a subset of the data? If this is just a subset, how many subjects do you have in total?

ADD REPLY
0
Entering edit mode

Thank you for checking! No this is only a subset of the (simplified) data. The actual data set has 165 samples and different subject/ Time-point at which the sample was collected ( after a certain dose of drug and days), source of sample, whether it was CD33/34 enriched or not. 

A subset of this actual data looks like this:

Sample Subject TimePoint Source Sample.Type_Mod Group
31 1002-1011 Baseline BM CD33_34pos Baseline.BM.CD33_34pos
32 1001-1001 Dose1Day1 PBMC CD33_34pos Dose1Day1.PBMC.CD33_34pos
33 1001-1006 Dose1Day1 PBMC CD33_34pos Dose1Day1.PBMC.CD33_34pos
34 1001-1012 Dose1Day1 PBMC CD33_34pos Dose1Day1.PBMC.CD33_34pos
35 1002-1002 Dose1Day1 PBMC CD33_34pos Dose1Day1.PBMC.CD33_34pos
63 1002-1014 Dose1Day2 PBMC CD33_34pos Dose1Day2.PBMC.CD33_34pos
64 1001-1016 Dose1Day8 PBMC CD33_34pos Dose1Day8.PBMC.CD33_34pos
65 1002-1010 Dose1Day8 PBMC CD33_34pos Dose1Day8.PBMC.CD33_34pos
67 1002-1013 Dose1Day8 PBMC CD33_34pos Dose1Day8.PBMC.CD33_34pos

I concatenated time point, sample source and sample type as the grouping factor.

The challenge is that not all subjects have the same matching time point collected or even same sample source. For example, I wanted to look at the differences of CD33/34+ PBMC samples between samples collected after Dose1Day8 and after Dose1Day1. There are 20 CD33/34+ PBMC samples from Dose1Day1, only 7 of those from Dose1Day8.

Furthermore, only 5 subjects are present in both time points.

Do I only look at those 5 subjects (making the comparison balanced and 10 samples total)? If that's the case then I'll have to filter the samples first instead of fitting all the data to one glm model?

Or, do I look at all the available samples within that group and make the contrast looks something like : 

Group = factor(sampleInfoAll$Group)
design <- model.matrix(~0+Group)
my.contrasts = makeContrasts(Dose1Day8.PBMC.CD33_34pos_vs_Dose1Day1.PBMC.CD33_34pos=(1/7)*Dose1Day8.PBMC.CD33_34pos-(1/20)*Dose1Day1.PBMC.CD33_34pos,levels=design)

ADD REPLY
0
Entering edit mode

Just one important point regarding your code snippet. You never need to account for sample sizes, by dividing by 7 or dividing by 20, when forming contrasts in edgeR. edgeR always accounts for sample sizes correctly by itself. If you want to compare Day8 to Day1 you just use Dose1Day8.PBMC.CD33_34pos - Dose1Day1.PBMC.CD33_34pos.

ADD REPLY
0
Entering edit mode

Thank you very much! That's right - I was thinking about group...thank you so much for pointing it out

ADD REPLY
1
Entering edit mode
Aaron Lun ★ 28k
@alun
Last seen 11 hours ago
The city by the bay

You mention that you have multiple samples from the same subject. However, your design matrix does not contain a term for Subject. This will violate the assumption of independence for the random component of the GLM. Based on the subset of samples in your post, I see two approaches to restoring independence:

  1. Keep only the subjects with samples from both time points, and add Subject to the design matrix. (Technically there is no need to remove the other samples, but they won't be of much use in the GLM anyway as they will always be fitted perfectly by their own subject-specific term.)
  2. Keep only one sample from each subject (i.e., discard one from each pair of samples from one subject).

The idea here is to avoid dependencies between samples from the same subject, either by modelling them explicitly in the model in approach 1, or by removing some of the relevant samples as in approach 2. Given you have 165 samples, I would recommend approach 2 as it retains most of the information in the data set (only 5 samples lost).

I would also suggest using the Group term in your design matrix rather than adding Source, Sample.Type and Timepoint, as the former parametrization avoids making assumptions about the additivity of the latter effects.

Finally, you mention two time points, but I can see at least one more: Dose1Day2. I'll just ignore this.

ADD COMMENT
0
Entering edit mode

Thank you so much for your reply Aaron. I'm trying to simplify this data set to post here. So here is another subset hopefully better represent the data. Please excuse my naive question as I have not done analysis with subject ID included before

Analysis.Barcode Subject Cycle Source Sample.Type Group
Sample_1 1001-1001 C1D1 PBMC CD33_34pos C1D1.PBMC.CD33_34pos
Sample_2 1001-1002 C1D1 PBMC CD33_34pos C1D1.PBMC.CD33_34pos
Sample_3 1001-1002 C1D2 PBMC CD33_34pos C1D2.PBMC.CD33_34pos
Sample_4 1001-1002 C1D8 PBMC CD33_34pos C1D8.PBMC.CD33_34pos
Sample_5 1001-1003 C1D1 PBMC CD33_34pos C1D1.PBMC.CD33_34pos
Sample_6 1001-1003 C1D2 PBMC CD33_34pos C1D2.PBMC.CD33_34pos
Sample_7 1001-1003 C1D8 PBMC CD33_34pos C1D8.PBMC.CD33_34pos
Sample_8 1001-1004 C1D1 PBMC CD33_34pos C1D1.PBMC.CD33_34pos
Sample_9 1001-1006 C1D1 PBMC CD33_34pos C1D1.PBMC.CD33_34pos
Sample_10 1001-1008 C1D1 PBMC CD33_34pos C1D1.PBMC.CD33_34pos
Sample_11 1001-1010 C1D1 PBMC CD33_34pos C1D1.PBMC.CD33_34pos
Sample_12 1001-1010 C1D2 PBMC CD33_34pos C1D2.PBMC.CD33_34pos
Sample_13 1001-1012 C1D1 PBMC CD33_34pos C1D1.PBMC.CD33_34pos
Sample_14 1001-1013 C1D2 PBMC CD33_34pos C1D2.PBMC.CD33_34pos
Sample_15 1001-1013 C1D8 PBMC CD33_34pos C1D8.PBMC.CD33_34pos

In order to analyze differences between (1) C1D8 vs. C1D1 and (2) C1D8 vs C1D1, May I ask what is a good way to include the subject ID into the design matrix? I understand that there is only one single Source and Sample.Type here but the actual data set has different Sources and Sample Types

Currently I am trying the following:

Grp <- factor(sampleInfoAll$Group)
Sub <- factor(sampleInfoAll$Subject)
design <- model.matrix(~0+Grp+Sub)

but I'm a little confused looking at the design matrix as to how to make contrasts..or shall I use something more like what is shown in the edgeR manual 3.5? Thanks so much again

 

ADD REPLY
0
Entering edit mode

Your design matrix seems to be correct. Each coefficient with a Grp name should represent the average log-expression of that group (technically of the first subject, but don't worry about that), while the Sub terms represent the log-fold change in expression of every other subject over the first subject. To form contrasts, simply use makeContrasts to compare the relevant levels of Grp, e.g., GrpC1D1.PBMC.CD33_34pos - GrpC1D2.PBMC.CD33_34pos.

ADD REPLY
0
Entering edit mode

Thank you very much!

ADD REPLY
0
Entering edit mode
@gordon-smyth
Last seen 1 hour ago
WEHI, Melbourne, Australia

If you use edgeR to do the analysis, you must include Subject in the linear model -- see Aaron's answer.

Personally, I would use limma to analyse an experiment like this and would use duplicateCorrelation() to model the correlation between repeat observations from the same subject. limma is able to compare treatments even when they are not applied to the same subject, in other words it can combine both within subject and between subject comparisons.

The edgeR approach is more conservative. The limma approach recovers a lot of extra information from the between-subject comparisons. It depends on how conservative you want to be.

ADD COMMENT

Login before adding your answer.

Traffic: 1026 users visited in the last hour
Help About
FAQ
Access RSS
API
Stats

Use of this site constitutes acceptance of our User Agreement and Privacy Policy.

Powered by the version 2.3.6