Question

edgeR in a situation with unbalanced design: to include only subjects with matching time points or not?

0

Entering edit mode

cafelumiere12 ▴ 20

@cafelumiere12-7513

Last seen 6.1 years ago

United States

Edited the original question to make it a little clearer:

I have a gene expression data set of 165 samples and different subject/ Time-point at which the sample was collected ( after a certain dose of drug and days), source of sample, whether it was CD33/34 enriched or not.

A subset of this actual data looks like this:

Sample	Subject	TimePoint	Source	Sample.Type_Mod	Group
31	1002-1011	Baseline	BM	CD33_34pos	Baseline.BM.CD33_34pos
32	1001-1001	Dose1Day1	PBMC	CD33_34pos	Dose1Day1.PBMC.CD33_34pos
33	1001-1006	Dose1Day1	PBMC	CD33_34pos	Dose1Day1.PBMC.CD33_34pos
34	1001-1012	Dose1Day1	PBMC	CD33_34pos	Dose1Day1.PBMC.CD33_34pos
35	1002-1002	Dose1Day1	PBMC	CD33_34pos	Dose1Day1.PBMC.CD33_34pos
63	1002-1014	Dose1Day2	PBMC	CD33_34pos	Dose1Day2.PBMC.CD33_34pos
64	1001-1016	Dose1Day8	PBMC	CD33_34pos	Dose1Day8.PBMC.CD33_34pos
65	1002-1010	Dose1Day8	PBMC	CD33_34pos	Dose1Day8.PBMC.CD33_34pos
67	1002-1013	Dose1Day8	PBMC	CD33_34pos	Dose1Day8.PBMC.CD33_34pos

I concatenated time point, sample source and sample type as the grouping factor:

design <- model.matrix(~0+TimePoint+Source+Sample.Type)

My question is that not all subjects have the same matching time point collected or even same sample source. For example, I wanted to look at the differences of CD33/34+ PBMC samples between samples collected after Dose1Day8 and after Dose1Day1. There are 20 CD33/34+ PBMC samples from Dose1Day1, only 7 of those from Dose1Day8.

Furthermore, only 5 subjects are present in both time points.

Do I only look at those 5 subjects (making the comparison balanced and 10 samples total)? If that's the case then I'll have to filter the samples first instead of fitting all the data to one glm model?

Or, do I look at all the available samples within that group and make the contrast looks something like :

Group = factor(sampleInfoAll$Group)
design <- model.matrix(~0+Group)
my.contrasts = makeContrasts(Dose1Day8.PBMC.CD33_34pos_vs_Dose1Day1.PBMC.CD33_34pos=Dose1Day8.PBMC.CD33_34pos-Dose1Day1.PBMC.CD33_34pos,levels=design)

edgeR differential gene expression • 1.7k views

ADD COMMENT • link updated 6.7 years ago by Gordon Smyth 50k • written 6.7 years ago by cafelumiere12 ▴ 20

0

Entering edit mode

Can I check, have you shown the whole dataset or is the table above just a subset of the data? If this is just a subset, how many subjects do you have in total?

ADD REPLY • link 6.7 years ago Gordon Smyth 50k

0

Entering edit mode

Thank you for checking! No this is only a subset of the (simplified) data. The actual data set has 165 samples and different subject/ Time-point at which the sample was collected ( after a certain dose of drug and days), source of sample, whether it was CD33/34 enriched or not.

A subset of this actual data looks like this:

Sample	Subject	TimePoint	Source	Sample.Type_Mod	Group
31	1002-1011	Baseline	BM	CD33_34pos	Baseline.BM.CD33_34pos
32	1001-1001	Dose1Day1	PBMC	CD33_34pos	Dose1Day1.PBMC.CD33_34pos
33	1001-1006	Dose1Day1	PBMC	CD33_34pos	Dose1Day1.PBMC.CD33_34pos
34	1001-1012	Dose1Day1	PBMC	CD33_34pos	Dose1Day1.PBMC.CD33_34pos
35	1002-1002	Dose1Day1	PBMC	CD33_34pos	Dose1Day1.PBMC.CD33_34pos
63	1002-1014	Dose1Day2	PBMC	CD33_34pos	Dose1Day2.PBMC.CD33_34pos
64	1001-1016	Dose1Day8	PBMC	CD33_34pos	Dose1Day8.PBMC.CD33_34pos
65	1002-1010	Dose1Day8	PBMC	CD33_34pos	Dose1Day8.PBMC.CD33_34pos
67	1002-1013	Dose1Day8	PBMC	CD33_34pos	Dose1Day8.PBMC.CD33_34pos

I concatenated time point, sample source and sample type as the grouping factor.

The challenge is that not all subjects have the same matching time point collected or even same sample source. For example, I wanted to look at the differences of CD33/34+ PBMC samples between samples collected after Dose1Day8 and after Dose1Day1. There are 20 CD33/34+ PBMC samples from Dose1Day1, only 7 of those from Dose1Day8.

Furthermore, only 5 subjects are present in both time points.

Do I only look at those 5 subjects (making the comparison balanced and 10 samples total)? If that's the case then I'll have to filter the samples first instead of fitting all the data to one glm model?

Or, do I look at all the available samples within that group and make the contrast looks something like :

Group = factor(sampleInfoAll$Group)
design <- model.matrix(~0+Group)
my.contrasts = makeContrasts(Dose1Day8.PBMC.CD33_34pos_vs_Dose1Day1.PBMC.CD33_34pos=(1/7)*Dose1Day8.PBMC.CD33_34pos-(1/20)*Dose1Day1.PBMC.CD33_34pos,levels=design)

ADD REPLY • link 6.7 years ago cafelumiere12 ▴ 20

0

Entering edit mode

Just one important point regarding your code snippet. You never need to account for sample sizes, by dividing by 7 or dividing by 20, when forming contrasts in edgeR. edgeR always accounts for sample sizes correctly by itself. If you want to compare Day8 to Day1 you just use Dose1Day8.PBMC.CD33_34pos - Dose1Day1.PBMC.CD33_34pos.

ADD REPLY • link 6.7 years ago Gordon Smyth 50k

0

Entering edit mode

Thank you very much! That's right - I was thinking about group...thank you so much for pointing it out

ADD REPLY • link 6.7 years ago cafelumiere12 ▴ 20

score 1 · Answer 1 · 2017-08-04

1

Entering edit mode

Aaron Lun ★ 28k

@alun

Last seen 11 hours ago

The city by the bay

You mention that you have multiple samples from the same subject. However, your design matrix does not contain a term for Subject. This will violate the assumption of independence for the random component of the GLM. Based on the subset of samples in your post, I see two approaches to restoring independence:

Keep only the subjects with samples from both time points, and add Subject to the design matrix. (Technically there is no need to remove the other samples, but they won't be of much use in the GLM anyway as they will always be fitted perfectly by their own subject-specific term.)
Keep only one sample from each subject (i.e., discard one from each pair of samples from one subject).

The idea here is to avoid dependencies between samples from the same subject, either by modelling them explicitly in the model in approach 1, or by removing some of the relevant samples as in approach 2. Given you have 165 samples, I would recommend approach 2 as it retains most of the information in the data set (only 5 samples lost).

I would also suggest using the Group term in your design matrix rather than adding Source, Sample.Type and Timepoint, as the former parametrization avoids making assumptions about the additivity of the latter effects.

Finally, you mention two time points, but I can see at least one more: Dose1Day2. I'll just ignore this.

ADD COMMENT • link 6.7 years ago Aaron Lun ★ 28k

0

Entering edit mode

Thank you so much for your reply Aaron. I'm trying to simplify this data set to post here. So here is another subset hopefully better represent the data. Please excuse my naive question as I have not done analysis with subject ID included before

Analysis.Barcode	Subject	Cycle	Source	Sample.Type	Group
Sample_1	1001-1001	C1D1	PBMC	CD33_34pos	C1D1.PBMC.CD33_34pos
Sample_2	1001-1002	C1D1	PBMC	CD33_34pos	C1D1.PBMC.CD33_34pos
Sample_3	1001-1002	C1D2	PBMC	CD33_34pos	C1D2.PBMC.CD33_34pos
Sample_4	1001-1002	C1D8	PBMC	CD33_34pos	C1D8.PBMC.CD33_34pos
Sample_5	1001-1003	C1D1	PBMC	CD33_34pos	C1D1.PBMC.CD33_34pos
Sample_6	1001-1003	C1D2	PBMC	CD33_34pos	C1D2.PBMC.CD33_34pos
Sample_7	1001-1003	C1D8	PBMC	CD33_34pos	C1D8.PBMC.CD33_34pos
Sample_8	1001-1004	C1D1	PBMC	CD33_34pos	C1D1.PBMC.CD33_34pos
Sample_9	1001-1006	C1D1	PBMC	CD33_34pos	C1D1.PBMC.CD33_34pos
Sample_10	1001-1008	C1D1	PBMC	CD33_34pos	C1D1.PBMC.CD33_34pos
Sample_11	1001-1010	C1D1	PBMC	CD33_34pos	C1D1.PBMC.CD33_34pos
Sample_12	1001-1010	C1D2	PBMC	CD33_34pos	C1D2.PBMC.CD33_34pos
Sample_13	1001-1012	C1D1	PBMC	CD33_34pos	C1D1.PBMC.CD33_34pos
Sample_14	1001-1013	C1D2	PBMC	CD33_34pos	C1D2.PBMC.CD33_34pos
Sample_15	1001-1013	C1D8	PBMC	CD33_34pos	C1D8.PBMC.CD33_34pos

In order to analyze differences between (1) C1D8 vs. C1D1 and (2) C1D8 vs C1D1, May I ask what is a good way to include the subject ID into the design matrix? I understand that there is only one single Source and Sample.Type here but the actual data set has different Sources and Sample Types

Currently I am trying the following:

Grp <- factor(sampleInfoAll$Group)
Sub <- factor(sampleInfoAll$Subject)
design <- model.matrix(~0+Grp+Sub)

but I'm a little confused looking at the design matrix as to how to make contrasts..or shall I use something more like what is shown in the edgeR manual 3.5? Thanks so much again

ADD REPLY • link 6.7 years ago cafelumiere12 ▴ 20

0

Entering edit mode

Your design matrix seems to be correct. Each coefficient with a Grp name should represent the average log-expression of that group (technically of the first subject, but don't worry about that), while the Sub terms represent the log-fold change in expression of every other subject over the first subject. To form contrasts, simply use makeContrasts to compare the relevant levels of Grp, e.g., GrpC1D1.PBMC.CD33_34pos - GrpC1D2.PBMC.CD33_34pos.

ADD REPLY • link 6.7 years ago Aaron Lun ★ 28k

0

Entering edit mode

Thank you very much!

ADD REPLY • link 6.7 years ago cafelumiere12 ▴ 20

score 0 · Answer 2 · 2017-08-06

If you use edgeR to do the analysis, you must include Subject in the linear model -- see Aaron's answer.

Personally, I would use limma to analyse an experiment like this and would use duplicateCorrelation() to model the correlation between repeat observations from the same subject. limma is able to compare treatments even when they are not applied to the same subject, in other words it can combine both within subject and between subject comparisons.

The edgeR approach is more conservative. The limma approach recovers a lot of extra information from the between-subject comparisons. It depends on how conservative you want to be.