Question: edgeR in a situation with unbalanced design: to include only subjects with matching time points or not?
0
2.0 years ago by
United States
cafelumiere1220 wrote:

Edited the original question to make it a little clearer:

I have a gene expression data set of 165 samples and different subject/ Time-point at which the sample was collected ( after a certain dose of drug and days), source of sample, whether it was CD33/34 enriched or not.

A subset of this actual data looks like this:

 Sample Subject TimePoint Source Sample.Type_Mod Group 31 1002-1011 Baseline BM CD33_34pos Baseline.BM.CD33_34pos 32 1001-1001 Dose1Day1 PBMC CD33_34pos Dose1Day1.PBMC.CD33_34pos 33 1001-1006 Dose1Day1 PBMC CD33_34pos Dose1Day1.PBMC.CD33_34pos 34 1001-1012 Dose1Day1 PBMC CD33_34pos Dose1Day1.PBMC.CD33_34pos 35 1002-1002 Dose1Day1 PBMC CD33_34pos Dose1Day1.PBMC.CD33_34pos 63 1002-1014 Dose1Day2 PBMC CD33_34pos Dose1Day2.PBMC.CD33_34pos 64 1001-1016 Dose1Day8 PBMC CD33_34pos Dose1Day8.PBMC.CD33_34pos 65 1002-1010 Dose1Day8 PBMC CD33_34pos Dose1Day8.PBMC.CD33_34pos 67 1002-1013 Dose1Day8 PBMC CD33_34pos Dose1Day8.PBMC.CD33_34pos

I concatenated time point, sample source and sample type as the grouping factor:

design <- model.matrix(~0+TimePoint+Source+Sample.Type)

My question is that not all subjects have the same matching time point collected or even same sample source. For example, I wanted to look at the differences of CD33/34+ PBMC samples between samples collected after Dose1Day8 and after Dose1Day1. There are 20 CD33/34+ PBMC samples from Dose1Day1, only 7 of those from Dose1Day8.

Furthermore, only 5 subjects are present in both time points.

Do I only look at those 5 subjects (making the comparison balanced and 10 samples total)? If that's the case then I'll have to filter the samples first instead of fitting all the data to one glm model?

Or, do I look at all the available samples within that group and make the contrast looks something like :

Group = factor(sampleInfoAll$Group) design <- model.matrix(~0+Group) my.contrasts = makeContrasts(Dose1Day8.PBMC.CD33_34pos_vs_Dose1Day1.PBMC.CD33_34pos=Dose1Day8.PBMC.CD33_34pos-Dose1Day1.PBMC.CD33_34pos,levels=design) ADD COMMENTlink modified 2.0 years ago by Gordon Smyth38k • written 2.0 years ago by cafelumiere1220 Can I check, have you shown the whole dataset or is the table above just a subset of the data? If this is just a subset, how many subjects do you have in total? ADD REPLYlink written 2.0 years ago by Gordon Smyth38k Thank you for checking! No this is only a subset of the (simplified) data. The actual data set has 165 samples and different subject/ Time-point at which the sample was collected ( after a certain dose of drug and days), source of sample, whether it was CD33/34 enriched or not. A subset of this actual data looks like this:  Sample Subject TimePoint Source Sample.Type_Mod Group 31 1002-1011 Baseline BM CD33_34pos Baseline.BM.CD33_34pos 32 1001-1001 Dose1Day1 PBMC CD33_34pos Dose1Day1.PBMC.CD33_34pos 33 1001-1006 Dose1Day1 PBMC CD33_34pos Dose1Day1.PBMC.CD33_34pos 34 1001-1012 Dose1Day1 PBMC CD33_34pos Dose1Day1.PBMC.CD33_34pos 35 1002-1002 Dose1Day1 PBMC CD33_34pos Dose1Day1.PBMC.CD33_34pos 63 1002-1014 Dose1Day2 PBMC CD33_34pos Dose1Day2.PBMC.CD33_34pos 64 1001-1016 Dose1Day8 PBMC CD33_34pos Dose1Day8.PBMC.CD33_34pos 65 1002-1010 Dose1Day8 PBMC CD33_34pos Dose1Day8.PBMC.CD33_34pos 67 1002-1013 Dose1Day8 PBMC CD33_34pos Dose1Day8.PBMC.CD33_34pos I concatenated time point, sample source and sample type as the grouping factor. The challenge is that not all subjects have the same matching time point collected or even same sample source. For example, I wanted to look at the differences of CD33/34+ PBMC samples between samples collected after Dose1Day8 and after Dose1Day1. There are 20 CD33/34+ PBMC samples from Dose1Day1, only 7 of those from Dose1Day8. Furthermore, only 5 subjects are present in both time points. Do I only look at those 5 subjects (making the comparison balanced and 10 samples total)? If that's the case then I'll have to filter the samples first instead of fitting all the data to one glm model? Or, do I look at all the available samples within that group and make the contrast looks something like : Group = factor(sampleInfoAll$Group)
design <- model.matrix(~0+Group)
my.contrasts = makeContrasts(Dose1Day8.PBMC.CD33_34pos_vs_Dose1Day1.PBMC.CD33_34pos=(1/7)*Dose1Day8.PBMC.CD33_34pos-(1/20)*Dose1Day1.PBMC.CD33_34pos,levels=design)



Just one important point regarding your code snippet. You never need to account for sample sizes, by dividing by 7 or dividing by 20, when forming contrasts in edgeR. edgeR always accounts for sample sizes correctly by itself. If you want to compare Day8 to Day1 you just use Dose1Day8.PBMC.CD33_34pos - Dose1Day1.PBMC.CD33_34pos.

Thank you very much! That's right - I was thinking about group...thank you so much for pointing it out

Answer: edgeR in a situation with unbalanced design: to include only subjects with match
1
2.0 years ago by
Aaron Lun24k
Cambridge, United Kingdom
Aaron Lun24k wrote:

You mention that you have multiple samples from the same subject. However, your design matrix does not contain a term for Subject. This will violate the assumption of independence for the random component of the GLM. Based on the subset of samples in your post, I see two approaches to restoring independence:

1. Keep only the subjects with samples from both time points, and add Subject to the design matrix. (Technically there is no need to remove the other samples, but they won't be of much use in the GLM anyway as they will always be fitted perfectly by their own subject-specific term.)
2. Keep only one sample from each subject (i.e., discard one from each pair of samples from one subject).

The idea here is to avoid dependencies between samples from the same subject, either by modelling them explicitly in the model in approach 1, or by removing some of the relevant samples as in approach 2. Given you have 165 samples, I would recommend approach 2 as it retains most of the information in the data set (only 5 samples lost).

I would also suggest using the Group term in your design matrix rather than adding Source, Sample.Type and Timepoint, as the former parametrization avoids making assumptions about the additivity of the latter effects.

Finally, you mention two time points, but I can see at least one more: Dose1Day2. I'll just ignore this.

Thank you so much for your reply Aaron. I'm trying to simplify this data set to post here. So here is another subset hopefully better represent the data. Please excuse my naive question as I have not done analysis with subject ID included before

 Analysis.Barcode Subject Cycle Source Sample.Type Group Sample_1 1001-1001 C1D1 PBMC CD33_34pos C1D1.PBMC.CD33_34pos Sample_2 1001-1002 C1D1 PBMC CD33_34pos C1D1.PBMC.CD33_34pos Sample_3 1001-1002 C1D2 PBMC CD33_34pos C1D2.PBMC.CD33_34pos Sample_4 1001-1002 C1D8 PBMC CD33_34pos C1D8.PBMC.CD33_34pos Sample_5 1001-1003 C1D1 PBMC CD33_34pos C1D1.PBMC.CD33_34pos Sample_6 1001-1003 C1D2 PBMC CD33_34pos C1D2.PBMC.CD33_34pos Sample_7 1001-1003 C1D8 PBMC CD33_34pos C1D8.PBMC.CD33_34pos Sample_8 1001-1004 C1D1 PBMC CD33_34pos C1D1.PBMC.CD33_34pos Sample_9 1001-1006 C1D1 PBMC CD33_34pos C1D1.PBMC.CD33_34pos Sample_10 1001-1008 C1D1 PBMC CD33_34pos C1D1.PBMC.CD33_34pos Sample_11 1001-1010 C1D1 PBMC CD33_34pos C1D1.PBMC.CD33_34pos Sample_12 1001-1010 C1D2 PBMC CD33_34pos C1D2.PBMC.CD33_34pos Sample_13 1001-1012 C1D1 PBMC CD33_34pos C1D1.PBMC.CD33_34pos Sample_14 1001-1013 C1D2 PBMC CD33_34pos C1D2.PBMC.CD33_34pos Sample_15 1001-1013 C1D8 PBMC CD33_34pos C1D8.PBMC.CD33_34pos

In order to analyze differences between (1) C1D8 vs. C1D1 and (2) C1D8 vs C1D1, May I ask what is a good way to include the subject ID into the design matrix? I understand that there is only one single Source and Sample.Type here but the actual data set has different Sources and Sample Types

Currently I am trying the following:

Grp <- factor(sampleInfoAll$Group) Sub <- factor(sampleInfoAll$Subject)
design <- model.matrix(~0+Grp+Sub)

but I'm a little confused looking at the design matrix as to how to make contrasts..or shall I use something more like what is shown in the edgeR manual 3.5? Thanks so much again

Your design matrix seems to be correct. Each coefficient with a Grp name should represent the average log-expression of that group (technically of the first subject, but don't worry about that), while the Sub terms represent the log-fold change in expression of every other subject over the first subject. To form contrasts, simply use makeContrasts to compare the relevant levels of Grp, e.g., GrpC1D1.PBMC.CD33_34pos - GrpC1D2.PBMC.CD33_34pos.

Thank you very much!

Answer: edgeR in a situation with unbalanced design: to include only subjects with match
0
2.0 years ago by
Gordon Smyth38k
Walter and Eliza Hall Institute of Medical Research, Melbourne, Australia
Gordon Smyth38k wrote:

If you use edgeR to do the analysis, you must include Subject in the linear model -- see Aaron's answer.

Personally, I would use limma to analyse an experiment like this and would use duplicateCorrelation() to model the correlation between repeat observations from the same subject. limma is able to compare treatments even when they are not applied to the same subject, in other words it can combine both within subject and between subject comparisons.

The edgeR approach is more conservative. The limma approach recovers a lot of extra information from the between-subject comparisons. It depends on how conservative you want to be.