Question

Best way to design a contrast matrix for analysis of dataset

0

Entering edit mode

Nithisha ▴ 10

@nithisha-14272

Last seen 6.2 years ago

Hi all,

I have this data set https://www.ncbi.nlm.nih.gov/geo/query/acc.cgi?acc=GSE92776, in which there are 337 blood samples from patients. This is from a Lupus study and so Group A consists of 41 patients who were waned off their Standard of Care (SoC) medications till a flare (aggravation of SLE) occurs, Group B consists of 61 SLE patients who provide a normal blood sample only, and Group C consists of 55 healthy volunteers.

In a nutshell, the un-normalized gene counts provided are as follows:

The 337 samples are organized as columns while genes are organized as row names.

The sample columns from Group A consist of a Baseline profile (column), an improving/other profile and they all end with a Flare profile.

The sample columns from Group B consist of 1 Baseline Profile from each patient only.

The sample columns from Group C consist of 2 profiles from each volunteer over a period of time.

Therefore columns can look like this:

If my objective was to find out what genes are enriched in SLE patients as compared to healthy individuals (baseline/flare conditions), what is the best way to do this?

I stored all the additional information regarding each sample (Group_ID, Donor_ID etc.) as phenoData when creating an expression set object in R for Limma DEG analysis.

My countData looks like this:

              B0100V1.BASELINE B0101V1.BASELINE B0102V1.BASELINE 
ABCA1         26.73609         27.60309         28.08309               
ACP1          28.38355         27.44355         26.71755

and my phenoData looks like this (part of it):

                  patient_and_visit     profile           donor_id       group_id 
B0100V1.BASELINE        100             Baseline            100            B             
B0101V1.BASELINE        101             Baseline            101            B

So far, I have been thinking that that perhaps I should compare Group B patients (Baseline) to Group C (healthy):

groupBC <- es[,es$group_id == 'B'|es$group_id == 'C'] unique(groupBC$profile) design <- model.matrix(~0+groupBC$profile) colnames(design) <- c('Baseline', 'Flare', 'Healthy', 'Improving', 'Other') fit <- lmFit(groupBC,design) contrast.matrix <- makeContrasts(Baseline-Healthy, levels=design) fit2 <- contrasts.fit(fit,contrast.matrix) fit2 <- eBayes(fit2)

#compare Baseline-Healthy results<-topTable(fit2, coef=1, adjust='BH', n=Inf)

Would you all have better suggestions of creating contrast matrices to reach the objective I want? Would I be able to compare the gene profiles per patient in Group A for different conditions (Baseline, Improving, Flare etc.) ?

I apologize for the long question, but any advice would be much appreciated!

Limma • 1.7k views

ADD COMMENT • link 6.2 years ago Nithisha ▴ 10

0

Entering edit mode

Nithisha ▴ 10

@nithisha-14272

Last seen 6.2 years ago

Hi Aaron,

Thank you so much for pointing this out to me, it is very helpful indeed! However, when I go back to the Limma manual, it says "Since we need to make comparisons both within and between subjects, it is necessary to treat Patient as a random effect" on page 49 and I was wondering if you could find an easier explanation for this. Also, could you advise on what block=es$donor_id and correlation=dupcor$consensus are used for in the command fit?

Once I do:

groups <- factor(paste0(es$profile, ".", es$group_id))

and then :

design <- model.matrix(~0 + groups)

I get the following error:

Error in model.frame.default(object, data, xlev = xlev) : 
  variable lengths differ (found for 'groups')

These are the levels I have for groups:

> unique(groups)
Levels: BASELINE.A BASELINE.B FLARE.A HEALTHY.C IMPROVING.A OTHER.A

Could you advise me on what I might be doing wrong? Thanks once again.

ADD COMMENT • link 6.2 years ago Nithisha ▴ 10

0

Entering edit mode

Please respond to existing answers using the "Add comment" button, unless you're answering your own question.

Anyway: the results you've shown don't make much sense to me.

I don't understand how you managed to get a "variable lengths differ" error when you only have one variable (groups) in your model.matrix call. I can only suggest making sure that groups is non-empty and contains no NA values, though these shouldn't be the cause of the problem.
Calling unique on a factor should give a vector of unique levels, as well as a listing of Levels. Why are you only getting the Levels listing? Are you copy-pasting the output correctly?

> a <- factor(LETTERS, level=LETTERS)
> unique(a)
 [1] A B C D E F G H I J K L M N O P Q R S T U V W X Y Z
Levels: A B C D E F G H I J K L M N O P Q R S T U V W X Y Z

ADD REPLY • link 6.2 years ago Aaron Lun ★ 28k

0

Entering edit mode

Hi Aaron,

I am not sure why the error pops up.

I first entered the command:

groups <- factor(paste0(es$profile,'.',es$group_id))'

which gives me:

> groups
  [1] BASELINE.B  BASELINE.B  BASELINE.B  BASELINE.B  BASELINE.B  BASELINE.B  BASELINE.B  BASELINE.B
  [9] BASELINE.B  BASELINE.B  BASELINE.A  OTHER.A     OTHER.A     IMPROVING.A FLARE.A     BASELINE.A
[17] OTHER.A     IMPROVING.A FLARE.A     BASELINE.A  OTHER.A     OTHER.A     IMPROVING.A FLARE.A   
[25] BASELINE.B  BASELINE.A  OTHER.A     IMPROVING.A FLARE.A     BASELINE.A  OTHER.A     OTHER.A

and then:

design <- model.matrix(0~ + groups)

which gives the same error:

Error in model.frame.default(object, data, xlev = xlev) : 
  variable lengths differ (found for 'groups')

Just to check, when I enter:

> unique(groups)
[1] BASELINE.B BASELINE.A OTHER.A IMPROVING.A FLARE.A HEALTHY.C

Levels: BASELINE.A BASELINE.B FLARE.A HEALTHY.C IMPROVING.A OTHER.A

There do not seem to be any NA values in groups.

ADD REPLY • link 6.2 years ago Nithisha ▴ 10

0

Entering edit mode

You should be doing ~ 0 + groups.

ADD REPLY • link 6.2 years ago Aaron Lun ★ 28k

0

Entering edit mode

Hi Aaron, I am so sorry about this oversight, I did not realize. Thanks for your patience!

ADD REPLY • link 6.2 years ago Nithisha ▴ 10

score 3 · Accepted Answer · 2018-01-27

Instead of slicing up your data set for every comparison of interest, you can do:

groups <- paste0(es$profile, ".", es$group_id)
design <- model.matrix(~0 + groups)
dupcor <- duplicateCorrelation(es, design, batch=es$donor_id)
fit <- lmFit(es, design, block=es$donor_id, 
    correlation=dupcor$consensus)

You can then set up your contrasts as you would normally do:

makeContrasts(groupsBaseline.B - groupsHealthy.C, levels=design) # etc.

We use duplicateCorrelation as this accounts for the correlations between samples from the same individual in groups A and C. Otherwise, the precision of our estimates would be overstated if we assumed all samples to be independent.

Of course, manually subsetting your dataset will also eliminate the correlations, provided only one sample from each patient is in the subset. The advantage of the duplicateCorrelation approach is that we don't need to slice our dataset for each comparison, which is (i) easier for the analyst to handle, especially if you have a lot of comparisons to perform; and (ii) allows us to share information across many samples to improve our estimate of the gene-specific sample variance.