Question

edgeR contrast formulation question

0

Entering edit mode

mbio.kyle • 0

@mbiokyle-8186

Last seen 8.2 years ago

United States

I am working with edgeR to perform some differential expression work. EdgeR has been a dream to work with and I have quickly reached the point where I am trying to ask some more complicated questions.

My dataset consists of the following:

I have a human cell line, which is either infected or uninfected with a pathogen. I also have two different drug treatments (A and B for simplicity). Therefore I have 6 conditions (with triplicate reps). I have successfully made the simple comparisons like: what genes are changing when uninfected cells are treated with drug A, and what genes are changing when cells are infected with the pathogen?

My more complicated questions take the form:
What gene changes are induced when infected cells are treated with drug A vs when uninfected cells are treated with drug A?

Basically, the hypothesis is that the drugs affect the pathogens, causing them to act differently and we want to find the host gene changes caused by the changed pathogen behavior.

Here is what my samples/targets look like:

                            Pathogen    Treatment    Group
No_Pathogen-No_drug-rep1    negative    negative    negative.negative
No_Pathogen-No_drug-rep2    negative    negative    negative.negative
No_Pathogen-No_drug-rep3    negative    negative    negative.negative
No_Pathogen-Drug_A-rep1        negative    Drug_A    negative.Drug_A
No_Pathogen-Drug_A-rep2        negative    Drug_A    negative.Drug_A
No_Pathogen-Drug_A-rep3        negative    Drug_A    negative.Drug_A
No_Pathogen-Drug_B-rep1        negative    Drug_B    negative.Drug_B
No_Pathogen-Drug_B-rep2        negative    Drug_B    negative.Drug_B
No_Pathogen-Drug_B-rep3        negative    Drug_B    negative.Drug_B
Pathogen-No_drug-rep1        positive    negative    positive.negative
Pathogen-No_drug-rep2        positive    negative    positive.negative
Pathogen-No_drug-rep3        positive    negative    positive.negative
Pathogen-Drug_A-rep1        positive    Drug_A    positive.Drug_A
Pathogen-Drug_A-rep2        positive    Drug_A    positive.Drug_A
Pathogen-Drug_A-rep3        positive    Drug_A    positive.Drug_A
Pathogen-Drug_B-rep1        positive    Drug_B    positive.Drug_B
Pathogen-Drug_B-rep2        positive    Drug_B    positive.Drug_B
Pathogen-Drug_B-rep3        positive    Drug_B    positive.Drug_B

I am using the makeContrasts method to set up the comparisons. I am wondering specifically about something like this contrast:

Pathogen_induced_changes_with_drug_A = (positive.DrugA) - (negative.DrugA - negative.negative)

My intention with this contrast is to have the baseline be changes in pathogen negative cells when DrugA is added, and then compare that to changes in pathogen positive cells with DrugA to find the gene changes I mentioned above.

I am wondering if I shouldn't do this instead:

Pathogen_induced_changes_with_drug_A = (positive.DrugA - positive.negative) - (negative.DrugA - negative.negative)

To also remove changes which come from simply the presence of the pathogen.

I would really appreciate any tips / pointers on this.

Thanks!

edger rnaseq differential expression • 1.5k views

ADD COMMENT • link updated 8.6 years ago by Ryan C. Thompson ★ 7.9k • written 8.6 years ago by mbio.kyle • 0

score 2 · Answer 1 · 2015-09-30

2

Entering edit mode

Ryan C. Thompson ★ 7.9k

@ryan-c-thompson-5618

Last seen 8 months ago

Scripps Research, La Jolla, CA

First of all, remember that how you write your contrasts depends on how you have parametrized your design matrix. Based on the contrasts you are presenting, I'm assuming you used ~0 + Group, so that each coefficient represents the average logCPM of one group. If so, then your second suggestion is the correct one. You want to test the difference in response to the drug in infected vs uninfected samples, so this is a "contrast of contrasts" and is known as an interaction (between drug A and infection). However, interpreting the results will likely require you to compare to the results of other contrasts. For example, if gene X has a positive logFC in this interaction contrast, it could be because the gene is unchanged in uninfected samples but goes up in infected samples with drug A treatment, or it could be that it is unchanged in infected samples and goes down in uninfected samples, or any number of other possibilities. So for genes with significant interactions, you have to examine the individual contrasts to understand why the interaction is significant.

ADD COMMENT • link 8.6 years ago Ryan C. Thompson ★ 7.9k

0

Entering edit mode

Hello Ryan,

Thanks for this response. I am not specifying my design using the ~0 + Group format. I am going to just include my whole R script. I am not opposed to digging into the results manually but I would like to clarify if what I am doing initially is even correct.

library("edgeR")

countFile <- read.delim("counts.tsv", row.names="Gene")
targets <- read.table("targets.tsv")
group <- factor(paste(targets$Pathogen,targets$Treatment,sep="."))
cbind(targets, Group=group)
design <- model.matrix(~group)
colnames(design) <- levels(group)
baseDGEList <- DGEList(counts=countFile, group=group)

# filter out low level expressed genes
dim(baseDGEList)
keep <- rowSums(cpm(baseDGEList) > 1) >= 2
baseDGEList <- baseDGEList[keep,]
dim(baseDGEList)

# update sizes and apply TMM norm
baseDGEList$samples$lib.size <- colSums(baseDGEList$counts)
baseDGEList <- calcNormFactors(baseDGEList)

# estimate CommonDisp for GLM
baseDGEList <- estimateGLMCommonDisp(baseDGEList, design)

# last make the linear fit
fit <- glmFit(baseDGEList, design)

# lets try
my.contrasts <- makeContrasts(

    # simple comparisons #

    # pathogen (the first positive)
    pathogen_difference = positive.negative - negative.negative,

    # treatments
    a_difference = negative.A - negative.negative,
    b_difference = negative.B - negative.negative,

    # complicated
    A_vs_B = (negative.A - negative.negative) - (negative.B - negative.negative),
    pathogen_A_vs_B = (positive.A - positive.negative) - (positive.B - positive.negative),
    levels=design
)

# get results
path <- glmLRT(fit, contrast=my.contrasts[,"pathogen_difference"])
a <- glmLRT(fit, contrast=my.contrasts[,"a_difference"])
b <- glmLRT(fit, contrast=my.contrasts[,"b_difference"])
a_b <- glmLRT(fit, contrast=my.contrasts[,"A_vs_B"])
p_a_b <- glmLRT(fit, contrast=my.contrasts[,"pathogen_A_vs_B"])

For A_vs_B comparisons, are those valid (and just require some cross checking) or should I only make simple comparisons and merge everything manually?
If I am not making sense, I apologise.

Thanks again for all your help

ADD REPLY • link 8.6 years ago mbio.kyle • 0

1

Entering edit mode

You have made a mistake here:

design <- model.matrix(~group)
colnames(design) <- levels(group)

Your design has an intercept as the first column and fold changes as the other 5, but you are naming them as if they all had the same meaning. (Look at your design matrix and note that the first column looks different from the rest.) This is causing you to misinterpret the meaning of the coefficients and is the reason for your confusion. I strongly recommend you use a no-intercept design (i.e. ~0 + Group) as I described earlier. If you use a no-intercept model, the column names you are assigning to the design matrix will match the meaning of those columns and the contrasts you have formulated in your code should all be correct. (The edgeR user's guide has a section on formulating a model with no intercept.)

Furthermore, the last two contrasts in your code, labeled "Complicated", are not actually interaction contrasts. For example, note that the - negative.negative part cancels out of both terms in the first contrasts, leaving you with just negative.A - negative.B. The same is true for the next one. An interaction contrast involves 4 distinct groups, not 3. (The contrasts you have written are still valid, since the extra parts just cancel out.)

Also, unrelated to your contrast formulation, your choice to use common dispersion is a bit odd, unless you're working with a very small number of genes. If you're working with RNA-seq or similar genome-scale data, I would recommend you use estimateDisp, glmQLFit, and glmQLFTest instead to run the standard quasi-likelihood method.

ADD REPLY • link 8.6 years ago Ryan C. Thompson ★ 7.9k