Question

Differential expression analysis - additive genetic model (how to implement it)

0

Entering edit mode

jlerga • 0

@jlerga-9341

Last seen 8.9 years ago

Hello everyone:

I have a question related to the analysis of those genes that are differentially expressed depending on the genotype, employing different programs such as DESeq(2) or edgeR; specifically, my question is about how I can implement an additive genetic model for these analyses.

I read the manuals of DESeq and edgeR, and I understand pretty well how to make simple comparisons, but I did not figure out how to apply an additive model.

General idea of the experiment

I am studying a specific human polymorphic inversion (that is, a segment of DNA that can be inverted in some individuals). An inversion is, therefore, like an allele, and each individual could be A/A (inverted allele/inverted allele), A/B (inverted/standard), B/B (standard/standard). So, inversions can be treated simply as SNPs or other type of alleles.

I have the genotype of several individuals and also their RNA-seq data. Therefore, I have two tables:

* A table of counts: the expression counts of each gene in each individual

	Individual A	Individual B	Individual C
Gene A	4	50	30
Gene B	1	24	24
Gene C	34	51	27

... (with more genes and more individuals)

* The design:

*Individuals*	*Population*	*Sex*	*Genotype*
Individual A	European	Male	A/A
Individual B	African	Female	A/B
Individual C	African	Male	B/B

... (with many more individuals)

Objective

We want to discover those genes that are differentially expressed depending on the genotype of the inversion.

Problem: Should I use an additive model?

I guess that to determine the appropriate test for association, we must first specify a genetic model. Currently, it is quite established in the scientific community that an additive model is the most correct model that should be used (not dominance, recessive or general model), which impose a structure in which each additional copy of the variant allele increases the response by the same amount.

In other words, if there are two alleles at a gene locus, then the additive effect (A/B) is half of the the difference between the mean of all cases that are homozygous for one version of the allele (A/A) compared to the mean of all cases that are homozygous for the other allele (B/B):

Genotypes	A/A	A/B	B/B
Effect?	0	1	2

I read that a flexible way of carrying out such tests is by use of logistic regression. I do not know if I can carry out a logistic regression with the counts of a specific gene as the outcome variable and genotype as an explanatory variable. Anyway, I do not how to implement this in a particular program such as DESeq or edgeR.

Confounding factors

I have individuals from three populations and two genders. Therefore, I should correct the analysis for these important confounding factors, shouldn't I? I will put these factors in the design formula, this is clear.

R code (DESeq and edgeR)

I put here the R code I used; however, I did three comparations (A/A vs A/B, A/B vs B/B, and B/B vs A/A), which is not exactly what I wanted.

I guess that the important comparison should be A/A vs B/B, because if we assume an additive model, the greatest differences should be captured in this comparison. However, if we use a real additive model (with your help) maybe it should be more sensible and I will have more power to capture real differentially expressed genes.

I hope you can help me!
Many many thanks in advance!!!!

DESeq code

pairs<-combn(levels(design$Condition),2)

dds<-DESeqDataSetFromMatrix(countData=counts,
colData=design,
design=~Pop+Sex+Condition)

dds<-dds[rowSums(counts(dds))>1, ]

dds<-DESeq(dds)

list_comp_deseq<-list()
for(s in 1:length(pairs[1,])){
first_c<-pairs[1,s]
second_c<-pairs[2,s]
res<-results(dds,contrast=c("Condition",first_c,second_c))
res<-res[!is.na(res$padj),]
res<-res[order(res$padj),]
list_comp_deseq[[s]]<-res
write.table(res, file=paste("DESeq_",first_c,"_",second_c,sep=""))
}

save(dds,file="dds")

edgeR code

y<-DGEList(counts,group=design$Condition)

keep <- rowSums(cpm(y)>1) >= 20
y <- y[keep, , keep.lib.sizes=FALSE]

y<-calcNormFactors(y)
design_edge<-model.matrix(~Sex+Pop+Condition,design)

y<-estimateDisp(y,design_edge)

fit<-glmFit(y,design_edge)

numbers<-grep("Condition",colnames(fit$coefficients))
pairs_edge<-list()
counter<-1
for(u in numbers[1]:numbers[length(numbers)]){
cond<-substr(colnames(fit$coefficients)[u],10,12)
cond<-c(cond,u)
pairs_edge[[counter]]<-cond
counter<-counter+1
}
names_edge<-names(table(design$Condition))

if(length(names_edge)==2){
if(names_edge[1]==pairs_edge[[1]][1]){
    intercept<-names_edge[2]
}else{
    intercept<-names_edge[1]
}
}else{
if(names_edge[1]!=pairs_edge[[1]][1] && names_edge[1]!=pairs_edge[[2]][1]){
    intercept<-names_edge[1]

}else if(names_edge[2]!=pairs_edge[[1]][1] && names_edge[2]!=pairs_edge[[2]][1]){
    intercept<-names_edge[2]

}else{
    intercept<-names_edge[3]

}
}

list_comp_edge<-list()
for(s in 1:length(pairs_edge)){

lrt<- glmLRT(fit, coef=as.numeric(pairs_edge[[s]][2]))
write.table(topTags(lrt,n=50000), file=paste("edgeR_",pairs_edge[[s]][1],"_",intercept,sep=""))
list_comp_edge[[s]]<-topTags(lrt,n=50000)

if(s==2){
    vector<-rep(0,as.numeric(pairs_edge[[2]][2]))
    vector[as.numeric(pairs_edge[[1]][2])]<-(-1)
    vector[as.numeric(pairs_edge[[2]][2])]<-(1)
    lrt<- glmLRT(fit, contrast=vector)
    write.table(topTags(lrt,n=50000), file=paste("edgeR_",pairs_edge[[1]][1],"_",pairs_edge[[2]][1],sep=""))
    list_comp_edge[[3]]<-topTags(lrt,n=50000)
}
}

save(fit,file="fit")

Thanks in advance again!!!!!!!!

Jon

deseq2 edger differential gene expression additive model • 3.6k views

ADD COMMENT • link 9.0 years ago • updated 8.9 years ago jlerga • 0

0

Entering edit mode

jlerga • 0

@jlerga-9341

Last seen 8.9 years ago

What about: (in edgeR)?

> my.contrast <- makeContrasts((AA+(1/2)*AB)-(BB+(1/2)*AB), levels=design)
> lrt <- glmLRT(fit, contrast=my.contrast)

??

ADD COMMENT • link 8.9 years ago jlerga • 0

0

Entering edit mode

This won't do anything, as the AB terms will cancel out:

(AA+(1/2)*AB)-(BB+(1/2)*AB) = AA + (1/2)*AB - BB - (1/2)*AB = AA - BB

ADD REPLY • link 8.9 years ago Aaron Lun ★ 28k

score 5 · Accepted Answer · 2015-12-09

Currently, it is quite established in the scientific community that an additive model is the most correct model that should be used (not dominance, recessive or general model).

Really? I can easily imagine situations where gene expression will not exhibit a linear response to dosage, so an additive model with respect to your genotype would do rather poorly. In fact, it's arguable that genes without a non-linear responses are probably more relevant, as they are probably being subjected to interesting regulatory processes. You'd overlook them if you forced everyone into an additive set-up.

I would stick with what you've already got, where you set up a design matrix with one coefficient per genotype. This is more flexible than an additive model; not just for DE testing, but also in dispersion estimation, where an overly restrictive model will result in inflated estimates of the dispersion and loss of DE detection power.

I have individuals from three populations and two genders. Therefore, I should correct the analysis for these important confounding factors, shouldn't I?

Well, perhaps you should look at the MDS or PCA plot and see if the samples are segregating according to population and sex. If not, then you don't need to do anything, but if they are, then you should block on them in the design matrix as you've already done:

design_edge<-model.matrix(~Sex+Pop+Condition,design)

Note that blocking in the design only works if your genotypes are mixed across the confounding factors. If, for example, all A/A samples are male, then you won't be able to distinguish between the effect of sex and that of the A/A genotype.

I put here the R code I used; however, I did three comparations (A/A vs A/B, A/B vs B/B, and B/B vs A/A), which is not exactly what I wanted.

You've got too much code and it's difficult to read, so I'm just going to use your three-sample example table above to demonstrate. Assume that the table is stored in design. We do some editing to make it nicer, and then we make our design matrix:

design$Genotype <- sub("/", "", design$Genotype)
design.edge <- model.matrix(~0 + Population + Sex + Genotype, design)

Running colnames(design.edge) gives us:

[1] "PopulationAfrican"  "PopulationEuropean" "SexMale"           
[4] "GenotypeAB"         "GenotypeBB"

The first three coefficients represent the blocking factors for the confounding variables, and can be ignored for the sake of simplicity in interpretation. The last two coefficients represent the log-fold change of AB over AA and BB over AA, respectively. If you want to test for AB vs AA or BB vs AA, you can do that with coef=4 or coef=5 to drop the respective coefficients in glmLRT. If you want to test for BB vs AB, you can do that by setting the contrast argument in glmLRT to:

con.bbvab <- makeContrasts(GenotypeBB-GenotypeAB, levels=design.edge)

If you want to test for any DE across the genotypes, you can do that with an ANOVA-like test by specifying coef=4:5. This is probably the most general test that will give you all genes that respond to changes in genotype. If you want to test for non-additive changes in expression with respect to the genotype (i.e., allele A dosage), you need to do something like:

con.nonadd <- makeContrasts(GenotypeBB/2 - GenotypeAB, levels=design.edge)

This asks if the AB vs. AA log-fold change is equal to half the BB vs. AA log-fold change, which is what you would expect under an additive model. Conversely, you can informally identify genes that exhibit additive effects by looking for those genes that are not significant under con.nonadd, but are still significant for the individual pairwise comparisons between genotypes.

P.S. Note that n=Inf will retrieve all genes in topTags. This is safer than using an arbitrarily large number - in your case, you use 50000, but this would fail to return everything if you actually had more than 50000 genes.

P.P.S. Note that design.edge has more coefficients than samples, such that the former values cannot be estimated. Hopefully, your real data set has more samples so that this is not a problem. For future reference, it is better to post an example that is big enough to represent what you're working with. For the majority of cases (and especially in this case), a table with three samples is too small.