Hello everyone:
I have a question related to the analysis of those genes that are differentially expressed depending on the genotype, employing different programs such as DESeq(2) or edgeR; specifically, my question is about how I can implement an additive genetic model for these analyses.
I read the manuals of DESeq and edgeR, and I understand pretty well how to make simple comparisons, but I did not figure out how to apply an additive model.
General idea of the experiment
I am studying a specific human polymorphic inversion (that is, a segment of DNA that can be inverted in some individuals). An inversion is, therefore, like an allele, and each individual could be A/A (inverted allele/inverted allele), A/B (inverted/standard), B/B (standard/standard). So, inversions can be treated simply as SNPs or other type of alleles.
I have the genotype of several individuals and also their RNA-seq data. Therefore, I have two tables:
* A table of counts: the expression counts of each gene in each individual
Individual A | Individual B | Individual C | |
Gene A | 4 | 50 | 30 |
Gene B | 1 | 24 | 24 |
Gene C | 34 | 51 | 27 |
... (with more genes and more individuals)
* The design:
Individuals | Population | Sex | Genotype |
Individual A | European | Male | A/A |
Individual B | African | Female | A/B |
Individual C | African | Male | B/B |
... (with many more individuals)
Objective
We want to discover those genes that are differentially expressed depending on the genotype of the inversion.
Problem: Should I use an additive model?
I guess that to determine the appropriate test for association, we must first specify a genetic model. Currently, it is quite established in the scientific community that an additive model is the most correct model that should be used (not dominance, recessive or general model), which impose a structure in which each additional copy of the variant allele increases the response by the same amount.
In other words, if there are two alleles at a gene locus, then the additive effect (A/B) is half of the the difference between the mean of all cases that are homozygous for one version of the allele (A/A) compared to the mean of all cases that are homozygous for the other allele (B/B):
Genotypes | A/A | A/B | B/B |
Effect? | 0 | 1 | 2 |
I read that a flexible way of carrying out such tests is by use of logistic regression. I do not know if I can carry out a logistic regression with the counts of a specific gene as the outcome variable and genotype as an explanatory variable. Anyway, I do not how to implement this in a particular program such as DESeq or edgeR.
Confounding factors
I have individuals from three populations and two genders. Therefore, I should correct the analysis for these important confounding factors, shouldn't I? I will put these factors in the design formula, this is clear.
R code (DESeq and edgeR)
I put here the R code I used; however, I did three comparations (A/A vs A/B, A/B vs B/B, and B/B vs A/A), which is not exactly what I wanted.
I guess that the important comparison should be A/A vs B/B, because if we assume an additive model, the greatest differences should be captured in this comparison. However, if we use a real additive model (with your help) maybe it should be more sensible and I will have more power to capture real differentially expressed genes.
I hope you can help me!
Many many thanks in advance!!!!
DESeq code
pairs<-combn(levels(design$Condition),2)
dds<-DESeqDataSetFromMatrix(countData=counts,
colData=design,
design=~Pop+Sex+Condition)
dds<-dds[rowSums(counts(dds))>1, ]
dds<-DESeq(dds)
list_comp_deseq<-list()
for(s in 1:length(pairs[1,])){
first_c<-pairs[1,s]
second_c<-pairs[2,s]
res<-results(dds,contrast=c("Condition",first_c,second_c))
res<-res[!is.na(res$padj),]
res<-res[order(res$padj),]
list_comp_deseq[[s]]<-res
write.table(res, file=paste("DESeq_",first_c,"_",second_c,sep=""))
}
save(dds,file="dds")
edgeR code
y<-DGEList(counts,group=design$Condition)
keep <- rowSums(cpm(y)>1) >= 20
y <- y[keep, , keep.lib.sizes=FALSE]
y<-calcNormFactors(y)
design_edge<-model.matrix(~Sex+Pop+Condition,design)
y<-estimateDisp(y,design_edge)
fit<-glmFit(y,design_edge)
numbers<-grep("Condition",colnames(fit$coefficients))
pairs_edge<-list()
counter<-1
for(u in numbers[1]:numbers[length(numbers)]){
cond<-substr(colnames(fit$coefficients)[u],10,12)
cond<-c(cond,u)
pairs_edge[[counter]]<-cond
counter<-counter+1
}
names_edge<-names(table(design$Condition))
if(length(names_edge)==2){
if(names_edge[1]==pairs_edge[[1]][1]){
intercept<-names_edge[2]
}else{
intercept<-names_edge[1]
}
}else{
if(names_edge[1]!=pairs_edge[[1]][1] && names_edge[1]!=pairs_edge[[2]][1]){
intercept<-names_edge[1]
}else if(names_edge[2]!=pairs_edge[[1]][1] && names_edge[2]!=pairs_edge[[2]][1]){
intercept<-names_edge[2]
}else{
intercept<-names_edge[3]
}
}
list_comp_edge<-list()
for(s in 1:length(pairs_edge)){
lrt<- glmLRT(fit, coef=as.numeric(pairs_edge[[s]][2]))
write.table(topTags(lrt,n=50000), file=paste("edgeR_",pairs_edge[[s]][1],"_",intercept,sep=""))
list_comp_edge[[s]]<-topTags(lrt,n=50000)
if(s==2){
vector<-rep(0,as.numeric(pairs_edge[[2]][2]))
vector[as.numeric(pairs_edge[[1]][2])]<-(-1)
vector[as.numeric(pairs_edge[[2]][2])]<-(1)
lrt<- glmLRT(fit, contrast=vector)
write.table(topTags(lrt,n=50000), file=paste("edgeR_",pairs_edge[[1]][1],"_",pairs_edge[[2]][1],sep=""))
list_comp_edge[[3]]<-topTags(lrt,n=50000)
}
}
save(fit,file="fit")
Thanks in advance again!!!!!!!!
Jon
Wow, many many thanks, Aaron!!
I am sorry for the delay. I have been doing many other things and I completely forgot this issue.
I like your answer; however, I have some other ideas. The thing is, I understand that not all the models have to be additive. And the contrast AA vs BB should be the appropriate one if I want to obtain those genes that are differentially expressed, if I am assuming that AB genotype has an intermediate effect on gene expression.
Nonetheless, when I compare AA vs BB, I think I do not have complete power, because the major part of the samples are AB. When I asked this question, I was looking for a method that take the power of these AB samples in the comparison of AA vs BB in order to obtain those genes that are truly differentially expressed (I thought that an additive model was OK). I do not know if I am explaining this idea correctly.
In addition, I read somewhere that I should use a glm function (generalized linear model) if I would like to compare the three genotypes. Does it make sense?
If you have more ideas, I would be very grateful!!
Well, if you fit an additive model and it doesn't fit well for the AB samples (e.g., because of some heterozygote effect that results in non-intermediate behaviour, haplo(in)sufficiency such that expression in the heterozygote is equivalent to that of one of the the homozygotes, etc.), then you'll end up inflating your dispersion estimates. This will result in loss of power for these genes. Moreover, the EB approach shares information between genes, so any increase in the dispersions for the affected genes will propagate to all other genes in the data set. This will end up reducing DE detection power for all genes.
In general, it's not unreasonable to expect that the heterozygote has intermediate behaviour between the two homozygous extremes. However, to try to take advantage of this, you need to make some quantitative statements about how "intermediate" it is expected to be. This would require some fairly strong assumptions; for example, would the expression of a gene in the AB samples be equal to the average of the expressions in AA and BB? There's no guarantee (and in fact, it's probably unlikely) that changing the dosage of alleles between genotypes leads to directionally proportional effects in their regulatory targets.
Finally, I don't know why you're using
glm
, you should useglmFit
if you're using edgeR.In summary, I should make pairwise comparisons between genotypes (AAvsBB, ABvsAA and ABvsBB), shouldn't I? And also, it could be interesting make comparisons like (AA+AB)vsBB, (AA+BB)vsAB and (BB+AB)vsAA in order to see if there is dominance, co-dominance or it is a recessive model, isn't it?
However, I am still quite worried about the AAvsBB comparison, because the major part of the samples have the AB genotype, and it is like.. maybe I could have more DE detection power if there is a method that can use these AB samples in the AAvsBB comparison...
Well, I really appreciate your advice and your knowledge. Many thanks!