Allele Specific Expression Using edgeR
2
1
Entering edit mode
ugwu.nelson ▴ 10
@ugwunelson-10134
Last seen 6.0 years ago

I have 2 samples each in two genotype groups: wild-type (WT) and heterozygous (HET). And for each gene per sample, I have obtained the raw read counts for two alleles (allele 1 and allele 2)

sample      genotype   allele

individual 1    WT    allele 1 count

individual 1    WT    allele 2 count

individual 2    WT    allele 1 count

individual 2    WT    allele 2 count

individual 3    HET    allele 1 count

individual 3    HET    allele 2 count

individual 4    HET    allele 1 count

individual 4    HET    allele 2 count

I would like to know whether the relative abundance of allele 1:allele 2 differs between genotype groups. That is, does  (allele 1)/(allele 1 + allele 2) differ between genotype groups? Comparing the raw read count of allele 1 between the WT and HET groups is not ideal. The idea to look at the relative abundance of allele 1 to allele 2 within an individual, for all the individuals in one genotype group and then compare that with the other genotype group.

Here is my code so far:

library(edgeR)

rawcounts <- read.csv("file:///input.csv", row.names = "gene")

group <- c(1,1,1,1,2,2,2,2)
y <- DGEList(counts = rawcounts, norm.factors=rep(c(1),8), group=group)

genotype <- factor(c("WT","WT", "WT", "WT","HET","HET", "HET","HET"))

individual <- factor(c(1,1,2,2,3,3,4,4))

allele <- factor(rep(c("allele1", "allele2"),4))
targets <- data.frame(genotype,individual,allele)
targets

individual <- gl(2,2,length = 8)
genotype <- factor(targets$genotype, levels=c("WT","HET")) allele <- factor(targets$allele, levels=c("allele1","allele2"))
targets

design <- model.matrix(~individual:genotype+genotype:allele)
colnames(design)
y <- estimateDisp(y,design)

edger allelicimbalance • 1.4k views
2
Entering edit mode
Aaron Lun ★ 27k
@alun
Last seen 1 hour ago
The city by the bay

Consider the abundance of alleles 1 and 2 in the wild-type as W1 and W2, and the abundance in the heterozygote as H1 and H2. (It's not clear to me how you can get counts for two different alleles in a WT genotype, which I would have thought to be homozygous - but anyway, moving on.) Under the null hypothesis, the relative abundances of the two alleles are the same between genotypes, i.e.,

W1/(W1 + W2) = H1/(H1 + H2)

If you do a bit of arithmetic, you'll find that this is equivalent to:

W1/W2 = H1/H2

This is a lot more amenable for the log-link model that edgeR uses. We then go and set up our design matrix:

genotype <- factor(c("WT","WT", "WT", "WT","HET","HET", "HET","HET"))
individual <- factor(c(1,1,2,2,3,3,4,4))
allele <- factor(rep(c("allele1", "allele2"),4))
design <- model.matrix(~0 + individual + genotype:allele)
design <- design[,-(5:6)] # to get to full rank.


We block on individuals to account for any (uninteresting) changes in overall expression between individuals, while also allowing for (interesting) genotype-specific differences between alleles. This is better explained by looking at colnames(design):

[1] "individual1"               "individual2"
[3] "individual3"               "individual4"
[5] "genotypeHET:alleleallele2" "genotypeWT:alleleallele2"

The first four coefficients represent the abundance of allele 1 in each individual. The last two coefficients represent the log-fold change in abundance of allele 2 over allele 1 in the heterozygote or wild-type genotypes. If you set up your contrast vector to compare the two coefficients to each other (you can use makeContrasts for this, if you rename colnames(design) to get rid of :):

con <- c(0,0,0,0,1,-1)

... you can test for differences in W2/W1 and H2/H1 in glmLRT. This is basically the same as testing for differences in relative abundance as you've requested, while avoiding the need to add things together in the denominator.

0
Entering edit mode

Hi Aaron, I hope it is suitable to expand on this experimental design a bit.

I have a study design similar to this, but in addition to contrasting ASE in different groups, I would like to test if the genotype has a significant effect on the degree of ASE across all individuals. So I would like to compare model differences of ASE across individuals where the genotype is included and omitted, respectively.

I this example, I can't just drop the genotype:allele term, because then the model does not test for ASE. Does this mean I have to create a new model like the following:

~0 + individual + allele

and then use that as the reduced matrix for comparison against ~0 + individual + allele:genotype?

Kind regards,

Stephan

1
Entering edit mode

I don't quite understand how your question differs from the setup in the original post. When you're talking about groups, are you referring to your genotypes? If so, and you want to determine whether genotype has an effect on ASE, then you should just use the approach I've described above. After all, if genotype doesn't have any effect, the genotype-specific log-fold changes between alleles should be the same between genotypes. If you have more than two genotypes, the same reasoning can be applied to perform an ANODEV to test for any effect of genotype. Just set up a contrast matrix that compares one of the allele:genotype terms (doesn't matter exactly which one) to every other term.

0
Entering edit mode

Hi Aaron,

Thank you for your rapid response and for filling in the gaps to my vague question. The reason why I was asking was indeed because I have four genotypes, not two. I suspected that this was the case, but just wanted to make sure. Thanks so much!

0
Entering edit mode

Hi Aaron, last question on this (I really appreciate you coming back to this old thread):

I have four genotypes, with varying heterozygosity - the same gene may have 5 SNPs in one genotype, but 10 SNPs in another. My gene counts are composed of adding phased allelic counts across multiple SNPs together, so naturally genes with more SNPs will have a higher read count. Can I still compare allelic ratios across multiple genotypes with the formula provided above? Would the individual term still account for these differences, or would you suggest adding an extra term genotype term to account for this?

I realise its easier to conduct the analysis on SNP level, but I would like to obtain estimates of ASE differences due to genotype on the gene-level.

1
Entering edit mode

Systematic differences in counts are handled by individual. Indeed, genotype is nested within individual, so adding the former won't have any effect if the latter is already there. Fewer counts for particular genotypes will obviously reduce power to detect differences but it will not invalidate any contrasts performed within this framework. (Obviously, if you do something outside of the contrasts, like comparing the number of DE genes across genotypes, then differences in power become important.)

0
Entering edit mode
ugwu.nelson ▴ 10
@ugwunelson-10134
Last seen 6.0 years ago

Thanks for the prompt and detailed response!

I have two different alleles because we crossed two different inbred mouse strains and collected RNA from their F1 for RNA-Seq. Some of these F1s have a mutation in one copy of a gene (HET) while others do not (WT). I should have stated this earlier.

I actually have twelve samples per genotype group and here is the code I used (just in case someone has the same question in the future):

rawcounts <- read.csv("inputFile.csv", row.names = "geneids")
group <- rep(c(1,2), each = 24)
y <- DGEList(counts = rawcounts, norm.factors=rep(c(1),48), group=group)

genotype <- factor(rep(c("WT", "HET"), each = 24))

individual <- factor(rep(c(1:24), each = 2))
allele <- factor(rep(c("allele1", "allele2"),24))
targets <- data.frame(genotype,individual,allele)
targets

genotype individual allele
1        WT          1    allele1
2        WT          1    allele2
3        WT          2    allele1
4        WT          2    allele2
5        WT          3    allele1
6        WT          3    allele2
7        WT          4    allele1
8        WT          4    allele2
9        WT          5    allele1
10       WT          5    allele2
11       WT          6    allele1
12       WT          6    allele2
13       WT          7    allele1
14       WT          7    allele2
15       WT          8    allele1
16       WT          8    allele2
17       WT          9    allele1
18       WT          9    allele2
19       WT         10    allele1
20       WT         10    allele2
21       WT         11    allele1
22       WT         11    allele2
23       WT         12    allele1
24       WT         12    allele2
25      HET         13    allele1
26      HET         13    allele2
27      HET         14    allele1
28      HET         14    allele2
29      HET         15    allele1
30      HET         15    allele2
31      HET         16    allele1
32      HET         16    allele2
33      HET         17    allele1
34      HET         17    allele2
35      HET         18    allele1
36      HET         18    allele2
37      HET         19    allele1
38      HET         19    allele2
39      HET         20    allele1
40      HET         20    allele2
41      HET         21    allele1
42      HET         21    allele2
43      HET         22    allele1
44      HET         22    allele2
45      HET         23    allele1
46      HET         23    allele2
47      HET         24    allele1
48      HET         24    allele2

design <- model.matrix(~0 + individual + genotype:allele)
design <- design[,-(25:26)] # to get to full rank.

colnames(design)
[1] "individual1"           "individual2"           "individual3"           "individual4"           "individual5"           "individual6"           "individual7"
[8] "individual8"           "individual9"           "individual10"          "individual11"          "individual12"          "individual13"          "individual14"
[15] "individual15"          "individual16"          "individual17"          "individual18"          "individual19"          "individual20"          "individual21"
[22] "individual22"          "individual23"          "individual24"          "genotypeHET:alleleallele2" "genotypeWT:alleleallele2"

y <- estimateDisp(y,design)

fit <- glmFit(y,design)
con <- glmLRT(fit, contrast = c(0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,-1,1))
topTags(con)
logFC   logCPM        LR        PValue           FDR
ENSMUSG00000002504 -7.610567 9.135340 1214.1776 5.059592e-26 6.390771e-26
ENSMUSG00000075705 -7.771027 9.102844 1122.7563 3.740833e-24 2.362523e-24
ENSMUSG00000032855 -6.900392 8.748469 1077.7719 2.239136e-23 9.427508e-23
ENSMUSG00000024132 -7.286323 8.934095 1046.8883 1.155220e-22 3.647897e-22

outfile <- topTags(con, n=Inf, sort.by = "none")

write.csv(outfile,file= "W:/myfolder/outFile.csv")