Question

Genes upregulated in one condition vs each of the other conditions

0

Entering edit mode

dario.beraldi • 0

@darioberaldi-6991

Last seen 2.6 years ago

United Kingdom

I have a differential expression design with multiple conditions, say A, B, C, D.

Using edgeR I'd like to find genes that are upregulated in condition A vs each of the others conditions, i.e. genes upregulated in: A vs B and in A vs C and in A vs D.

My current strategy is to do all the pairwise comparisons and than intersect the output to find genes with e.g. LogFC > 0 and FDR < 0.05 in each comparisons. Like (partially pseudo-code):

design <- model.matrix(~0 + group)
contrasts <- makeContrasts(
    A_vs_B= A - B,
    A_vs_C= A - C,
    A_vs_D= A - D,
    levels= design
)
fit <- glmFit(y, design)

foreach contrast in contrasts:
    lfc <- glmTreat(fit, contrast= contrast)
    detable <- append(detable, topTags(lfc, n= Inf))

# Now query detable

Is there a better strategy? I think a contrast as A_vs_all= A - (B + C + D)/3 won't work because it compares A vs the average of B, C, D.

(For the record, I don't have genes, I have ATACseq regions and I'm trying to answer the question which ATAC sites are specific to one particular condition?)

edger design • 1.6k views

ADD COMMENT • link updated 5.0 years ago by Gordon Smyth 52k • written 5.0 years ago by dario.beraldi • 0

score 2 · Answer 1 · 2020-07-14

2

Entering edit mode

Gordon Smyth 52k

@gordon-smyth

Last seen 20 minutes ago

WEHI, Melbourne, Australia

Kevin has attempted a stricter interpretation of your statement about finding regions specific to one condition, but I will assume that you are simply trying to find regions that are significantly more open (or less open) in A than in any of the other conditions. That's the question posed by the title of your question and is a perfectly sensible way to proceeed. In that case, doing the three comparisons of A with B,C,D as you have is exactly what you need to do.

Collating the three results is quite easy, for example:

AvsB <- glmTreat(fit, contrast=contrast[,1])
AvsC <- glmTreat(fit, contrast=contrast[,2])
AvsD <- glmTreat(fit, contrast=contrast[,3])
Results <- cbind(decideTests(AvsB), decideTests(AvsC), decideTests(AvsD))
RegionsUp <- which( rowSums(Results)== 3 )
RegionsDown <- which( rowSums(Results)== -3 )

Defining a reference set of peaks/regions

IMO there are four major ways to define the regions for differential accessibility analyses of ATAC-seq.

Focus on genes, i.e., preset putative promoter regions or gene bodies (e.g., PMID: 23375371).
Tile windows across the genome, possibly merging adjacent windows that appear concordant (e.g., csaw package).
Pool all the libraries together and use a peak-caller on the the pool. Do a differential accessibility anaysis for the resulting consensus peaks.
Call peaks on each library separately and combine the peaks using some ad hoc intersection/union method.

I use Methods 1-3, all of which control the FDR correctly and have no problem with differing sequencing depths. Method 4 is the most commonly used in the literature, but also the only one that I consider unsatisfactory.

ADD COMMENT • link 5.0 years ago Gordon Smyth 52k

0

Entering edit mode

Thanks Gordon, I put a longer comment to Kevin's answer - any thought much appreciated!

ADD REPLY • link 5.0 years ago dario.beraldi • 0

1

Entering edit mode

I edited my answer to address your question about reference regions.

ADD REPLY • link 5.0 years ago Gordon Smyth 52k

0

Entering edit mode

Thanks for the nice overview. I'm working on a non model organism with very different life stages (it's a multi-host parasite) and I think library quality depends on the life stage. Also, read depth varies a lot - between 30 and 160x. I have opted for a combination of your option 3 and 4: Call peaks combining libraries within stages (I'm using Genrich peak caller) then take the union of the stages. About your option 3, I'm worried that the high-depth/high-quality libraries will dominate the pool.

ADD REPLY • link 5.0 years ago dario.beraldi • 0

0

Entering edit mode

Aaro Lun and I showed that your method, like Method 4, does not control the FDR correctly (PMID: 24852250). Your method is ok though if you want a super sensitive method and you understand that the p-values can't be taken literally.

ADD REPLY • link 5.0 years ago Gordon Smyth 52k

score 1 · Answer 2 · 2020-07-14

1

Entering edit mode

Kevin Blighe ★ 4.0k

@kevin

Last seen 7 weeks ago

Naas, Republic of Ireland

Hey Dario, good to see you here.

which ATAC sites are specific to one particular condition?

Is this what a differential expression comparison (any comparison) is going to show? A differential expression comparison will tell you what is statistically significantly different between, e.g., A and the other groups, but not necessarily what is unique / specific to A.

In my mind, what you need to do here are the pairwise comparisons, take common peaks / regions from these, and also apply some further metric, such as Z-score or read cut-offs. That is, you need p-values but also some other threshold to define what is truly open [chromatin].

Kevin

ADD COMMENT • link 5.0 years ago Kevin Blighe ★ 4.0k

0

Entering edit mode

Thanks Kevin/Gordon for the welcome and the answers. So the pairwise + intersection is an acceptable strategy. I left the true intent of my question as a footnote because while it starts very simple (which ATAC sites are specific to one particular condition?) it quickly leads into a rabbit hole.

In my opinion, even deciding how to produce the reference set of peaks is not obvious especially when the libraries in one condition worked better than the others or they were sequence deeper (so that condition generate disproportionately more peaks and it appears generally more open, is this a concern?) and (guess what?) you have max three replicates per condition. Basically, for RNAseq you know beforehand where to look for DE but for ATAC/ChIPseq the task of mapping regions and quantifying them are conflated. Then there is the question of thresholds on FDR (which also depends on library quality), logFC, z-score... Or am I seeing problems where there ain't any?!

In fact, I'd like to see a review paper discussing not just how to analyse ChIP/ATACseq but really what questions can be meaningfully asked with these techniques.

ADD REPLY • link 5.0 years ago dario.beraldi • 0