Question

How to run allelic imbalance in apeglm

0

Entering edit mode

euge.eugenio ▴ 20

@eugeeugenio-15908

Last seen 10 months ago

United Kingdom

I'm trying to perform an allelic imbalance analysis on some ATAC-seq data generated from a population of human individuals. As such for each SNP there will be some individuals that are heterozygous and some that are homozygous. In this case my table contains counts for the individuals that are heterozygous but NA values for the individuals that are homozygous.

I've tried following the apeglm vignette, Modeling ratios of counts section to perform a beta-binomial fit, but if I present a table with NA values it will fail with the following error:

Error in seq_len(sum(nonzero)): argument must be coercible to non-negative integer
Traceback:

1. system.time({
 .     for (i in 1:niter) {
 .         param <- cbind(theta.hat, cts)
 .         fit.mle <- apeglm(Y = ase.cts, x = x, log.lik = NULL, 
 .             param = param, no.shrink = TRUE, log.link = FALSE, 
 .             method = "betabinCR")
 .         theta.hat <- bbEstDisp(success = ase.cts, size = cts, 
 .             x = x, beta = fit.mle$map, minDisp = 0.01, maxDisp = 5000)
 .     }
 . })
2. apeglm(Y = ase.cts, x = x, log.lik = NULL, param = param, no.shrink = TRUE, 
 .     log.link = FALSE, method = "betabinCR")   # at line 7-8 of file <text>
3. betabinCppRoutine(Y, x, weights, offset, param, prior.control, 
 .     method, result, bounds, optim.method)
4. sapply(seq_len(sum(nonzero)), function(i) {
 .     betabinFn(initC, x = x, y = YNZ[, i], size = sizeNZ[, i], 
 .         theta = theta[i], weights = weightsNZ[, i], sigma = sigma, 
 .         S = S, no.shrink = no.shrink, shrink = shrink, cnst = 0)
 . })
5. lapply(X = X, FUN = FUN, ...)

What's the correct procedure to analyse this data? Do I need to fill in the NA values with something?

I wonder if there is any code/tutorial to run this kind of analysis, including the estimation of p-values?

apeglm • 1.1k views

ADD COMMENT • link 11 months ago • updated 10 months ago euge.eugenio ▴ 20

0

Entering edit mode

Michael Love could you help me? Thanks!

ADD REPLY • link 11 months ago euge.eugenio ▴ 20

score 1 · Answer 1 · 2024-05-20

1

Entering edit mode

Michael Love 43k

@mikelove

Last seen 13 hours ago

United States

You can put 1 / 1 for the ratio when you are missing counts and then set the weight for that matrix cell to 1e-6.

Let me know if that's clear or I can post code.

ADD COMMENT • link 11 months ago Michael Love 43k

0

Entering edit mode

I think it's been coded correctly, but now it takes forever to run. It's been running for 2 days and it still hasn't finished. Is that normal?

size of cts is: A matrix: 90049 x 54 of type dbl

This is the code for the first bit, and it's stuck there:

# fill NA
weights <- ifelse(is.na(ase.cts), 1e-6, 1)
ase.cts[is.na(ase.cts)] <- 1
cts[is.na(cts)] <- 2

theta.hat <- 100 # rough initial estimate of dispersion
X=matrix(rep(1,ncol(cts)))
niter <- 3
system.time({
  for (i in 1:niter) {
    param <- cbind(theta.hat, cts)
    fit.mle <- apeglm(Y=ase.cts, x=X, log.lik=NULL, param=param,
                      no.shrink=TRUE, log.link=FALSE, method="betabinCR", weights = weights)
    theta.hat <- bbEstDisp(success=ase.cts, size=cts,
                           x=X, beta=fit.mle$map,
                           minDisp=1, maxDisp=500, weights = weights)
  }
})

ADD REPLY • link 10 months ago euge.eugenio ▴ 20

0

Entering edit mode

apeglm was coded in C++, such that 3000 x 8 matrix in the vignette takes <3 seconds.

Try for the first 100 features or so, just one iteration of the loop (fit.mle then theta.hat updates), and see how long it is taking.

You could also likely do more pre-filtering. What are the 90k features?

ADD REPLY • link 10 months ago Michael Love 43k

0

Entering edit mode

Hi, so I've tried running it on 1000 samples and the 3 iterations take about 30 seconds to run. With 2000 samples it takes 1.5 minutes. I've added a few print statements and it basically never completes the bbEstDisp step for even the first loop. CPU usage is still 100% and memory is about 1.2 gb.

The 90k features are all the SNPs that have enough coverage (10 reads) in at least 3 patients. Do you think if I ran it per chromosome the dispersion estimate would still be valid?

ADD REPLY • link 10 months ago euge.eugenio ▴ 20

0

Entering edit mode

So I had a look at the source code of bbEestDisp and I feel the meaning of weights in apeglm is different from bbEstDisp. It seems that it's fitting theta.hat per row, but it's multiplying the weights matrix for the whole result rather than doing it per row of weights. Maybe that's why it's going exponential?

I tried setting the weights of bbEstDisp to 1 and that seems to run fine, but is that statistically correct (having weights for apeglm but not for bbEstDisp)?

That's the results for chromosome 1 enter image description here

ADD REPLY • link 10 months ago euge.eugenio ▴ 20

0

Entering edit mode

I agree, I think weights hadn't been tested in this step of the beta binomial GLM (in the paper we didn't include weights, and this was added posthoc for support of some experiments).

I've tried to address it here, can you install_github():

https://github.com/azhu513/apeglm/commit/0abaccbfaad42996f4dad3cb5fba89597428e7ac

ADD REPLY • link 10 months ago Michael Love 43k

0

Entering edit mode

Yes, that seems to work. The number of significant SNPs with allelic imbalance went down slightly, but I guess that's expected since all the filled in data would have 1/1 counts so no dispersion?

Just last question, am I understanding correctly that I can use the final S-values as if they were FDR corrected p-values?

ADD REPLY • link 10 months ago euge.eugenio ▴ 20

1

Entering edit mode

Yes and yes.

Thanks for posting the bug! The patch should be in devel and release apeglm soon.