Question: MasigPro and estimation of theta for negative binomial law
1
nicolas.hipp10 wrote:

Hi Bioconductor !

I have a time course RNAseq data from 8 time-points with 3 replicates per sample. These data were aligned with STAR, I get the featureCount output. With these file, I wan't to use MasigPro package to evaluate which genes are moving during this process. But I've got a problem with the notion of negative binomial model. I read a lot of post to understand that the RNAseq libraries follow this model due to differences between biological replicates inducing that variance differs from mean.

I've got a table with samples in columns and genes in rows, and effectively when I make a plot with log(row mean) vs log(var mean) I can see the deviation from the line mean=var.

But in this package, they said "θ must be specified and it can be computed by using available methods as edgeR" I understand that Theta is a parameter for the glm (seem to be the "shape" of the glm). But I don't know how to determine it. I read that MASS package can do this, I follow this tutoriel to better understand: https://stats.stackexchange.com/questions/10419/what-is-theta-in-a-negative-binomial-regression-fitted-with-r

but It not seem to work if i make

mean <- apply(data, 1, mean)
var <- apply(data, 1, var)
mod.NB<-glm.nb(mean~var)
Error in glm.fitter(x = X, y = Y, w = w, etastart = eta, offset = offset,  :
NA/NaN/Inf dans 'x'
De plus : There were 50 or more warnings (use warnings() to see the first 50)

So I switch to the edgeR notice to understand how to calculate it. I find that "In this article, we call φg the dispersion ", so It seem to be the same theta as MasigPro defines it. But I cannot find the function to calculate it ...

Does-anybody can help me to understand this problem? Maybe I miss something, I'm new in bioinformatics and statistics

Thanks a lot nicolas

delete miss-localisation

Answer: MasigPro and estimation of theta for negative binomial law
1
22 months ago by
Gordon Smyth39k
Walter and Eliza Hall Institute of Medical Research, Melbourne, Australia
Gordon Smyth39k wrote:

There are many tutorials on how to run the edgeR package, for example the edgeR User's Guide. If you are new to bioinformatics, that is where to start.

Finding genes that change with time is quite straightforward using a number of tools. There shouldn't be any need to do your own mathematical analysis.

Hi,

Thanks for this answer, so I read it carefully but I'm still block to understand the calcul :/.

They provide a function to estimate the common dispersion (for negative binomial), so I tried it on my data :

y <- estimateDisp(y, design)

An object of class "DGEList"
$counts .. 37609 more rows ...$samples
group lib.size norm.factors
sample1.1          1 97100006            1
sample2.2          1 96283551            1
Sample3.3          1 99571612            1
Sample4.1     2 81200333            1
Sample5.2     2 78098333            1
19 more rows ...

$common.dispersion  0.178411$AveLogCPM
 -0.2453613 -4.4445004 -1.5667065 -4.8813193 -4.2265591
37609 more elements ...

$trend.method  "locfit"$trended.dispersion
 0.2500373 1.3502977 0.3394539 1.8879130 1.1313951
37609 more elements ...

$span  0.2791707$prior.df
 9.451862

$tagwise.dispersion  0.1713445 1.5359864 0.3270974 2.6560714 1.5352592 37609 more elements ...$prior.n
 0.5907414

\$design
(Intercept) group2 group3 group4 group5 group6 group7 group8
1           1      0      0      0      0      0      0      0
2           1      0      0      0      0      0      0      0
3           1      0      0      0      0      0      0      0
4           1      1      0      0      0      0      0      0
5           1      1      0      0      0      0      0      0
19 more rows ...

So there is a common dispersion parameter =

0.178411

But, It's strange because in MasigPro, they advice " theta: θ parameter for negative.binomial family. By default θ = 10. " or "θ = 5" which seem to be huge compare to my value.

So I'm wondering If I still miss something ...

Thanks

1

OK, so you can continue:

fit <- glmFit(y, design)
lrt <- glmLRT(fit, coef=2:8)
topTags(lrt)

and now you have a list of genes that are differentially expressed between the times, i.e., which are moving through the process.

BTW, theta = 1/ dispersion, but you shouldn't need it.

Yep, I read that in the edgeR manual, if I understand well the coef=2:8 means that all the sample will be compare to 1 which is my T0 ?

Effectively, I will have a look to cluster the gene with the same pattern of expression ;)

And thanks for the theta explanation, I will help me to compare method between edgeR / Deseq2 / masygpro comparaison