Question: Using linear models to find differential gene expression (NGS)
0
Johnny H80 wrote:
Hi. I have found some R/Bioconductor/Genominator code on the web (below) and it measures differential expression of RNA-seq short read data using a general linear model. Can someone explain some basic questions I have? 1) What is the reason for using 2 glm's for measuring differential expression? 2) In the function(y) there are two linear models ran; one with argument y ~ groups and the other with argument y ~ 1. Why do this? 3) What does the offset do? 4) Why use ANOVA; is to compare the linear models? 5) What can we say about results, if adjusted for multiple testing; how would you explain a significant result? 6) Would an adjusted p-value of <= 0.05 be significant? Basically, I don't know much about the statistics done below and any advice or pointers to good literature for this would be a great help. Thank you. > head(geneCountsUI) mut_1_f mut_2_f wt_1_f wt_2_f YAL069W 0 0 0 0 YBL049W 19 18 10 4 # Normalisation of RNA-seq lanes notZero <- which(rowSums(geneCountsUI) != 0) upper.quartiles <- apply(geneCountsUI[notZero, ], 2, function(x) quantile(x, 0.75)) uq.scaled <- upper.quartiles/sum(upper.quartiles) * sum(laneCounts) # Calculating differential expression groups <- factor(rep(c("mut", "wt"), times = c(2, 2))) pvalues <- apply(geneCountsUI[notZero, ], 1, function(y) { fit <- glm(y ~ groups, family = poisson(), offset = log(uq.scaled)) fit0 <- glm(y ~ 1, family = poisson(), offset = log(uq.scaled)) anova(fit0, fit, test = "Chisq")[2, 5] }) [[alternative HTML version deleted]]
• 4.4k views  modified 8.9 years ago by Thomas J Hardcastle180 • written 8.9 years ago by Johnny H80
Answer: Using linear models to find differential gene expression (NGS)
1
Simon Anders3.5k wrote:
Hi Johnny On 09/01/2010 02:25 PM, Johnny H wrote: > I have found some R/Bioconductor/Genominator code on the web (below) and it > measures differential expression of RNA-seq short read data using a general > linear model. The code you quote uses a GLM of the Poisson family to analyse RNA- Seq. Please note that this is not appropriate because it cannot and does not assess biological variation! As this might contradict what other people say here, I demonstrate: Assume you have two conditions ("A" and "B") with two biological replicates each. For this example, we further assume that the library sizes are, due to some magic, the same in all four samples, so that we don't have to worry about normalization. Let's say we have, for a given gene, these counts: > y <- c( A1=180, A2=200, B1=240, B2=260 ) > group <- factor( c( "A", "A", "B", "B" ) ) Let's see the within group mean and standard deviation: > tapply( y, group, mean ) A B 190 250 > tapply( y, group, sd ) A B 14.14214 14.14214 From condition "A" to "B", the counts raise from an average of 190 to an average of 250, with a within-group SD of below 15. This should be significant, and it is: > fit0 <- glm( y ~ 1, family=poisson() ) > fit1 <- glm( y ~ group, family=poisson() ) > anova( fit0, fit1, test="Chisq" ) Analysis of Deviance Table Model 1: y ~ 1 Model 2: y ~ group Resid. Df Resid. Dev Df Deviance P(>|Chi|) 1 3 18.2681 2 2 1.8533 1 16.415 5.089e-05 *** --- Signif. codes: 0 ?***? 0.001 ?**? 0.01 ?*? 0.05 ?.? 0.1 ? ? 1 Now, for comparison, look at this data: > y2 <- c( A1=100, A2=280, B1=150, B2=350 ) > group <- factor( c( "A", "A", "B", "B" ) ) The means are the same, but the SD is much larger: > tapply( y2, group, mean ) A B 190 250 > tapply( y2, group, sd ) A B 127.2792 141.4214 An increase by 60 (from 190 to 250) should not be significant, if the standard deviation within group is more than twice as much. But as a Poisson regression does not care about this, we get a very low p value: > fit0 <- glm( y2 ~ 1, family=poisson() ) > fit1 <- glm( y2 ~ group, family=poisson() ) > anova( fit0, fit1, test="Chisq" ) Analysis of Deviance Table Model 1: y2 ~ 1 Model 2: y2 ~ group Resid. Df Resid. Dev Df Deviance P(>|Chi|) 1 3 187.48 2 2 171.06 1 16.415 5.089e-05 *** --- Signif. codes: 0 ?***? 0.001 ?**? 0.01 ?*? 0.05 ?.? 0.1 ? ? 1 The proper thing to do is to use a family that allows for overdispersion, typically the negative binomial family, not the Poisson family. For a detailed discussion of this issue, see our preprint: http://precedings.nature.com/documents/4282/ For a solution of your problem that works for simple comparisons (one condition against another one), see the our "DESeq" package (or Robinson et al.'s "edgeR" package). If you need to do something more sophisticated, like a two-way anova: we are working on it. Simon +--- | Dr. Simon Anders, Dipl.-Phys. | European Molecular Biology Laboratory (EMBL), Heidelberg | office phone +49-6221-387-8632 | preferred (permanent) e-mail: sanders at fs.tum.de
Answer: Using linear models to find differential gene expression (NGS)
0
Kasper Daniel Hansen6.4k wrote:
Answer: Using linear models to find differential gene expression (NGS)
0
Thomas J Hardcastle180 wrote:
Hi Johnny, In addition to the previous answers, you might also like to take a look at the BMC Bioinformatics paper 'baySeq: Empirical Bayes methods for identifying differential expression in sequence count data' which compares the effectiveness of a number of methods of detecting differential expression in sequence data as well as introducing a novel method for such analyses. Tom ** <http: www.biomedcentral.com="" 1471-2105="" 11="" 422=""> -- Dr. Thomas J. Hardcastle Department of Plant Sciences University of Cambridge Downing Street Cambridge, CB2 3EA United Kingdom