Question: edgeR: How can I determine which factors significantly explain my DE genes?
0
gravatar for grabear
17 months ago by
grabear0
grabear0 wrote:

Hello SE.  I'm relatively new to using edgeR and GLM's.  I have been doing some basic research on linear regression.  In many examples of using the stats::glm function, there have been ways to determine which coefficients help explain the data being modeled after.  

For example Taken from Youtube...

    # First we get our data.

    mydata <- read.table("panel80.txt")

    # attach(mydata) # In case you want to work with the variable names directly

    names(mydata) # This shows us all the variable names.

    # options(scipen=20) # suppress "scientific" notation

    options(scipen=NULL) # Brings things back to normal

    # Now we want to make a binary response variable that has a vote for Carter=0 and a vote for Reagan=1.

    mysubsetdata1 <- subset(mydata, VOTE == 1 | VOTE==2) # This keeps only those respondents who actually voted for Carter or Reagan.

    mysubsetdata2<-subset(mysubsetdata1, select=c(VOTE, REAFEEL3, CARFEEL3, INC, AGE, PARTYID)) #This keeps only the variables that we are using.

    #mysubsetdata2

    # In VOTE, a vote for Reagan=1 and a vote for Carter=2. We change that in the loop below when we create the new variable binaryvote.

    binaryvote <- 0 # This initializes the binaryvote variable.

    for (i in 1:nrow(mysubsetdata2)) {

    if (mysubsetdata2$VOTE[i]==2) binaryvote[i]=0 else binaryvote[i]=1

    }

    # Here is another way to do the same thing in one line.

    binaryvote2 <- ifelse((mysubsetdata2$VOTE == 1), 1, 0)

    mysubsetdata3<-cbind(mysubsetdata2, binaryvote, binaryvote2)

    mysubsetdata3 

    # Now we do the logistic regression. 

    logitvote.model <- glm(binaryvote ~ INC + AGE + PARTYID, family=binomial, data=mysubsetdata3) 

    summary(logitvote.model)

The summary(logitvote.model) line should print something like this:

> summary(logitvote.model)

Call:
glm(formula = binaryvote ~ INC + AGE + PARTYID, family = binomial, 
    data = mysubsetdata3)

Deviance Residuals: 
    Min       1Q   Median       3Q      Max  
-2.5041  -0.7614   0.3473   0.6110   2.0888  

Coefficients:
             Estimate Std. Error z value Pr(>|z|)    
(Intercept) -3.330928   0.572542  -5.818 5.96e-09 ***
INC          0.060679   0.021454   2.828  0.00468 ** 
AGE          0.007275   0.007407   0.982  0.32598    
PARTYID      0.730539   0.069137  10.567  < 2e-16 ***
---
Signif. codes:  0 ‘***’ 0.001 ‘**’ 0.01 ‘*’ 0.05 ‘.’ 0.1 ‘ ’ 1

(Dispersion parameter for binomial family taken to be 1)

    Null deviance: 627.21  on 463  degrees of freedom
Residual deviance: 437.00  on 460  degrees of freedom
  (52 observations deleted due to missingness)
AIC: 445

Number of Fisher Scoring iterations: 5

Here the Coefficients with an asterisk(*) are factors that help explain the response variable (in this case binaryvote).

How can I find this same information for and edgeR example? I'm looking at a large group of patients grouped together by phenotype, and I'm trying to compare several subgroups. Here's what my workflow looks like:

ds <- create_dataset("data/merged_counts_table.csv", "data/annotation.csv")
dataset <- ds$edgeR
dataset$samples$group <- relevel(dataset$samples$group, ref = "E")
keep <- rowSums(cpm(dataset) > 1) >= 2
filtered_object <- dataset[keep, , keep.lib.sizes=FALSE]
norm_object <- calcNormFactors(filtered_object)

# Dispersion
experiment_design <- ~0+group + region + sex
design <- model.matrix(experiment_design, data = norm_object$samples)
> colnames(design)
[1] "groupE"    "groupA"    "groupB"    "groupC"    "groupD"    "regionOFC" "sexMale"  

colnames(design) <- stringr::str_replace(colnames(design), "group", "")
colnames(design) <- stringr::str_replace(colnames(design), ":", "_")
colnames(design) <- stringr::str_replace(colnames(design), ":", "_")
colnames(design) <- stringr::str_replace(colnames(design), ":", "_")
design <- design[, -15]
design <- design[, -19]
dataset <- estimateDisp(norm_object, design = design)

fit <- glmQLFit(dataset, design, robust=TRUE)
res <- glmQLFTest(fit, contrast=c(0, -0.5, -0.5, 0.5, 0.5, 0, 0)))
is.de <- decideTestsDGE(res)

Here's my sessionInfo:

> sessionInfo()
R version 3.4.3 (2017-11-30)
Platform: x86_64-w64-mingw32/x64 (64-bit)
Running under: Windows 7 x64 (build 7601) Service Pack 1

Matrix products: default

locale:
[1] LC_COLLATE=English_United States.1252  LC_CTYPE=English_United States.1252    LC_MONETARY=English_United States.1252
[4] LC_NUMERIC=C                           LC_TIME=English_United States.1252    

attached base packages:
[1] parallel  stats4    stats     graphics  grDevices utils     datasets  methods   base     

other attached packages:
 [1] org.Hs.eg.db_3.5.0         AnnotationDbi_1.40.0       bindrcpp_0.2.2             DESeq2_1.18.1             
 [5] SummarizedExperiment_1.8.1 DelayedArray_0.4.1         matrixStats_0.53.1         Biobase_2.38.0            
 [9] GenomicRanges_1.30.3       GenomeInfoDb_1.14.0        IRanges_2.12.0             S4Vectors_0.16.0          
[13] BiocGenerics_0.24.0        edgeR_3.20.9               limma_3.34.9              

loaded via a namespace (and not attached):
 [1] bit64_0.9-7            splines_3.4.3          assertthat_0.2.0       Formula_1.2-3          statmod_1.4.30        
 [6] DO.db_2.9              rvcheck_0.0.9          latticeExtra_0.6-28    blob_1.1.1             GenomeInfoDbData_1.0.0
[11] yaml_2.1.19            pillar_1.2.2           RSQLite_2.1.1          backports_1.1.2        lattice_0.20-35       
[16] glue_1.2.0             digest_0.6.15          RColorBrewer_1.1-2     XVector_0.18.0         checkmate_1.8.5       
[21] qvalue_2.10.0          colorspace_1.3-2       htmltools_0.3.6        Matrix_1.2-12          plyr_1.8.4            
[26] XML_3.98-1.11          pkgconfig_2.0.1        genefilter_1.60.0      zlibbioc_1.24.0        xtable_1.8-2          
[31] GO.db_3.5.0            scales_0.5.0           BiocParallel_1.12.0    annotate_1.56.2        tibble_1.4.2          
[36] htmlTable_1.11.2       ggplot2_2.2.1          nnet_7.3-12            lazyeval_0.2.1         survival_2.41-3       
[41] magrittr_1.5           memoise_1.1.0          DOSE_3.4.0             MASS_7.3-50            foreign_0.8-69        
[46] tools_3.4.3            data.table_1.11.2      stringr_1.3.1          munsell_0.4.3          locfit_1.5-9.1        
[51] cluster_2.0.6          compiler_3.4.3         rlang_0.2.0            grid_3.4.3             RCurl_1.95-4.10       
[56] rstudioapi_0.7         htmlwidgets_1.2        igraph_1.2.1           bitops_1.0-6           base64enc_0.1-3       
[61] gtable_0.2.0           DBI_1.0.0              R6_2.2.2               reshape2_1.4.3         gridExtra_2.3         
[66] dplyr_0.7.4            knitr_1.20             bit_1.1-12             bindr_0.1.1            fastmatch_1.1-0       
[71] Hmisc_4.1-1            fgsea_1.4.1            GOSemSim_2.4.1         stringi_1.2.2          Rcpp_0.12.16          
[76] geneplotter_1.56.0     rpart_4.1-13           acepack_1.4.1
ADD COMMENTlink modified 17 months ago by Gordon Smyth39k • written 17 months ago by grabear0
Answer: edgeR: How can I determine which factors significantly explain my DE genes?
3
gravatar for Gordon Smyth
17 months ago by
Gordon Smyth39k
Walter and Eliza Hall Institute of Medical Research, Melbourne, Australia
Gordon Smyth39k wrote:

You're essentially just asking how to get p-values out of edgeR. You can see p-values from glm() or lm() by using summary(fitted.model). How can you see p-values from edgeR? The answer is simply use to use:

topTable(res)

Let me make a few observations here:

  1. summary(glm()) gives p-values for each coefficient for one response variable. edgeR however gives you all genewise p-values for one coefficient, because that is a much easier way to interpret RNA-seq data. In other words, edgeR gives you all the p-values for one coefficient for lots of response variables. To see multiple p-values for one gene, you have to make several contrasts separately.
  2. summary(glm()) only gives you p-values for coefficients in the design matrix. edgeR however allows you to make arbitrary contrasts of the coefficients, which you have taken advantage of in your example. Using glm() it would be very difficult to compare groups C+D to A+B as in your edgeR example.
  3. The whole concept of "which groups explain the DE results" doesn't really apply to your RNA-seq analysis. In the RNA-seq analysis, you want to know which genes are different between groups. It's not about which groups explain the results but about which differences between groups are important for each gene. Or, to put it the other way around, which genes are DE for which comparisons ... which is what edgeR gives you.
  4. summary(glm()) uses a very simple type of statistical test called a Wald test. This is however known to perform very poorly for binary regression. edgeR uses a much better test called a likelihood ratio test.
  5. edgeR adjusts the p-values for multiple testing across genes to give you false discovery rates (FDR). glm() of course doesn't need to do this because it's working with only one response variable.
ADD COMMENTlink modified 17 months ago • written 17 months ago by Gordon Smyth39k

Ok.  Thanks!  I'm a biomedical engineer by design, but I've been retraining myself to do computation biology over the past few years.

What you said makes sense, and some of which crossed my mind.  However, I don't like to put my data into a conceptual black box.  Now I feel much more confident in moving forward.

 

Also, I'm about to start getting involved with a new python port of edgeR called edgePy.  The r/bioinformatics community has come together to begin writing it.  https://github.com/r-bioinformatics/edgePy

ADD REPLYlink written 17 months ago by grabear0
1

Thanks for letting me know about edgePy. I think you will find this a very big project. Also, I don't see how edgePy hopes to be faster than the original, since edgeR is primarily written in C++.

ADD REPLYlink modified 17 months ago • written 17 months ago by Gordon Smyth39k

Well they have a plan.  I believe they will be using NumPy's an SciPy's math library, which is written in C.  So it should be comparable, but I'm not sure to be honest.  I think some people just prefer Python over R.  I personally like both, and use both.

ADD REPLYlink written 17 months ago by grabear0
Please log in to add an answer.

Help
Access

Use of this site constitutes acceptance of our User Agreement and Privacy Policy.
Powered by Biostar version 16.09
Traffic: 335 users visited in the last hour