Question: edgeR trended or tagwise
8 months ago by Andy9160 (Netherlands)

Andy9160 wrote:

Dear Bioconductor,

I am currently performing an RNAseq analysis on rat samples using edgeR. Data was aligned using STAR against the rat genome (Rnor v6.0) and the genes were counted using featureCounts against the Ensembl Rnor gtf file.

The experimental design comprises two factors: DSS and stimulation.

> phenosheet
DSS Stimulation
1  Water Stimulation
2  Water        Sham
3    DSS Stimulation
4  Water Stimulation
5  Water        Sham
6    DSS        Sham
7    DSS Stimulation
8    DSS        Sham
9  Water Stimulation
10 Water        Sham
11   DSS Stimulation
12   DSS        Sham

Given the design, I adopted an approach similar to example "3.3 Experiments with all combinations of multiple factors" from the edgeR manual.

#Prepare DGEList object
> dgList <- DGEList(counts = gcounts_proc,
genes = rownames(gcounts_proc),
samples = phenosheet)
#Keep genes with CPM > 1 in at least 2 samples
> dgList <- dgList[rowSums(cpm(dgList)>1) >= 2, , keep.lib.sizes=FALSE]
#Normalize using TMM
> dgList <- calcNormFactors(dgList)
#Design
> design_mat <- model.matrix(~ 0 + paste0(dgList$samples$DSS, "_", dgList$samples$Stimulation))
> colnames(design_mat) <- gsub(".+\\)(.+)$", "\\1", colnames(design_mat))
> design_mat
   DSS_Sham DSS_Stimulation Water_Sham Water_Stimulation
1         0               0          0                 1
2         0               0          1                 0
3         0               1          0                 0
4         0               0          0                 1
5         0               0          1                 0
6         1               0          0                 0
7         0               1          0                 0
8         1               0          0                 0
9         0               0          0                 1
10        0               0          1                 0
11        0               1          0                 0
12        1               0          0                 0

#Dispersion estimation
> dgList <- estimateDisp(y = dgList, design = design_mat, robust = TRUE)
> plotBCV(dgList)
> fit <- glmQLFit(y = dgList, design = design_mat, robust = TRUE)
> plotQLDisp(fit)

#Comparisons
> contrast_mat <- makeContrasts(
+   DSS_StimvSham = DSS_Stimulation - DSS_Sham,
+   Water_StimvSham = Water_Stimulation - Water_Sham,
+   Sham_DSSvWater = DSS_Sham - Water_Sham,
+   Stim_DSSvWater = DSS_Stimulation - Water_Stimulation,
+   DSSvWater = (DSS_Sham + DSS_Stimulation)/2 - (Water_Sham + Water_Stimulation)/2,
+   StimvSham = (DSS_Stimulation + Water_Stimulation)/2 - (DSS_Sham + Water_Sham)/2,
+   StimvShamvDSSvWater = (DSS_Stimulation - DSS_Sham) - (Water_Stimulation - Water_Sham),
+   levels = design_mat)
> results_DSS_StimvSham <- glmQLFTest(glmfit = fit, contrast = contrast_mat[, "DSS_StimvSham"])
> topTags_DSS_StimvSham <- topTags(object = results_DSS_StimvSham)
> head(topTags_DSS_StimvSham$table)
logFC   logCPM        F       PValue       FDR
1.1020749 4.977046 29.12748 0.0001212882 0.9999431
0.4504407 7.806974 26.97715 0.0001689446 0.9999431
0.9105491 3.790667 24.58781 0.0002559413 0.9999431
-5.1133035 2.364305 27.19418 0.0002968992 0.9999431
2.0377876 1.294093 21.54299 0.0004531409 0.9999431
-1.1755983 3.359916 20.39320 0.0005701034 0.9999431

Surprisingly, all adjusted p-values are 0.9999431. When I generate a jitterplot of log(CPM) for the most DE gene, I get the impression that the top gene is differentially expressed when comparing Stimulation vs Sham in the DSS group.

I have played around with some other parameters (e.g. using the tagwise dispersion instead of the default trended dispersion, or simply using the glmLRT function), which yielded lower adjusted p-values and made me wonder whether I was using the right dispersion. However, the BCV plot did not suggest to me that anything strange was going on: I only see the usual trend where genes with low CPM tend to have a higher BCV.

Something I did notice was that my QL plot was less squeezed than what is shown in Section 4.2.7 of the edgeR user's guide, but I am not sure whether that is important.

So my question is: Why does glmQLFit using the trended dispersion not work here, and is the usage of tagwise dispersion warranted in this case?

>sessionInfo()
R version 3.4.3 (2017-11-30)
Platform: x86_64-pc-linux-gnu (64-bit)
Running under: Ubuntu 16.04.3 LTS

Matrix products: default
BLAS: /usr/lib/openblas-base/libblas.so.3
LAPACK: /usr/lib/libopenblasp-r0.2.18.so

locale:
[1] LC_CTYPE=en_US.UTF-8       LC_NUMERIC=C               LC_TIME=en_US.UTF-8        LC_COLLATE=en_US.UTF-8     LC_MONETARY=en_US.UTF-8
[6] LC_MESSAGES=en_US.UTF-8    LC_PAPER=en_US.UTF-8       LC_NAME=C                  LC_ADDRESS=C               LC_TELEPHONE=C
[11] LC_MEASUREMENT=en_US.UTF-8 LC_IDENTIFICATION=C

attached base packages:
[1] stats     graphics  grDevices utils     datasets  methods   base

other attached packages:
[1] NDlib_0.0.0.9000   pheatmap_1.0.8     matrixStats_0.52.2 ggplot2_2.2.1      Cairo_1.5-9        edgeR_3.20.6       limma_3.34.5
[8] biomaRt_2.34.1

loaded via a namespace (and not attached):
[1] ProtGenerics_1.10.0           bitops_1.0-6                  bit64_0.9-7                   RColorBrewer_1.1-2
[5] progress_1.1.2                httr_1.3.1                    GenomeInfoDb_1.14.0           tools_3.4.3
[9] backports_1.1.2               R6_2.2.2                      rpart_4.1-11                  Hmisc_4.1-1
[13] DBI_0.7                       lazyeval_0.2.1                BiocGenerics_0.24.0           Gviz_1.22.2
[17] colorspace_1.3-2              nnet_7.3-12                   gridExtra_2.3                 prettyunits_1.0.2
[21] RMySQL_0.10.13                curl_3.1                      bit_1.1-12                    compiler_3.4.3
[25] Biobase_2.38.0                htmlTable_1.11.1              DelayedArray_0.4.1            labeling_0.3
[29] rtracklayer_1.38.2            scales_0.5.0                  checkmate_1.8.5               stringr_1.2.0
[33] digest_0.6.13                 Rsamtools_1.30.0              foreign_0.8-69                XVector_0.18.0
[37] base64enc_0.1-3               dichromat_2.0-0               htmltools_0.3.6               ensembldb_2.2.0
[41] BSgenome_1.46.0               htmlwidgets_0.9               rlang_0.1.6                   rstudioapi_0.7
[45] RSQLite_2.0                   BiocInstaller_1.28.0          shiny_1.0.5                   BiocParallel_1.12.0
[49] acepack_1.4.1                 VariantAnnotation_1.24.5      RCurl_1.95-4.10               magrittr_1.5
[53] GenomeInfoDbData_1.0.0        Formula_1.2-2                 Matrix_1.2-11                 Rcpp_0.12.14
[57] munsell_0.4.3                 S4Vectors_0.16.0              stringi_1.1.6                 yaml_2.1.16
[61] SummarizedExperiment_1.8.1    zlibbioc_1.24.0               plyr_1.8.4                    AnnotationHub_2.10.1
[65] grid_3.4.3                    blob_1.1.0                    ggrepel_0.7.0                 parallel_3.4.3
[69] lattice_0.20-35               Biostrings_2.46.0             splines_3.4.3                 GenomicFeatures_1.30.0
[73] locfit_1.5-9.1                knitr_1.18                    pillar_1.0.1                  GenomicRanges_1.30.1
[77] stats4_3.4.3                  XML_3.98-1.9                  biovizBase_1.26.0             latticeExtra_0.6-28
[81] data.table_1.10.4-3           httpuv_1.3.5                  gtable_0.2.0                  assertthat_0.2.0
[85] mime_0.5                      xtable_1.8-2                  AnnotationFilter_1.2.0        survival_2.41-3
[89] tibble_1.4.1                  GenomicAlignments_1.14.1      AnnotationDbi_1.40.0          memoise_1.1.0
[93] IRanges_2.12.0                cluster_2.0.6                 statmod_1.4.30                interactiveDisplayBase_1.16.0
modified 8 months ago by Aaron Lun • written 8 months ago by Andy9160
8 months ago by Aaron Lun (Cambridge, United Kingdom)

Aaron Lun wrote:

Nothing presented in your post suggests to me that the glmQLF* functions are "not working". When you have thousands of genes, the top gene with the lowest p-value will always look differentially expressed to some degree. That's the whole reason for correcting for multiple testing; you can't trust your eyes when dealing with high-throughput data involving many tests. For example:

# Making up some data:
set.seed(0)
y <- matrix(rnbinom(100000, mu=20, size=10), nrow=10000)
grouping <- gl(2,5)
design <- model.matrix(~grouping)

# Running through edgeR:
library(edgeR)
y <- DGEList(y)
y <- estimateDisp(y, design)
fit <- glmQLFit(y, design)
res <- glmQLFTest(fit)

# Visualizing the best one.
best <- which.max(res$table$logFC)
boxplot(split(cpm(y, log=TRUE)[best,], grouping))

The top gene looks as if it were DE, despite the fact that it's a false positive. This is because all counts were drawn from the same distribution, meaning that the null hypothesis is true for all genes. This is a clear case where the eyes are deceived; the BH method, however, is not fooled (adjusted p-value of 0.68) and correctly does not reject the null.
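The same point can be made without edgeR at all. Below is a minimal base-R sketch (using only `runif` and `p.adjust`, nothing from the thread's actual data): under a global null, the smallest of many raw p-values always looks impressive, but the BH adjustment accounts for the number of tests performed.

```r
# 10,000 tests where the null hypothesis is true for every one,
# so the p-values are uniformly distributed.
set.seed(42)
pvals <- runif(10000)

# The smallest raw p-value looks very "significant"...
min(pvals)

# ...but after BH adjustment it is nowhere near significance, because the
# smallest of 10,000 uniform draws is exactly what chance predicts.
padj <- p.adjust(pvals, method = "BH")
min(padj)
```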

So the conclusion for your analysis is pretty clear to me. There is no evidence for differential expression between your two conditions. Whether that is due to a lack of power in your experimental design, or due to an underlying biological similarity between the two conditions, is a question that can only be answered with a more highly powered experiment (i.e., more and/or tighter replicates).

To actually answer your question; glmQLFit should not be used with the tagwise dispersion, as the function treats the NB dispersion estimate as known. This is reasonable for a trended estimate where information is shared across many genes to get a precise trend fit; less so for a tagwise estimate, where the information from other genes has much less weight. Using the tagwise dispersions would probably result in some anticonservativeness as the estimation uncertainty of the tagwise values is not properly modelled.
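To illustrate why information sharing matters, here is a toy base-R sketch (NOT edgeR's actual algorithm; the weight and the simulated estimates are made up for illustration) of shrinking noisy per-gene estimates toward a trend fitted from all genes:

```r
# Toy illustration (not edgeR's algorithm): noisy per-gene dispersion-like
# estimates are shrunk toward a common trend by a weighted average. The
# trend, being fit from thousands of genes, is precise; the raw per-gene
# values, each based on a handful of replicates, are not.
set.seed(1)
true_disp <- 0.1
raw <- true_disp * rchisq(10000, df = 4) / 4   # noisy per-gene estimates
trend <- mean(raw)           # stand-in for a fitted mean-dispersion trend
weight <- 0.8                # strength of shrinkage toward the trend
shrunk <- weight * trend + (1 - weight) * raw

# The shrunk estimates are far more stable than the raw ones:
c(sd(raw), sd(shrunk))
```

The shrunk values keep the same overall level but have much smaller spread, which is the stabilisation that the trended (and tagwise) estimates provide.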

I see; I guess my understanding of the BH method is currently lacking, so I'll read more into that. Thank you for the answer.


In addition to the reason given by Aaron, there's another reason why tagwise dispersion should not be input to glmQLFit. glmQLFit does empirical Bayes moderation of the QL dispersions. It is critical that the genewise QL dispersions represent genuine gene-to-gene heterogeneity and are not already moderated. Applying empirical Bayes moderation to dispersion values that have already been empirical Bayes moderated by estimateDisp in a gene-specific manner (which is what the tagwise dispersions are) would be mathematical nonsense. You can't do empirical Bayes twice.

I see, thank you for the answer. So even though glmQLFit treats the NB dispersion as known, the tagwise dispersions should not be used for it. In which situation would you use the tagwise dispersion? Or is it simply a diagnostic measure?

Tagwise dispersions are used in glmLRT, where EB shrinkage stabilises the dispersion estimate for each gene. These would otherwise be very unstable, due to the fact that there's not a lot of information per gene. If I recall correctly, stability improves the power of the test and reduces loss of type I error control, both of which are good things.
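For concreteness, a sketch of the classic glmFit/glmLRT pipeline on simulated data like the example above (this assumes edgeR is installed; the simulated counts are made up, not the poster's data):

```r
# The LRT pipeline relies on the EB-shrunk tagwise dispersions stored in the
# DGEList by estimateDisp, rather than on the QL machinery.
library(edgeR)
set.seed(0)
y <- DGEList(matrix(rnbinom(100000, mu = 20, size = 10), nrow = 10000))
design <- model.matrix(~gl(2, 5))
y <- estimateDisp(y, design)
fit <- glmFit(y, design)   # uses y$tagwise.dispersion by default
res <- glmLRT(fit)         # likelihood ratio test on the last coefficient
topTags(res)
```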

From the perspective of the QL framework, though, the tagwise dispersions are simply diagnostic in nature. You can use them to identify particularly variable genes that might be problematic (e.g., X-linked genes in studies where you don't have sex labels, immunoglobulin variable chains when dealing with B cell clones, and so on).
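As a hedged sketch of that diagnostic use (assumes edgeR is installed; the simulated counts and the `most_variable` name are illustrative, not from the thread):

```r
# After estimateDisp, the tagwise dispersions live in y$tagwise.dispersion.
# Ranking genes by them flags the most variable genes for manual inspection.
library(edgeR)
set.seed(0)
counts <- matrix(rnbinom(100000, mu = 20, size = 10), nrow = 10000)
design <- model.matrix(~gl(2, 5))
y <- estimateDisp(DGEList(counts), design)

# Indices of the ten genes with the largest tagwise dispersions; these are
# the candidates worth checking for known problem categories.
most_variable <- head(order(y$tagwise.dispersion, decreasing = TRUE), 10)

plotBCV(y)  # the same genes stand out at the top of the BCV plot
```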