EdgeR question: why I got discrepancy results for DEGs when using different ways setting up the models
1
0
Entering edit mode
Ming ▴ 350
@ming-yi-6363
Last seen 9 weeks ago
United States

Hi, there:

I used edgeR a while back, and just recently I started to use it again for a RNAseq project, and I bet there might be some updates or changes, since I am trying to understand why I got different DEGs results with very different numbers of DEGs use two ways of model settings:

1. setting the model and contrasts with all data together in the same model
2. setting the model and contrasts with only subset of data per contrast

total 24 samples in my dataset are divided into 12 groups, but each group has only 2 samples unfortunately. Here I only discuss 3 contrasts pairwise beteween 2 of the 3 groups (H.A, H.G, H.GA) as shown in code below:

In listed three contrasts (H.G_A, H.GA_A, H.G_GA,), for the first two of the contrasts I saw reasonably similar numbers DEGs in between the two ways of model settings, but for the 3rd contrast (H.G_GA,) , I saw almost no DEGs at all in the 2nd way, but saw quite a lot DEGs in the first way. The details are described below.

The other two contreasts have much simiar results of DEGs in terms of numbers of DEGs despite minor difference, which are not shown here.

Does anyone have insights why this is happening? Which way is the better way? Or how can we tell which way to choose based on each particular dataset? Any advice would be highly appreciated! Not sure if the community has any experience or solution for such situations. I know the numbers of the samples in my dataset are not great for each group (only 2, although I have many groups), and maybe seeing all the data would help assess the variance of the data, but did not expect that including all data in the same roof of model would make such a big difference in DEGs output for the same contrast comparing to including just subset of the data in the model

Looking at the total genes for contrast H.G_GA, it seems that 2nd way filtered out more genes than the first way, although I used keep <- filterByExpr(y,group=Group) in both ways. I did check and find that most of the DEGs from the first way for contrast H.G_GA are indeed in the total gene list of 2nd way after filtering (filterByExpr), but were not called as DEGs by the 2nd way for whatever reasons.

Code should be placed in three backticks as shown below

################################################################################################################
#### the first way: the main code for setting the model and contrasts with all data together in the same model

library(reshape)
library(edgeR)
library(org.Mm.eg.db)

##total 24 samples
> show(tar5)
Types Replicate Status Types.Stat
Ctrl.A.13   Ctrl         1      A     Ctrl.A
Ctrl.A.16   Ctrl         2      A     Ctrl.A
......#more rows for Ctrl not shown
H.A.14         H         1      A        H.A
H.A.17         H         2      A        H.A
H.G.8          H         1      G        H.G
H.G.11         H         2      G        H.G
H.GA.20        H         1     GA       H.GA
H.GA.23        H         2     GA       H.GA
.....##more rows for other groups not shown

....##more columns for more samples not shown
Ctrl.GA.22 Ctrl.WT.1 Ctrl.WT.4 H.A.14 H.A.17 H.G.8
ENSMUSG00000000001.4_Gnai3        8593      9090      8040   8949   9795  6949
ENSMUSG00000000003.15_Pbsn           0         0         0      0      0     0
ENSMUSG00000000028.15_Cdc45       1126      1023       705    949   1041   555
ENSMUSG00000000031.16_H19         6654      4537      8610   1321   1268  2707
ENSMUSG00000000037.16_Scml2        166        99       111     65     47    42
ENSMUSG00000000049.11_Apoh           9        10         6      0      9     1
H.G.11 H.GA.20 H.GA.23 H.WT.2 H.WT.5
ENSMUSG00000000001.4_Gnai3    8979    8749    8435  10465   9567
ENSMUSG00000000003.15_Pbsn       0       0       0      0      0
ENSMUSG00000000028.15_Cdc45    539     766    1048    902    683
ENSMUSG00000000031.16_H19     3471    2270    2157   3324   5896
ENSMUSG00000000037.16_Scml2     46      61      72     59     66
ENSMUSG00000000049.11_Apoh       7       3      12      9      8
....##more columns for more samples not shown

> show("all(colnames(mydata5)==rownames(tar5)")
[1] "all(colnames(mydata5)==rownames(tar5)"
> show(all(colnames(mydata5)==rownames(tar5)))
[1] TRUE

rawdata<-data.frame(SYMBOL=sapply(strsplit(rownames(mydata5), split="\\_"), function(x) x[2]),
ENSID=sapply(strsplit(rownames(mydata5), split="\\_"), function(x) x[1]),mydata5, stringsAsFactors = FALSE);

Group <- factor(tar5$Types.Stat) y <- DGEList(counts=rawdata[,3:ncol(rawdata)], genes=rawdata[,1:2], group=Group) o <- order(rowSums(y$counts), decreasing=TRUE)
y <- y[o,]
d <- duplicated(y$genes$SYMBOL)
y <- y[!d,]
idfound <- y$genes$SYMBOL %in% mappedRkeys(org.Mm.egSYMBOL)
y <- y[idfound,];
egSYMBOL<- toTable(org.Mm.egSYMBOL)

m <- match(y$genes$SYMBOL, egSYMBOL$symbol) y$genes$EntrezGene <- egSYMBOL$gene_id[m]

keep <- filterByExpr(y,group=Group)
y <- y[keep, , keep.lib.sizes=FALSE]
rownames(y$counts) <- rownames(y$genes) <- y$genes$EntrezGene
y$genes$EntrezGene <- NULL

y <- calcNormFactors(y)

design <- model.matrix(~0+Group)
colnames(design) <- levels(Group)
rownames(design) <- colnames(y)

y<- estimateDisp(y,design, robust=TRUE)
fit <- glmQLFit(y,design,robust=TRUE);

my.contrasts <- makeContrasts(
H.GA_A=H.GA-H.A, H.G_GA=H.G-H.GA,H.G_A=H.G-H.A, levels=design)

for(i in 1:length(colnames(my.contrasts))){
currContrast<-colnames(my.contrasts)[i];
currTypes<-sapply(strsplit(currContrast, split="\\."), function(x) x[1])
show(currContrast)
qlf <- glmQLFTest(fit, contrast=my.contrasts[,currContrast])
show("summary(decideTests(qlf))")
show(summary(decideTests(qlf)))
}

#######################################################################
##The 2nd way: the main code for setting the model and contrasts with only subset of data per contrast as below:

mydata5T<-mydata5;
tar5T<-tar5;

AllTypes<-unique(tar5T$Types) ConL<-list() SamL<-list(); relevelRefL<-list(); kk<-1; ConL[[kk]]<-"G-GA"; SamL[[kk]]<-c("G","GA") relevelRefL[[kk]]<-"GA"; kk<-kk+1; ConL[[kk]]<-"G-A"; SamL[[kk]]<-c("G","A") relevelRefL[[kk]]<-"A"; kk<-kk+1; ConL[[kk]]<-"GA-A"; SamL[[kk]]<-c("GA","A") relevelRefL[[kk]]<-"A"; for(ii in 1:length(AllTypes)){ currTypes<-AllTypes[ii]; for(kk in 1:length(ConL)){ tar5<-tar5T[tar5T$Types==AllTypes[ii] & tar5T$Status %in% SamL[[kk]],] mydata5<-mydata5T[,rownames(tar5)] rawdata<-data.frame(SYMBOL=sapply(strsplit(rownames(mydata5), split="\\_"), function(x) x[2]), ENSID=sapply(strsplit(rownames(mydata5), split="\\_"), function(x) x[1]),mydata5, stringsAsFactors = FALSE); Group <- factor(tar5$Status);
Group<- relevel(Group, ref=relevelRefL[[kk]]);

y <- DGEList(counts=rawdata[,3:ncol(rawdata)], genes=rawdata[,1:2], group=Group)

o <- order(rowSums(y$counts), decreasing=TRUE) y <- y[o,] d <- duplicated(y$genes$SYMBOL) y <- y[!d,] idfound <- y$genes$SYMBOL %in% mappedRkeys(org.Mm.egSYMBOL) y <- y[idfound,]; egSYMBOL<- toTable(org.Mm.egSYMBOL) m <- match(y$genes$SYMBOL, egSYMBOL$symbol)
y$genes$EntrezGene <- egSYMBOL$gene_id[m] keep <- filterByExpr(y,group=Group) y <- y[keep, , keep.lib.sizes=FALSE] rownames(y$counts) <- rownames(y$genes) <- y$genes$EntrezGene y$genes\$EntrezGene <- NULL
y <- calcNormFactors(y)

design <- model.matrix(~0+Group)
colnames(design) <- levels(Group)
rownames(design) <- colnames(y)

y <- estimateDisp(y,design, robust=TRUE)

fit <- glmQLFit(y,design,robust=TRUE);
currContrast <- makeContrasts(ConL[[kk]], levels=design)
qlf <- glmQLFTest(fit, contrast=currContrast)

show(currTypes)
show(ConL[[kk]])
show("summary(decideTests(qlf))")
show(summary(decideTests(qlf)))
}
}

# include your problematic code here with any corresponding output
# please also include the results of running the following in an R session

##The output from the first way of model setting for contrast H.G_GA which is between H.G vs H.GA
[1] "H.G_GA"
[1] "summary(decideTests(qlf))"
1*H.G -1*H.GA
Down             528
NotSig         14280
Up               574

##The output from the 2nd way of model setting for contrast H.G_GA which is between H.G vs H.GA
[1] "H"
[1] "G-GA"
[1] "summary(decideTests(qlf))"
-1*GA 1*G
Down           0
NotSig     14249
Up             1

sessionInfo( )

R version 4.0.2 (2020-06-22)
Platform: x86_64-pc-linux-gnu (64-bit)
Running under: CentOS Linux 7 (Core)

Matrix products: default
BLAS:   /mnt/nasapps/development/R/4.0.2/lib64/R/lib/libRblas.so
LAPACK: /mnt/nasapps/development/R/4.0.2/lib64/R/lib/libRlapack.so

locale:
[1] LC_CTYPE=en_US.UTF-8       LC_NUMERIC=C
[3] LC_TIME=en_US.UTF-8        LC_COLLATE=en_US.UTF-8
[5] LC_MONETARY=en_US.UTF-8    LC_MESSAGES=en_US.UTF-8
[7] LC_PAPER=en_US.UTF-8       LC_NAME=C
[11] LC_MEASUREMENT=en_US.UTF-8 LC_IDENTIFICATION=C

attached base packages:
[1] parallel  stats4    stats     graphics  grDevices utils     datasets
[8] methods   base

other attached packages:
[1] org.Mm.eg.db_3.12.0  AnnotationDbi_1.52.0 IRanges_2.24.1
[4] S4Vectors_0.28.1     Biobase_2.50.0       BiocGenerics_0.38.0
[7] edgeR_3.32.1         limma_3.46.0         reshape_0.8.8
[10] reshape2_1.4.4       ggrepel_0.9.1        ggplot2_3.3.5

loaded via a namespace (and not attached):
[1] Rcpp_1.0.7       pillar_1.6.2     compiler_4.0.2   plyr_1.8.6
[5] tools_4.0.2      statmod_1.4.36   bit_4.0.4        memoise_2.0.0
[9] RSQLite_2.2.7    lifecycle_1.0.0  tibble_3.1.3     gtable_0.3.0
[13] lattice_0.20-44  pkgconfig_2.0.3  rlang_0.4.11     DBI_1.1.1
[17] fastmap_1.1.0    withr_2.4.2      dplyr_1.0.7      stringr_1.4.0
[21] generics_0.1.0   vctrs_0.3.8      bit64_4.0.5      locfit_1.5-9.4
[25] grid_4.0.2       tidyselect_1.1.1 glue_1.4.2       R6_2.5.0
[29] fansi_0.5.0      blob_1.2.2       purrr_0.3.4      magrittr_2.0.1
[33] splines_4.0.2    scales_1.1.1     ellipsis_0.3.2   assertthat_0.2.1
[37] colorspace_2.0-2 utf8_1.2.2       stringi_1.7.3    munsell_0.5.0
[41] cachem_1.0.5     crayon_1.4.1

RNAseq edgeR DEGs • 167 views
0
Entering edit mode
@gordon-smyth
Last seen 50 minutes ago
WEHI, Melbourne, Australia

edgeR is designed to analyse datasets as a whole rather than to chop them up into little pieces. It is a universal property of anova in general, not just of edgeR. that it is more powerful to analyse all the groups together rather than to do pairwise comparisons with subsets of the data. It is obviously unrealistic to expect to estimate the variance reliably from just n=2 samples, which is the reason that comparing n=2 vs n=2 loses so much power and gives fewer DE genes.

0
Entering edit mode

Dear Dr. Smyth:

Thank you so much for your insightful comments, which are very much expected, convincing, and perfect answer to me, and very helpful as always. I have been offered great advices from you years back and still as always, you are the greatest in the world!! Really appreciated and very happy in R world we have you!!

Best wishes and have a greart weekend!! Ming