#### The support.bioconductor.org editor has been updated to markdown! Please see more info at: Tutorial: Updated Support Site Editor

Question: Clarification regarding analysis with multi-factorial design
0
14 months ago by
scanchi0
scanchi0 wrote:

The analysis is for human RNAseq dataset with control (N=151) and disease condition 1 (N=107) and disease condition 2 (N=260). There is evidence of batch effects as well as effects of some demographic variables between the groups (MDS/PCA plots). To account for those differences this was the design for the analysis:

expression ~ age + sex + race + presence of risk gene allele + presence of infarcts +

presence of additional pathology + batch + postmortem interval (PMI) + condition.

Of these age and PMI are continuous variables while are remaining are factors with 2 (sex, presence of risk allele, presence of infarcts, condition), 3 (race), 4 (presence of additional pathology) and 9 (Batch) levels.

When I analyze all samples together but include contrasts of interest (condition 1 vs control, condition 2 vs control) I get 0 DGE between condition 1 vs control. When I analyze condition 1 vs control separately I get ~100 DGE (48 Down, 69 Up) with marginal adj.p.values (0.02 - 0.04). The CV when I include all the samples is 0.26 and when I analyze control and condition 1 separately is 0.27. In the analysis of control vs condition 1 the range for prior.df is (2.89-4.75) while the range (3.76-4.19) narrows down in the analysis of all the samples. Similarly, the tagwise dispersion difference which was greater in the analysis of control vs condition 1 (0.004 -~38) when compared to analysis of all the samples (0.004- ~29). So does that mean there is more variability in individual observations or is it variability in the estimation when only the two groups are compared ?

I used the voom with weights for all my analysis. Plotting the sample weights against all covariates does not show any obvious trend or segregation for both the control vs condition 1 analysis and the analysis with all samples. The range for weights and sample weights in both the cases i.e. two group comparison vs all samples comparison are very similar with only improvement in the upper bound of the values in the latter case. I utilized the SVA package to identify any latent variables that might influence the gene expression but get n.sv =0 with the model of interest.

I repeated the condition 1 vs control analysis with DESeq2, with the same design and get 13 DGE (5 Up, 8 Down) with similar padj range (0.01-0.04).

In this case would this be the right conclusions: No apparent DGE between condition 1 vs control when all samples are analyzed together is not reflective of the underlying biology. Even with N > 100 , the high degree of variance in the observations associated with condition 1 makes it statistically under-powered to detect changes. Detection of few DGE with marginal adjusted p-values when condition 1 vs control are analyzed separately point to potential for improvement by increasing sample size ?

Here is the relevant code that was same between two group and all group comparison.

y=DGEList(txi\$counts)
y=calcNormFactors(y)
mean_log_cpm=aveLogCPM(y)
keep_genes=mean_log_cpm>=1
y=y[keep_genes,]
v=voomWithQualityWeights(y,design)
vfit=lmFit(v,design)
contr.matrix=makeContrasts(c1.ctr=condition1-control,c2.ctr=condition2-control,c2.c1=condition2-condition1, levels=colnames(design))
efit=eBayes(vfit,robust=TRUE)
sample_cv=estimateDisp(y,design,robust=TRUE)

dds=DESeqDataSetFromTximport(txi,colData=samples,design) dds=DESeq(dds) res=results(dds,alpha=0.05)

SessionInfo

R version 3.4.0 (2017-04-21) Platform: x86_64-pc-linux-gnu (64-bit) Running under: CentOS release 6.6 (Final)

Matrix products: default BLAS: /opt/R/lib64/R/lib/libRblas.so LAPACK: /opt/R/lib64/R/lib/libRlapack.so

locale: [1] C

attached base packages: [1] parallel  stats4    stats     graphics  grDevices utils     datasets [8] methods   base

other attached packages:  [1] tximport_1.4.0             DESeq2_1.16.1  [3] SummarizedExperiment_1.6.3 DelayedArray_0.2.7  [5] matrixStats_0.52.2         Biobase_2.36.2  [7] GenomicRanges_1.28.4       GenomeInfoDb_1.12.2  [9] IRanges_2.10.2             S4Vectors_0.14.3 [11] BiocGenerics_0.22.0        edgeR_3.18.1 [13] limma_3.32.3               sva_3.24.4 [15] BiocParallel_1.10.1        genefilter_1.58.1 [17] mgcv_1.8-22                nlme_3.1-131

loaded via a namespace (and not attached):  [1] bitops_1.0-6            bit64_0.9-7             RColorBrewer_1.1-2  [4] tools_3.4.0             backports_1.1.0         rpart_4.1-11  [7] Hmisc_4.0-3             DBI_0.7                 lazyeval_0.2.0 [10] colorspace_1.3-2        nnet_7.3-12             graphite_1.22.0 [13] gridExtra_2.2.1         bit_1.1-12              compiler_3.4.0 [16] graph_1.54.0            htmlTable_1.9           scales_0.4.1 [19] checkmate_1.8.3         rappdirs_0.3.1          stringr_1.2.0 [22] digest_0.6.12           foreign_0.8-68          DOSE_3.2.0 [25] XVector_0.16.0          base64enc_0.1-3         pkgconfig_2.0.1 [28] htmltools_0.3.6         htmlwidgets_0.9         rlang_0.1.1 [31] RSQLite_2.0             acepack_1.4.1           GOSemSim_2.2.0 [34] RCurl_1.95-4.8          magrittr_1.5            GO.db_3.4.1 [37] GenomeInfoDbData_0.99.0 Formula_1.2-2           Matrix_1.2-12 [40] Rcpp_0.12.12            munsell_0.4.3           stringi_1.1.5 [43] zlibbioc_1.22.0         plyr_1.8.4              qvalue_2.8.0 [46] grid_3.4.0              blob_1.1.0              DO.db_2.9 [49] lattice_0.20-35         splines_3.4.0           annotate_1.54.0 [52] locfit_1.5-9.1          knitr_1.16              fgsea_1.2.1 [55] igraph_1.1.2            geneplotter_1.54.0      reshape2_1.4.2 [58] fastmatch_1.1-0         XML_3.98-1.9            glue_1.2.0 [61] latticeExtra_0.6-28     data.table_1.10.4       gtable_0.2.0 [64] purrr_0.2.4             tidyr_0.7.2             ggplot2_2.2.1 [67] ReactomePA_1.20.2       xtable_1.8-2            reactome.db_1.59.1 [70] survival_2.41-3         tibble_1.3.3            clusterProfiler_3.4.4 [73] rvcheck_0.0.9           AnnotationDbi_1.38.1    memoise_1.1.0 [76] cluster_2.0.6

modified 14 months ago • written 14 months ago by scanchi0

I'd recommend to pick one package and go with that. Here it seems your primary analysis is limma-voom, so I'm removing the DESeq2 tag.

Answer: Clarification regarding analysis with multi-factorial design
0
14 months ago by
Aaron Lun22k
Cambridge, United Kingdom
Aaron Lun22k wrote:

For starters, you should filter before normalization, though this is not the (major) cause of your problem.

Secondly, your topTable call does not test for condition. Rather, it does an ANOVA testing the null hypothesis that all non-intercept terms in your design matrix are equal to zero, including your various blocking factors (see the default for coef=NULL in ?topTable). I do not see any evidence for a contrasts.fit call to specify comparisons between specific groups. Obviously, the ANOVA null hypothesis will differ when you discard all samples from disease condition 2, so it's hardly a surprise that you get different numbers of DE genes.

Answer: Clarification regarding analysis with multi-factorial design
0
14 months ago by
scanchi0
scanchi0 wrote:

Thank you for the pointers Aaron. I did use the contrast matrix and the specific coef in the topTable call. I have amended my post to reflect that as well. The condition 1 and control are very distinct groups and no significant DGE between the two groups was surprising. The p-value distribution for condition 1 vs control in all the group analysis was not that of null hypothesis and the DGE's for condition 2 vs control > DGE's for condition 2 vs condition 1. This prompted me to look at the two group comparison for condition 1 vs control separately. While there is evidence of some DGE in the two group analysis, the low number as well as the marginal p-adjusted values makes me wonder if I could interpret this to reflect the underlying similarity between the two groups ? Although plausible, I am not sure how this trend would change by increasing N for condition 1.

Secondly, you may think that your groups are biologically different, but if your replicate samples (i.e., patients) are highly variable, then you won't have the statistical power to detect these differences. This is a fairly simple explanation for having no DE genes. If you don't accept this argument, then think of a positive control gene that you expect to be DE between the groups, and plot its expression values. This can often give useful clues as to the cause of loss of power, e.g., batch effects.

Thirdly, I simply cannot parse this sentence: "The p-value distribution for condition 1 vs control in all the group analysis was not that of null hypothesis and the DGE's for condition 2 vs control > DGE's for condition 2 vs condition 1." What do you mean by saying that a p-value distribution "was not that of null hypothesis"? I can't figure out what you mean in your second part either, some language other than a math/English hybrid would be appreciated.

Fourthly, the code in your top post is still incomplete. Did you rename colnames(design)? What are the sva commands? Where is the command you used for subsetting the data set in your condition 1/control-only analysis? What contrast matrix did you use for the condition 1/control-only analysis? How did you define coef?

Fifthly, your versions of the R/Bioconductor packages are out of date. It does no harm to update to Bioconductor 3.6. You can either do it now, or you can do it just before publication and freak out your co-authors when the results change.

Sixthly, assuming that you did everything correctly, it is not surprising that the results are slightly different when you discard all the condition 2 samples. All of the data are used in estimating the variance, even if the comparison is only performed between condition 1 and control, so removing the condition 2 values will affect the outcome. In addition, you are using an additive model, which means that the condition 2 samples will also contribute to the estimation of the various blocking factors (e.g., age, sex, race). This, in turn, will affect the estimation of the disease/control effect. Different data will give different results, that's just a fact of life.

Finally, I doubt that the condition 1/control-only analysis yielded genes with dramatically different p-values from the all-sample analysis, given that the metrics in your original post were very similar between the two analyses. Indeed, with a minimum adjusted p-value of 0.02, the general conclusion is still the same in both analyses; weak to no DE between groups. Common sense applies here: the FDR threshold isn't some magic number where genes suddenly become biologically relevant and interesting as soon as they have an adjusted p-value below 0.05. If you want some DE genes to explore with the all-sample analysis, just consider a higher FDR threshold. Yes, this means that you will have some more false positives, but that's what validation studies are for.