Dear all,
I am using for the first time "ballgown" for differential expression of transcripts (after tophat2 + Stringtie).
I do have troubles interpreting the results (although it is not really ballgown specific but rather my confusion regarding complex/custom models).
Background:
I am working with 58 samples extracted from mice, so I have a "paired" effect to take into account.
Regardless of the paired effect, the samples are assigned to different experimental groups: parts of the brain (x 4), gender (x 2), and time points (x 3 , but I am at the moment not interested in a time course analysis).
The phenotype information is set as:
pData(bg) = data.frame(id=sampleNames(bg), brain=brainPart, gender=genderList, time=timePoints, pairs=samplePairs)
I run:
1.
stat_results = stattest(bg, feature='transcript', meas='FPKM', covariate='brain', adjustvars='pairs')
2.
mod = model.matrix(~ pData(bg)$brain + pData(bg)$gender + pData(bg)$time)
mod0 = model.matrix(~ pData(bg)$gender + pData(bg)$time)
custom_results = stattest(bg, feature='transcript', meas='FPKM', mod0=mod0, mod=mod, adjustvars="pairs")
Are most significant results to be interpreted as:
stat_results: "transcripts that are most likely to change in expression across the 4 brain regions, regardless of gender and time"
custom_results: "transcripts that are most likely to change in expression across the 4 brain regions, taking into account gender and time"
?
Sorry if I'm completely confused here.
Also, I would like to know if it is possible to perform a simple pairwise comparison between 2 parts of the brain, even though the "brain" column contains 4 variables?
Thanks a lot!
Best,
sessionInfo()
R version 3.2.3 (2015-12-10)
Platform: x86_64-redhat-linux-gnu (64-bit)
Running under: Scientific Linux release 6.5 (Carbon)
locale:
[1] LC_CTYPE=en_US.UTF-8 LC_NUMERIC=C
[3] LC_TIME=en_US.UTF-8 LC_COLLATE=en_US.UTF-8
[5] LC_MONETARY=en_US.UTF-8 LC_MESSAGES=en_US.UTF-8
[7] LC_PAPER=en_US.UTF-8 LC_NAME=C
[9] LC_ADDRESS=C LC_TELEPHONE=C
[11] LC_MEASUREMENT=en_US.UTF-8 LC_IDENTIFICATION=C
attached base packages:
[1] stats graphics grDevices utils datasets methods base
other attached packages:
[1] ballgown_2.2.0
loaded via a namespace (and not attached):
[1] AnnotationDbi_1.32.3 XVector_0.10.0
[3] GenomicRanges_1.22.4 BiocGenerics_0.16.1
[5] splines_3.2.3 zlibbioc_1.16.0
[7] GenomicAlignments_1.6.3 IRanges_2.4.7
[9] BiocParallel_1.4.3 xtable_1.8-2
[11] lattice_0.20-33 GenomeInfoDb_1.6.3
[13] tools_3.2.3 grid_3.2.3
[15] SummarizedExperiment_1.0.2 parallel_3.2.3
[17] nlme_3.1-122 Biobase_2.30.0
[19] mgcv_1.8-9 DBI_0.3.1
[21] genefilter_1.52.1 lambda.r_1.1.7
[23] futile.logger_1.4.1 survival_2.38-3
[25] Matrix_1.2-3 sva_3.18.0
[27] rtracklayer_1.30.3 RColorBrewer_1.1-2
[29] S4Vectors_0.8.11 futile.options_1.0.0
[31] bitops_1.0-6 RCurl_1.95-4.7
[33] RSQLite_1.0.0 limma_3.26.8
[35] Biostrings_2.38.4 Rsamtools_1.22.0
[37] stats4_3.2.3 XML_3.98-1.3
[39] annotate_1.48.0
Hi, I have a follow up to this. Reading this post and the documentation it isn't clear to me how to add the library size adjustment into the equation when using mod and mod0? The example in the documentation for instance doesn't include it.
It's often easiest to add the adjustment variable directly to pData (so everything is in the right order and is the right length, etc). After doing that, the idea is to include the library size normalization factor in both mod and mod0.
For example, to use the median expression value for each sample as the adjustment, you could do:
The adjustment in the two models is just a covariate in both linear models (so the estimated coefficient for `group` is change in expression between a sample from one group and a sample from the other, assuming equal library sizes -- the standard interpretation for coefficients in multiple linear regression models).
Great thanks! That is much clearer to me now.
'The top transcripts here would then be "transcripts showing the most significant expression differences across the 4 brain regions, ignoring gender and time and holding "pairs" constant."'
Could you please clarify what it means to hold a variable constant? It gets the same fixed value for all samples?
sorry, I used statistical jargon! usually when discussing how to interpret linear regression coefficients, "hold a variable constant" means we're assuming we're comparing two data points with the same value of that variable. For example, the brain region coefficient "holding pair constant" means it's the difference in expression *between brain regions* we estimate between data points in the same pair.