ballgown custom models
2
1
Entering edit mode
sbcn ▴ 70
@sbcn-4752
Last seen 7 weeks ago
Spain

Dear all,

I am using for the first time "ballgown" for differential expression of transcripts (after tophat2 + Stringtie).

I do have troubles interpreting the results (although it is not really ballgown specific but rather my confusion regarding complex/custom models).

Background:
I am working with 58 samples extracted from mice, so I have a "paired" effect to take into account.
Regardless of the paired effect, the samples are assigned to different experimental groups: parts of the brain (x 4), gender (x 2), and time points (x 3 , but I am at the moment not interested in a time course analysis).

The phenotype information is set as:

pData(bg) = data.frame(id=sampleNames(bg), brain=brainPart, gender=genderList, time=timePoints, pairs=samplePairs)

I run:

1.
stat_results = stattest(bg, feature='transcript', meas='FPKM', covariate='brain', adjustvars='pairs')

2.
mod = model.matrix(~ pData(bg)$brain + pData(bg)$gender + pData(bg)$time) mod0 = model.matrix(~ pData(bg)$gender + pData(bg)$time) custom_results = stattest(bg, feature='transcript', meas='FPKM', mod0=mod0, mod=mod, adjustvars="pairs") Are most significant results to be interpreted as: stat_results: "transcripts that are most likely to change in expression across the 4 brain regions, regardless of gender and time" custom_results: "transcripts that are most likely to change in expression across the 4 brain regions, taking into account gender and time" ? Sorry if I'm completely confused here. Also, I would like to know if it is possible to perform a simple pairwise comparison between 2 parts of the brain, even though the "brain" column contains 4 variables? Thanks a lot! Best, sessionInfo() R version 3.2.3 (2015-12-10) Platform: x86_64-redhat-linux-gnu (64-bit) Running under: Scientific Linux release 6.5 (Carbon) locale: [1] LC_CTYPE=en_US.UTF-8 LC_NUMERIC=C [3] LC_TIME=en_US.UTF-8 LC_COLLATE=en_US.UTF-8 [5] LC_MONETARY=en_US.UTF-8 LC_MESSAGES=en_US.UTF-8 [7] LC_PAPER=en_US.UTF-8 LC_NAME=C [9] LC_ADDRESS=C LC_TELEPHONE=C [11] LC_MEASUREMENT=en_US.UTF-8 LC_IDENTIFICATION=C attached base packages: [1] stats graphics grDevices utils datasets methods base other attached packages: [1] ballgown_2.2.0 loaded via a namespace (and not attached): [1] AnnotationDbi_1.32.3 XVector_0.10.0 [3] GenomicRanges_1.22.4 BiocGenerics_0.16.1 [5] splines_3.2.3 zlibbioc_1.16.0 [7] GenomicAlignments_1.6.3 IRanges_2.4.7 [9] BiocParallel_1.4.3 xtable_1.8-2 [11] lattice_0.20-33 GenomeInfoDb_1.6.3 [13] tools_3.2.3 grid_3.2.3 [15] SummarizedExperiment_1.0.2 parallel_3.2.3 [17] nlme_3.1-122 Biobase_2.30.0 [19] mgcv_1.8-9 DBI_0.3.1 [21] genefilter_1.52.1 lambda.r_1.1.7 [23] futile.logger_1.4.1 survival_2.38-3 [25] Matrix_1.2-3 sva_3.18.0 [27] rtracklayer_1.30.3 RColorBrewer_1.1-2 [29] S4Vectors_0.8.11 futile.options_1.0.0 [31] bitops_1.0-6 RCurl_1.95-4.7 [33] RSQLite_1.0.0 limma_3.26.8 [35] Biostrings_2.38.4 Rsamtools_1.22.0 [37] stats4_3.2.3 XML_3.98-1.3 [39] annotate_1.48.0 ballgown linear model model matrix custom model • 2.2k views ADD COMMENT 1 Entering edit mode Alyssa Frazee ▴ 210 @alyssa-frazee-6710 Last seen 10 months ago San Francisco, CA, USA Let's see if I can clarify some of the ballgown-specific questions here! In "stat_results", your specification will perform a differential expression test between brain regions, holding the "pairs" variable constant (this is equivalent to comparing a model including only pairs + library size to a model including pairs + library size + brain). This will *not* do any adjustment for time or gender. The top transcripts here would then be "transcripts showing the most significant expression differences across the 4 brain regions, ignoring gender and time and holding "pairs" constant." It's not totally clear to me from this description what the pairs variable is, but I'm assuming you have two technical replicates for each mouse (?). In that case, R will interpret "pair" as approximately "is this sample replicate A or replicate B for this mouse," which means it will assume that being replicate A vs. replicate B is meaningful in some way. I am guessing this is not exactly what you want. Instead I would suggest either (a) combining expression measurements across technical replicates (i.e., taking the average), or (b) including "mouse" as a variable instead and adjusting for that. (Instead of labeling replicate A vs. replicate B, have mouse ID as a covariate.) If you have more questions on this model setup specifically, I'd recommend talking to a statistician at your institution as this is more a question of how to set up the linear model comparison and less a ballgown-specific question. In "custom_results", your specification will perform a differential expression test between brain regions, holding the "time" and "gender" variables constant and ignoring pairs (and library size). Note that (per the documentation in ?stattest, in the 'Details' section in the paragraph beginning with "'mod' and 'mod0'...") the "adjustvars" argument is ignored when you provide custom models. Specifically (from the docs): "...if you supply ‘mod’ and ‘mod0’, ‘covariate’, ‘timecourse’, ‘adjustvars’, and ‘df’ are ignored, so make sure your covariate of interest and all appropriate confounder adjustments, including library size, are specified in ‘mod’ and ‘mod0’." The top transcripts in "custom_results" as it stands would be "transcripts showing the most significant expression differences across the 4 brain regions, holding gender and time constant and ignoring pairs." As for pairwise comparisons between two brain regions, I'd recommend subsetting your ballgown object to only the two brain regions you are interested in (see ?subset -- there is a subset method for ballgown objects). An example would be: bg_regions12 = subset(bg, 'brain == 1 | brain == 2', genomesubset=FALSE) (to, e.g., subset to just regions 1 and 2 of the brain.) Then you could use "stattest" on "bg_regions12." Hope this helps! ADD COMMENT 0 Entering edit mode Hi, I have a follow up to this. Reading this post and the documentation it isn't clear to me how to add the library size adjustment into the equation when using mod and mod0? The example in the documentation for instance doesn't include it. ADD REPLY 0 Entering edit mode It's often easiest to add the adjustment variable directly to pData (so everything is in the right order and is the right length, etc). After doing that, the idea is to include the library size normalization factor in both mod and mod0. For example, to use the median expression value for each sample as the adjustment, you could do: medians = apply(texpr(bg), 2, median) pData$lib_size = medians  # make sure sample IDs match columns of texpr(bg)
mod0 = model.matrix(~ pData(bg)$lib_size) mod = model.matrix(~ pData(bg)$lib_size + pData(bg)\$group)


The adjustment in the two models is just a covariate in both linear models (so the estimated coefficient for group is change in expression between a sample from one group and a sample from the other, assuming equal library sizes -- the standard interpretation for coefficients in multiple linear regression models).

0
Entering edit mode

Great thanks! That is much clearer to me now.

0
Entering edit mode

'The top transcripts here would then be "transcripts showing the most significant expression differences across the 4 brain regions, ignoring gender and time and holding "pairs" constant."'

Could you please clarify what it means to hold a variable constant? It gets the same fixed value for all samples?

0
Entering edit mode

sorry, I used statistical jargon! usually when discussing how to interpret linear regression coefficients, "hold a variable constant" means we're assuming we're comparing two data points with the same value of that variable. For example, the brain region coefficient "holding pair constant" means it's the difference in expression *between brain regions* we estimate between data points in the same pair.

0
Entering edit mode
sbcn ▴ 70
@sbcn-4752
Last seen 7 weeks ago
Spain

Hi Alyssa,

It definitely helps a lot, thanks!

I understand much better now and was able to run the analysis with more confidence.

What I mean by pairs - maybe I shouldn't name it this way - are not exactly technical replicates: different regions in the brain from the same animal were sampled. Each of these regions will be compared as they are of interest, but as they come from the same animal, it is a bias to take into account.

So essentially for animal A we will take regions 1 and 2, and from animal B we will take regions 1 and 2 as well, and so on. But when we check the difference between region 2 and 1, we want to take into account the fact that there might be an "individual/animal" bias.

So that is what I called pairs and I wanted to introduce in the model: samples from the same animal but that represent different experimental conditions.

I don't know if I am clear :-)

In any case, thanks a lot for your help!