Search
Question: Warning messages from diffSplice in Limma
1
17 months ago by
willj30
France
willj30 wrote:

I have data from Affy Human exon arrays 1.0 ST, and I'm using the diffSplice function in Limma to detect differential splicing. I'm getting some warnings (see below) when I call diffSplice and I'd like to know if I should be concerned about them (results from topSplice and plotSplice look fine as far as I can tell).

The experiment has a Healthy group and 3 disease groups, with a design as follows:

> design
Dis1 Healthy Dis2 Dis3
1    1       0       0   0
2    1       0       0   0
3    1       0       0   0
4    1       0       0   0
5    1       0       0   0
6    1       0       0   0
7    0       1       0   0
8    0       1       0   0
9    0       1       0   0
10   0       1       0   0
11   0       1       0   0
12   0       1       0   0
13   0       0       1   0
14   0       0       1   0
15   0       0       1   0
16   0       0       1   0
17   0       0       1   0
18   0       0       1   0
19   0       0       0   1
20   0       0       0   1
21   0       0       0   1
22   0       0       0   1
23   0       0       0   1
24   0       0       0   1
attr(,"assign")
[1] 1 1 1 1
attr(,"contrasts")
attr(,"contrasts")$Conds [1] "contr.treatment" When I run diffSplice it gives some warning messages: > fit <- lmFit(rma.probesets.filt, design) > dSplice <- diffSplice(fit[,"Dis1"], geneid="geneNames", exonid="start") Total number of exons: 787188 Total number of genes: 31123 Number of genes with 1 exon: 4769 Mean number of exons in a gene: 25 Max number of exons in a gene: 504965 Warning messages: 1: In rowsum.default(exon.stat, geneid, reorder = FALSE) : missing values for 'group' 2: In rowsum.default(u2, geneid, reorder = FALSE) : missing values for 'group' 3: In rowsum.default(exon.coefficients * u2, geneid, reorder = FALSE) : missing values for 'group' 4: In rowsum.default(exon.t^2, geneid, reorder = FALSE) : missing values for 'group' Should I be worried about these warning messages? Also (perhaps related?) should I be worried that my geneid column has lots of unlabelled probesets that are all being considered by diffSplice as one giant gene with 504,965 exons? Many thanks for any help/advice. > sessionInfo() R version 3.3.2 (2016-10-31) Platform: x86_64-pc-linux-gnu (64-bit) Running under: Ubuntu 16.04.1 LTS locale: [1] LC_CTYPE=en_GB.UTF-8 LC_NUMERIC=C LC_TIME=en_GB.UTF-8 LC_COLLATE=en_GB.UTF-8 [5] LC_MONETARY=en_GB.UTF-8 LC_MESSAGES=en_GB.UTF-8 LC_PAPER=en_GB.UTF-8 LC_NAME=C [9] LC_ADDRESS=C LC_TELEPHONE=C LC_MEASUREMENT=en_GB.UTF-8 LC_IDENTIFICATION=C attached base packages: [1] stats4 parallel stats graphics grDevices utils datasets methods base other attached packages: [1] genefilter_1.56.0 pd.huex.1.0.st.v2_3.14.1 DBI_0.5-1 RSQLite_1.1-2 [5] oligo_1.38.0 Biostrings_2.42.1 XVector_0.14.0 IRanges_2.8.1 [9] S4Vectors_0.12.1 oligoClasses_1.36.0 limma_3.30.8 Biobase_2.34.0 [13] BiocGenerics_0.20.0 BiocInstaller_1.24.0 loaded via a namespace (and not attached): [1] Rcpp_0.12.9 GenomeInfoDb_1.10.2 bitops_1.0-6 iterators_1.0.8 [5] tools_3.3.2 zlibbioc_1.20.0 digest_0.6.11 statmod_1.4.27 [9] bit_1.1-12 annotate_1.52.1 memoise_1.0.0 preprocessCore_1.36.0 [13] lattice_0.20-34 ff_2.2-13 Matrix_1.2-7.1 foreach_1.4.3 [17] affxparser_1.46.0 grid_3.3.2 AnnotationDbi_1.36.2 survival_2.40-1 [21] XML_3.98-1.5 codetools_0.2-15 GenomicRanges_1.26.2 splines_3.3.2 [25] SummarizedExperiment_1.4.0 xtable_1.8-2 RCurl_1.95-4.8 affyio_1.44.0 ADD COMMENTlink modified 17 months ago by Gordon Smyth35k • written 17 months ago by willj30 2 17 months ago by Aaron Lun21k Cambridge, United Kingdom Aaron Lun21k wrote: As you may have already realized, the geneNames column of fit$genes contains missing values. This causes the warnings and can be fixed by removing the rows corresponding to the NA values. Otherwise, the function will consider all NA exons as being part of the same gene, which would be obviously silly.

Also, use of fit[,"Dis1"] doesn't make much sense for a design matrix without an intercept. The "exon-specific log-fold changes" in this setting would just be the log-expression of each exon in the Dis1 group, rather than the log-fold change between two groups (e.g., Dis1 and Healthy). Use contrasts.fit prior to running diffSplice if you want to compare exon usage between groups.

2
17 months ago by
Gordon Smyth35k
Walter and Eliza Hall Institute of Medical Research, Melbourne, Australia
Gordon Smyth35k wrote:

Yes, you should be worried by the unlabelled probesets. diffSplice() assumes that you know which gene each exon belongs to. You have 787k exons in total but 505k of these have no gene identifier. So you do need to fix this basic data problem before proceeding. Is there no annotation as to which exons belong to the same gene? As Aaron has advised, you will have to remove the exons that have no annotation.

I am going to edit diffSplice() so that it treats all NA values for geneid as different genes, each with a single exon.

Also, you would find it easier to define the design matrix like this:

Conds <- relevel(Conds, ref="Healthy")
design <- model.matrix(~Conds)


That way, each disease will be compared back to healthy. Then

ds <- diffSplice(fit, ...)

will do the differential splicing analysis for all the diseases at once.