Question: How to best represent weights in SummarizedExperiment or SingleCellExperiment?
0
3 months ago by
Australia/Melbourne/Monash University Bioinformatics Platform
Paul Harrison70 wrote:

I am trying to work out the best way to represent weights in the newer SummarizedExperiment and SingleCellExperiment classes.

By weights I mean inverse variance (up to some scaling factor) for example as accepted by the lm() function. Weights are a somewhat simplistic but general way of accounting for the varying noise levels between different observations in many types of data. I'm interested in developing generic approaches for things like visualization and principal components analysis based on this. limma closely follows the "Statistical models in S" book's approach to linear modelling, and so the limma EList class has support for weights. This has been a quite convenient data type to use, but I'd like to follow modern Bioconductor standards if possible. limma also has an internal getEAWP function for extracting data including weights from a variety of classes, but this doesn't look like it supports SummarizedExperiement.

• Are there any other packages with support for weights, and what is their preferred representation?

• Would a naming convention for assays be a good way to do this? For example a "log2CPM" assay could have an associated "log2CPMWeights" assay.

There is possibly also a need to clearly distinguish technical variability alone from and technical+biological variability.

modified 3 months ago by davide risso810 • written 3 months ago by Paul Harrison70
1

It looks like SingleCellExperiment has a weights() method, which is hopeful but I'm not clear on which assay the weights are meant to be associated with with.

1

It is not possible for limma to support SummarizedExperiment objects because it is not a sufficiently well-defined class. The Assays can contain anything, so it's impossible for a downstream package like limma to know what sort of analysis might be appropriate.

Ok, it's important not to guess wrong. I'll require the user to do something explicitly marking the assays appropriate to use.

I think my solution will be to add some metadata fields to a SummarizedExperiment defining the appropriate pair of assays to use. I'll then write my code to produce an error if these fields are not present.

(An alternative would be to define a subclass of SummarizedExperiment which was sufficiently well defined. However there's already an ecosystem of subclasses of SummarizedExperiment which I want to be able to build on without subclassing each of them.)

Answer: How to best represent weights in SummarizedExperiment or SingleCellExperiment?
1
3 months ago by
davide risso810
Weill Cornell Medicine
davide risso810 wrote:

Since the weights are a J x N matrix (same dimension of the expression data), the natural thing to do is to create a new assay. This is the approach taken in DESeq2 and zinbwave, which both call this assay simply weights. (So to answer your first question DESeq2 also supports weights, as well as edgeR which does not use SummarizedExperiment, though).

The advantage of taking the same approach, in addition to consistency among packages, is that you can use the weights() method which is simply a wrapper for assay(x, "weights").

Obviously, if you envision the need to save more than one set of weights, you will have to choose more specific names for the assays, but as long as this is well documented, it should not be a problem.

Hope this helps.

Best, Davide

1

It is worth stressing that weights() and other aliases (e.g., counts(), logcounts()) are simply convenience wrappers to avoid having to type out the full assay() command. It is difficult to prescribe the content of these fields in a strict mathematical sense - for example, there are a number of ways that logcounts could be computed, differing in the normalization and added pseudo-count. The same applies for weights - are they frequency weights? Or precision weights? Moreover, weights to be used in a GLM will not have the same interpretation as weights to be used in a linear model. So it's ultimately the user's responsibility to ensure that the weights/counts/whatever being stored in a SE/SCE are appropriate for use in downstream functions. The wrappers are just a gentle prod to package developers to encourage the use of the same name for related concepts.

Thanks Davide and Aaron. I had not thought about the distinction between frequencies and precision weights. So if I'm understanding correctly, zinbwave weights can be thought of as (fractional) frequencies, and contribute to the degrees of freedom as such. Also importantly, they are for "counts" rather than "logcounts".

Given the ambiguity between weights applying to "counts" or "logcounts", I think my approach will be the automatically use "weights" only if it and one other assay is present, and otherwise to require both assay names to be specified explicitly.