Question

How can I avoid artifacts in gene set/pathway scoring by UCell and similar algorithms?


Omer • 0

@3fdf96af


Germany

Hey people, I’m analyzing scRNA-seq data for mice from 6 different biological groups. I am using Seurat (“MetaFeature”/“AddModuleScore”) and UCell/ssGSEA (via “escape”) to look for differences in pathway/gene set representation between these groups. While looking at the results for hundreds of pathways/gene sets, I’ve noticed that most of them look very similar to one another. I am now quite certain that, in most cases, the (many) differences I see between the experimental groups in their scores for specific pathways/gene sets are an artifact.

I suspect that the problem stems from differences between the samples in the average number of unique genes detected per cell (“nFeature”) and/or in absolute cell numbers. I’m attaching an image with some graphs that exemplify the issue (I’ve removed group/set names, because I’m not allowed to reveal them). The top row shows the factors I suspect may cause the problem, while the bottom row shows UCell scores for a few gene sets that exemplify it (I’ve gotten similar results with Seurat’s “MetaFeature”/“AddModuleScore” functions). Also, as you can see, two of the six groups are from one batch (“Batch 1”) and the other four groups are from a different batch (“Batch 2”). Each group had its own (separate) lane on the 10X Chromium platform.

[Image not preserved]

The data were normalized and integrated using Seurat before running the MetaFeature/AddModuleScore/UCell/ssGSEA functions.

Any idea what I can do in order to remove these artifacts, so that I can get meaningful results?

Cheers,

Omer

escape UCell GSVA

updated 5 months ago by arina • 0 • written 14 months ago by Omer • 0


To note: cross-posted at https://www.biostars.org/p/9554907/

14 months ago by Kevin Blighe ★ 3.9k

score 1 · Answer 1 · 2023-02-20


MassA ▴ 20

@22fe19f4


Switzerland

UCell scores are calculated individually for each cell, so the number of cells in the sample should not be the source of the artifact. I would narrow it down to the number of detected genes.

Can you provide more information: how large are your gene sets? I would not use gene sets larger than the actual number of detected genes (~500 in your example).
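To illustrate why cell number should not matter while the number of detected genes can, here is a minimal toy sketch of a per-cell rank score, loosely modelled on the Mann-Whitney U idea that UCell is built on. It is not UCell's implementation: the function names, the tie handling, the tiny `max_rank` cap and the toy counts are all assumptions made for illustration, and it is written in plain Python rather than R only to keep it self-contained.

```python
def average_ranks(cell_counts):
    """Rank genes within ONE cell (highest count = rank 1); ties share the average rank."""
    ordered = sorted(cell_counts, key=lambda g: -cell_counts[g])
    ranks = {}
    i = 0
    while i < len(ordered):
        j = i
        # Extend j over the run of genes tied with the gene at position i.
        while j + 1 < len(ordered) and cell_counts[ordered[j + 1]] == cell_counts[ordered[i]]:
            j += 1
        shared = (i + j + 2) / 2  # mean of the 1-based positions i+1 .. j+1
        for k in range(i, j + 1):
            ranks[ordered[k]] = shared
        i = j + 1
    return ranks


def toy_rank_score(cell_counts, gene_set, max_rank):
    """Toy U-statistic-style score for one cell (assumed form, not UCell itself)."""
    ranks = average_ranks(cell_counts)
    n = len(gene_set)
    # Genes ranked below the cap (or missing from the cell) are treated as rank = max_rank.
    capped = [min(ranks.get(g, max_rank), max_rank) for g in gene_set]
    u = sum(capped) - n * (n + 1) / 2
    return 1 - u / (n * max_rank)


# Two hypothetical cells with the same ordering of genes, but the second
# has fewer detected genes (g3 and g4 dropped to zero counts).
deep_cell = {"g1": 5, "g2": 3, "g3": 2, "g4": 1, "g5": 0, "g6": 0}
shallow_cell = {"g1": 2, "g2": 1, "g3": 0, "g4": 0, "g5": 0, "g6": 0}
gene_set = ["g3", "g4"]

deep_score = toy_rank_score(deep_cell, gene_set, max_rank=4)
shallow_score = toy_rank_score(shallow_cell, gene_set, max_rank=4)
print(deep_score, shallow_score)
```

The score function only ever sees one cell's counts, so adding or removing other cells cannot change it. The shallow cell scores lower purely because the gene set's genes fell out of detection, which is the kind of depth-driven shift suspected in the question.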

-m

14 months ago by MassA ▴ 20


First, thanks a lot for the help, MassA!

In the example above, the gene sets consist of:

Gene set 1: 207 genes

Gene set 2: 40 genes

Gene set 3: 416 genes

Gene set 4: 130 genes

Gene set 5: 505 genes

So, I assume this on its own won't explain the problem I'm facing, right?

14 months ago by Omer • 0


I agree that the size of the gene sets does not by itself explain the artifact you are seeing.

I don't have an immediate solution, but here is a suggestion: what if you started from a single dataset and then generated alternative versions with reduced sequencing depth? This can be simulated, e.g., with the ‘downsampleMatrix’ function from the scuttle package. You would then be able to measure the underlying bias of the signature scores in a very controlled setting, without the possible real variability between your batches. By the same logic, you could take a gene set and generate smaller versions of it by randomly picking out genes, to study the effect of gene set size.
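As a rough sketch of that experiment, the toy code below thins the counts of a single made-up cell to several depths and draws smaller random versions of a gene set. It uses binomial thinning (each count kept independently with probability `prop`) as one common way to mimic lower depth; whether this matches what ‘downsampleMatrix’ does internally is not checked here. All function names, the toy cell and the chosen proportions and sizes are hypothetical, and plain Python is used only to keep the example self-contained.

```python
import random


def downsample_counts(cell_counts, prop, rng):
    """Keep each count (UMI) independently with probability `prop` (binomial thinning)."""
    return {
        gene: sum(1 for _ in range(count) if rng.random() < prop)
        for gene, count in cell_counts.items()
    }


def n_detected(cell_counts):
    """Number of genes with at least one count (the per-cell 'nFeature')."""
    return sum(1 for count in cell_counts.values() if count > 0)


def subsample_gene_set(gene_set, size, rng):
    """Randomly pick `size` distinct genes from a gene set."""
    return rng.sample(list(gene_set), size)


rng = random.Random(0)  # fixed seed so the toy run is repeatable

# Hypothetical toy cell: 200 made-up genes with skewed counts.
cell = {f"gene{i}": int(rng.expovariate(0.5)) for i in range(200)}

# The same cell at progressively lower simulated depth.
depth_series = {
    prop: downsample_counts(cell, prop, rng) for prop in (1.0, 0.5, 0.25, 0.1)
}
for prop, thinned in depth_series.items():
    print(prop, sum(thinned.values()), n_detected(thinned))

# Smaller random versions of one hypothetical 50-gene set.
full_set = [f"gene{i}" for i in range(50)]
smaller_sets = {size: subsample_gene_set(full_set, size, rng) for size in (40, 20, 10)}
```

Each downsampled version and each reduced gene set would then be scored with the method under test (UCell, AddModuleScore, ssGSEA). Because the underlying biology is identical across versions, any systematic change in score can be attributed to depth or gene set size.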

I hope this helps,

-m

14 months ago by MassA ▴ 20


That sounds like an excellent idea. In fact, a couple of weeks ago I was looking for a package/function that can randomly reduce the sequencing depth, but didn't find any, so thanks a lot for suggesting scuttle!

Once I'm done with it (and assuming I don't run into any technical snag) I'll report the results here :)

14 months ago by Omer • 0


Hi Omer,

Unfortunately I don't have an answer to your issue, but I am facing a very similar problem after integrating datasets that were sequenced at different depths. I was wondering how this approach worked out for you and whether you have any other suggestions.

Cheers, Arina

5 months ago by arina • 0