Question

P-value histogram of testLinearModel is not uniform

0

Entering edit mode

Lluís Revilla Sancho ▴ 730

@lluis-revilla-sancho

Last seen 20 hours ago

European Union

I'm analyzing a smart-seq dataset with samples from 9 different people with 4 runs and 17 plates. The samples were first treated and then sorted after stimulated with another condition, later were filtered in flow cytometry and then plated. In total, there are 14 different conditions.

I'm following the OSCA guidelines and I found scran::testLinearModel to make comparisons while providing a design that would allow to block or control for other variables. I was checking the output of the histogram of the p-values of a comparison (without blocking or controlling just ~vaccination) and I am surprised by some cells having very similar p-value at specific ranges: around 0.16 and close to 0.46:

histogram of vaccinated vs unvaccinated showing some intervals with high number of cells

While the experiment wasn't designed for scRNA-seq it is the first time analyzing Smart-seq2 data and I'm not sure if I missed any step or why could this be happening. Is this normal/expected (due to normalization effects sequencing depth or other know scRNA-seq procedure) or there is something wrong I can address?

The dataset was filtered by % of mitochondrial percentage, number of reads and number of features and later normalized via scuttle::logNormCounts. The total number of cells after filtering is low ~1135 and while the 1st PCA showed a low % of variance (3%) I didn't observe much batch effect and if any it was at the second dimension (2%).

smart-seq2 scRNAseq scran • 1.3k views

ADD COMMENT • link 10 months ago Lluís Revilla Sancho ▴ 730

score 2 · Accepted Answer · 2023-06-07

2

Entering edit mode

James W. MacDonald 65k

@james-w-macdonald-5106

Last seen 12 hours ago

United States

I normally see p-value histograms like this when using bulk RNA-Seq data that hasn't been filtered to remove genes with lots of zero counts. For bulk analysis the obvious fix is to filter the data, but I know nothing about Smart-Seq2, and it may be that you expect genes with lots of zeros, in which case it's probably just what it is.

ADD COMMENT • link 10 months ago James W. MacDonald 65k

0

Entering edit mode

The work from Charlotte Soneson (https://pubmed.ncbi.nlm.nih.gov/29481549/) points out that for scRNA-seq a good prefiltering improves DE reliability using frameworks like limma (not unexpected, but good they show it formally). When I compare two groups on single-cell level it (for me) works well to require a certain CPM per gene in at least x% of cells. For example, at least one of two groups must express a gene with a CPM (scuttle::calculateCPM) of 1 in > 25% of cells. MA-plots look much more reasonable after such a filter (and utterly violent without it) and it removes genes called as DE due to few outliers while the vast majority of cells has zeros for that gene. Might help here as well.

ADD REPLY • link 10 months ago ATpoint ★ 4.0k

0

Entering edit mode

Yes, I thought about this, but I thought it would show up as just one bump not two. I had already filtered out many genes lowly expressed (the plot is made with 21k genes), but it seems I'll need to be more restrictive. The suggested threshold "1 in > 25% of cells" applied to both groups only removes 617 more genes (Smart-seq2 has up to 96 cells per plate and thus the sequencing depth is quite high). Thanks both!

ADD REPLY • link 10 months ago Lluís Revilla Sancho ▴ 730