Question

DESeq2 for single cell pseudobulk processing

r2626 ▴ 10

Our data come from single-cell sequencing. The goal is to identify disease-associated differentially expressed genes (DEGs) for each cell type.

Expression matrix: taking cell type A and disease A as an example, I summed the raw counts over all cells of that cell type within each individual to get a pseudobulk profile, and assembled these profiles into the expression matrix.
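The summation step described above can be sketched in base R as follows. The toy `counts` matrix (genes x cells) and the `sample_id` labels are made-up illustrations, not the poster's data, and the variable names are assumptions.

```r
# Toy single-cell counts for one cell type: 4 genes x 6 cells (hypothetical data)
counts <- matrix(
  c(1, 0, 2, 0, 3, 1,
    0, 0, 1, 0, 0, 2,
    5, 4, 6, 3, 2, 1,
    0, 1, 0, 0, 0, 0),
  nrow = 4, byrow = TRUE,
  dimnames = list(paste0("gene", 1:4), paste0("cell", 1:6))
)

# Individual each cell came from (hypothetical labels)
sample_id <- c("ind1", "ind1", "ind2", "ind2", "ind3", "ind3")

# rowsum() sums rows by group, so transpose to cells x genes,
# sum per individual, then transpose back to genes x individuals
pseudobulk <- t(rowsum(t(counts), group = sample_id))

# Number of cells contributing to each pseudobulk sample
n_cells <- table(sample_id)
```

Real single-cell counts are usually stored as sparse matrices, so the aggregation call would likely differ in practice; the logic of summing raw counts per individual stays the same.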

For filtering, I removed samples with fewer than 50 cells. I removed genes whose row sum was below 10. I also kept only genes with CPM greater than 0.5 in at least 30% of the samples. I did not normalize the expression matrix; raw counts are still used as input.
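A minimal base R sketch of these three filters, assuming a genes x samples pseudobulk matrix `pb` and a named vector `n_cells` of cells per sample. Both objects are simulated here and their names are hypothetical.

```r
set.seed(1)

# Hypothetical pseudobulk matrix: 200 genes x 10 samples of raw counts
pb <- matrix(rnbinom(200 * 10, mu = 50, size = 2), nrow = 200,
             dimnames = list(paste0("gene", 1:200), paste0("s", 1:10)))
pb[1:20, ] <- 0  # a few genes with no counts at all

# Hypothetical number of cells behind each pseudobulk sample
n_cells <- setNames(c(30, 120, 80, 45, 200, 150, 60, 75, 90, 300), colnames(pb))

# 1. Drop samples built from fewer than 50 cells
pb <- pb[, n_cells >= 50, drop = FALSE]

# Library sizes of the retained samples, taken before any gene filtering
lib_size <- colSums(pb)

# 2. Drop genes whose total count is below 10
pb <- pb[rowSums(pb) >= 10, , drop = FALSE]

# 3. Keep genes with CPM > 0.5 in at least 30% of samples.
#    CPM is used only for this filter; raw counts are what go on to DESeq2.
cpm <- t(t(pb) / lib_size) * 1e6
keep_genes <- rowSums(cpm > 0.5) >= 0.3 * ncol(pb)
pb_filtered <- pb[keep_genes, , drop = FALSE]
```

The filtered matrix still holds raw integer counts, which matches the statement above that no normalization is applied before the test.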

For the test: I know that several methods can be used: Wald, LRT, and also LRT with fitType="glmGamPoi". The last one seems to be recommended for single-cell data and seems to lower the filtering criteria, but I am not sure whether this still holds for pseudobulk data, so I only use the LRT with the default fitType here.
For covariates in the design, I considered several: experimental batch cohort, biological sex, PMI, average UMI count per cell, and average number of genes detected per cell, each computed per sample.

library(DESeq2); library(BiocParallel)
dds <- DESeqDataSetFromMatrix(countData = expression, colData = meta, design = ~ batchCohort + Biological_Sex + PMI + avecellumi + n_gene + Disease)
dds <- DESeq(dds, parallel = TRUE, BPPARAM = MulticoreParam(10), test = "LRT", reduced = ~ batchCohort + Biological_Sex + PMI + avecellumi + n_gene)

I have some questions about the methods I use.
Q1: Should I use glmGamPoi for single-cell pseudobulk data?
Q2: Between avecellumi and log(avecellumi), which is the better choice as a covariate?
Q3: Would correlation between covariates, or between a covariate and the condition, affect the results?
Q4: Are there any covariates worth considering for single-cell pseudobulk data (e.g. the number of cells per sample)? Or any covariates that should be removed from the design?

DESeq2 • 2.8k views

ADD COMMENT • link updated 14 months ago by Michael Love 42k • written 15 months ago by r2626 ▴ 10

score 1 · Answer 1 · 2023-04-12

ATpoint ★ 4.2k

Pseudobulk profiles are not sparse and, in terms of their count characteristics, behave essentially like bulk data, so you can follow the 'normal' workflow described in the vignette. glmGamPoi gives almost identical inference to the standard LRT, so you can use it if you want. It improves speed.
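One way to check this agreement on your own data is to fit both variants and compare the p-values. The sketch below uses simulated counts from `makeExampleDESeqDataSet()` as a stand-in for a pseudobulk matrix, with a simple `~ condition` design rather than the poster's full design. It assumes the DESeq2 and glmGamPoi packages are installed.

```r
library(DESeq2)

# Simulated bulk-like counts standing in for a pseudobulk matrix (not real data)
set.seed(1)
dds <- makeExampleDESeqDataSet(n = 500, m = 12)

# LRT with the default dispersion fit
dds_std <- DESeq(dds, test = "LRT", reduced = ~ 1, quiet = TRUE)

# Same LRT, with fitting delegated to glmGamPoi
# (this fitType is only available together with test = "LRT")
dds_gp <- DESeq(dds, test = "LRT", reduced = ~ 1,
                fitType = "glmGamPoi", quiet = TRUE)

res_std <- results(dds_std)
res_gp  <- results(dds_gp)

# Rank agreement of the p-values between the two fits
pval_agreement <- cor(res_std$pvalue, res_gp$pvalue,
                      method = "spearman", use = "complete.obs")
```

A `pval_agreement` close to 1 would be consistent with the statement above; how close it gets on real pseudobulk data depends on the dataset.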

Highly correlated covariates should be avoided (see the thread "Multiple factor DSEq2 result generates weird volcano plot"). In general, try to reduce the covariates to the minimum needed to make the contrasts you want while adjusting for unwanted confounders.
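A quick way to screen for such problems before fitting is to look at the sample-level metadata directly. The sketch below uses simulated metadata with column names borrowed from the question; the values, and the strong link between `avecellumi` and `n_gene`, are assumptions made for illustration.

```r
# Hypothetical sample-level metadata (simulated, not from the thread)
set.seed(1)
n <- 20
meta <- data.frame(
  Disease    = factor(rep(c("control", "case"), each = n / 2)),
  PMI        = runif(n, 2, 24),
  avecellumi = rlnorm(n, meanlog = 8, sdlog = 0.5)
)
# Simulate genes-per-cell so that it tracks UMIs-per-cell closely
meta$n_gene <- 0.3 * meta$avecellumi + rnorm(n, sd = 50)

# Pairwise correlation among the numeric covariates
num_cov <- meta[, c("PMI", "avecellumi", "n_gene")]
cor_mat <- cor(num_cov, method = "spearman")

# Association between each covariate and the condition of interest
cond_p <- sapply(num_cov, function(x) wilcox.test(x ~ meta$Disease)$p.value)

# The design matrix must be full rank for the coefficients to be estimable
mm <- model.matrix(~ PMI + avecellumi + n_gene + Disease, data = meta)
full_rank <- qr(mm)$rank == ncol(mm)
```

In this simulation `avecellumi` and `n_gene` are nearly redundant, so keeping only one of them would be a reasonable simplification. A full-rank design matrix only rules out exact collinearity, so the correlation matrix is still worth inspecting.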

What you include in your particular design is up to you; the support site is not for hands-on guidance. We do not know your particular project and data. Collaborate with a local statistician if needed. Browse recent benchmarking papers to see which strategies have proven useful.

ADD COMMENT • link 15 months ago ATpoint ★ 4.2k

Thanks for your reply! This is so helpful!

ADD REPLY • link 15 months ago r2626 ▴ 10

Agree with @ATpoint.

It is probably better to use the log of a positive, right-skewed covariate. It is also good to center all covariates.
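A minimal base R sketch of this log-then-center preparation, using simulated per-sample covariates. The column names `log_umi_c` and `PMI_c` are hypothetical.

```r
# Hypothetical per-sample covariates (simulated)
set.seed(1)
meta <- data.frame(
  avecellumi = rlnorm(12, meanlog = 8, sdlog = 0.6),  # positive, right-skewed
  PMI        = runif(12, 2, 24)
)

# Log-transform the skewed covariate, then center it
# (scale = FALSE subtracts the mean without dividing by the SD)
meta$log_umi_c <- as.numeric(scale(log(meta$avecellumi),
                                   center = TRUE, scale = FALSE))

# Center the remaining continuous covariate
meta$PMI_c <- as.numeric(scale(meta$PMI, center = TRUE, scale = FALSE))
```

The transformed columns would then take the place of the raw ones in the design formula, for example `~ batchCohort + PMI_c + log_umi_c + Disease`.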

Yes, avoid correlated covariates. I've actually used RUV for pseudobulk data as well, which produces orthogonal nuisance variables. Often the RUV factors explain the known technical covariates anyway, so you don't need to include the known ones if you use the RUV factors in the design.

Actually depending on the sample size, glmGamPoi may or may not be faster. It is much faster with large matrices of repeated integer values.

ADD REPLY • link 15 months ago Michael Love 42k

Thanks a lot! I have modified my script based on nice suggestions from you all!

ADD REPLY • link 15 months ago r2626 ▴ 10

So are you saying pseudobulk data can essentially be treated as bulk data? If so, would a Wald test be appropriate for pseudobulk differential expression analysis? I saw these recommendations for single-cell analysis and assumed they applied to pseudobulk data too, but perhaps they were intended for analyses treating single cells as independent observations.

ADD REPLY • link 14 months ago Marie ▴ 10

For pseudobulk you can treat it like normal, i.e. you don't have to follow those recommendations.

ADD REPLY • link 14 months ago Michael Love 42k