Question

bulk RNA-seq analysis using R : batch effect : One question concerning Combat-seq and one question concerning DESeq2 design = batch + condition

1

Entering edit mode

D-Bug ▴ 10

@abc814fb

Last seen 4 months ago

France

Hello,

I want to analyze the differentially expressed genes (DEG) of 3 cell populations between tumor and healthy blood.

I have bulk RNA-seq data as a txi matrix containing abundance, counts and length matrixes for 30 samples.

I have a metadata file that indicates for each sample:

sample ID
Populations (1,2,3)
Patient (1,2,3,4,5,6,7,8,9,10)
Tissu (tumor or healthy blood)
batch (sequencing n°1 or n°2)
Tumor stage (2 or 3)
Tumor type (histological sub-categories of the tumor: 4 values possible including "healthy" for the blood samples)

batch 1: 4 tumors x 3 populations = 12 samples

batch 2 : 1 tumor x 3 populations + 5 blood x 3 populations = 18 populations

The tumor of the batch n°2 seems to clusterize apart from all other samples, so I would like to remove the batch effect due to sequencing runs (see below).

Is it possible even if I don't have blood samples in the first sequencing run? PCA plot of the 3 populations PCA plot for each subset

I tried to apply the combat-seq algorithm from sva package (see below) but I face a problem: The code runs well and ends for txi$abundance in <1min, but for the txi$counts, the code runs but never ends, it keeps running even after several hours.

Population <- (factor(meta2$Population))
Tissu <- (factor(meta2$Tissu))
Tumor_type <- (factor(meta2$Tumor_type))
Grade <- (factor(meta2$Grade))
covar_mat <- cbind(Population, Tissu, Tumor_type, Grade)

txi$abundance <- sva::ComBat_seq(txi$abundance, batch=metadata$batch, covar_mod=covar_mat)

Output: Found 2 batches Using null model in ComBat-seq. Adjusting for 4 covariate(s) or covariate level(s) Estimating dispersions Fitting the GLM model Shrinkage off - using GLM estimates for parameters Adjusting the data

txi$counts <- sva::ComBat_seq(txi$counts, batch=metadata$batch, covar_mod=covar_mat)

Output: Found 2 batches Using null model in ComBat-seq. Adjusting for 4 covariate(s) or covariate level(s) Estimating dispersions Fitting the GLM model Shrinkage off - using GLM estimates for parameters Adjusting the data but it keep running and never ends.

I have the same issue even if I include only the 2 factors "Population, Tissu" in the covar_mod.

However, if I put group=Tissu instead of covar_mod, the process ends even for the counts matrix but I completely lose the biological effect due to population identity, so I can't do that.

Would you have an idea why it does not work for the counts matrix?

I understood by reading other posts that combat-seq needs the raw counts, so I need it to work on the counts matrix...

Finally, I tried DESeq2 without combat-seq. In the design whether I put "~ Patient + Population" or "~ batch + Population", I obtain exactly the same PCA_plots, I am quite surprised by that. I would have expected a difference even a slight one...

dds <- DESeqDataSetFromTximport(txi, colData = metadata, design = ~ Patient + Population)   
#patient and Population have been transformed as factor before running this.

Moreover, when I subset the dds object to focus for example on the population n°1 to compare between blood and tumor, I have a completely different list of DEGs whether I indicate a design "~ batch + Tissu" or "~ Tissu" alone even if the PCA plots are identical between the 2 designs.

( I can't indicate " ~patient + Tissu" because i obtained the following message: Error in checkFullRank(modelMatrix) : the model matrix is not full rank, so the model cannot be fit as specified. One or more variables or interaction terms in the design formula are linear combinations of the others and must be removed.)

dds_pop1 <- dds[, dds$Population == "pop1"]

dds_pop1@colData@listData$batch <- factor(dds_pop1@colData@listData$batch) 
dds_pop1@colData@listData$Tissu <- factor(dds_pop1@colData@listData$Tissu)

dds_pop1 <- DESeqDataSet(dds_pop1, design = ~ batch + Tissu)  #I can't put ~ Patient + Tissu because I have the linear combination error

dds_pop1_DEG <- DESeq(dds_pop1) 
contrast_tissu <- c("Tissu", "breast", "blood") 
res_pop1_unshrunken <- results(dds_pop1_DEG, contrast_tissu, alpha = 0.05)

Do I have to trust the DEG with the design that controls the batch effect ? I can't rely on the PCA distribution because it is the same between the design with or without controling batch

Thank you for any help,

Best regards,

Léo

BatchEffect sva DESeq2 RNASeqData limma • 1.1k views

ADD COMMENT • link updated 8 months ago by James W. MacDonald 65k • written 8 months ago by D-Bug ▴ 10

0

Entering edit mode

Can you show colData(dds)? Aside from that, dds_pop1@colData@listData$batch and direct access to slots like that is pretty clunky. Just do dds$batch.

ADD REPLY • link 8 months ago ATpoint ★ 4.0k

0

Entering edit mode

Thanks for your reply. Ok for dds$batch

Please find below the colData(dds).

Concerning the DEG, my question is to know whether I should include batch ("~ batch + Tissu") or not ("~ Tissu") in the design, the PCA plot is the same in both cases.

enter image description here

ADD REPLY • link 8 months ago D-Bug ▴ 10

score 1 · Answer 1 · 2023-08-10

There are two issues. First, you have aliased blood and batch by only having blood samples in one batch. All things equal, I would probably subset to tumor and blood samples and analyze separately, because the blood samples don't have a batch effect.

Second, you shouldn't be using ComBat, but instead should be using sva (or actually svaseq) to generate surrogate variables that you can include in your model. Or you could use one of the RUV functions. Your choice. The goal of ComBat is to adjust both the mean and variance of the different known batches in order to make them more comparable. That's a thing you can do, but you should then also include the batch effect in your model to adjust the degrees of freedom that you have lost by doing the adjustment. But you have a possible batch effect (which you could account for within your linear model), and possible outliers. Using surrogate variables is a better way to deal with unobserved variability.

Another option is to use the limma-voom pipeline and compute sample weights, which will automatically down-weight samples that are dissimilar to other samples of the same type.