Question

Accounting for Batch effect reduces DEGs drastically

0

Entering edit mode

rohitsatyam102 ▴ 20

@rohitsatyam102-24390

Last seen 5 weeks ago

India

Hi Everyone

I have a mouse RNASeq data (Control vs Knockout) and I analysed last year using GRCM28 (See PCA Plot2). The samples segregate clearly by Condition and there is no clear segregation by Year. I repeated the same pipeline this year with only variable of reference genome (GRCMv33).This time I see clear segregation of same samples with 11% variance being explained by PC2. So I decided to account for this batch effect due to Year of sequencing in my design formula: design = ~ year + condition However, this reduces my DEGs drastically to 14 (out of ~18K genes expressed) (see the volcano plots made using ggvolcano)

My question is, should I perform batch correction or not because more than 80% of the variance in the dataset is being explained by my condition. If yes, how can I prevent this weird behavior and explore the reason behind such behavior? Any exploratory suggestion is appreciated!!

enter image description here

DESeq2 • 1.1k views

ADD COMMENT • link 6 months ago • updated 5 months ago rohitsatyam102 ▴ 20

1

Entering edit mode

Are the two volcano plots from the exact same GRCMv33 data set, just analyzed with the different models of ~ condition or ~ Year + condition? If so, I think there is just something not right - how can the inclusion of Year remove almost all variation related to condition such that most genes now have 0 LFC? But yet some of the 0s somehow have -log10FDR = 15?! That just cannot be based on the PCA on the same data - adding year should only increase the condition differences. I would be triple-checking my sample metadata to make sure I had everything coded correctly and the DESeq outputs to make sure I was pulling the correct contrast.

ADD REPLY • link 6 months ago Jenny Drnevich ★ 2.0k

0

Entering edit mode

Yes. The two volcano plots from the exact same GRCMv33 data set and just analyzed with the different models of ~ condition or ~ Year + condition. I know it's weird to have a volcano plot that's present on the right side.

ADD REPLY • link 5 months ago rohitsatyam102 ▴ 20

0

Entering edit mode

Is year numeric or a factor/character?

ADD REPLY • link 6 months ago ATpoint ★ 4.1k

0

Entering edit mode

Hi ATpoint, year is factor. DESeq2 automatically converts the character vector to factor

In DESeqDataSet(se, design = design, ignoreRank) :
  some variables in design formula are characters, converting to factors

ADD REPLY • link 6 months ago rohitsatyam102 ▴ 20

1

Entering edit mode

Yes, I know. But did you make absolutely sure year is a factor? What is the output of is.numeric(dds$year). Results would make a lot of sense with a numeric year.

ADD REPLY • link 6 months ago ATpoint ★ 4.1k

0

Entering edit mode

It is not numeric.

> is.numeric(ddsHTSeq_combined$Year)
[1] FALSE

I even tried this

ddsHTSeq_combined$Year <- as.character(ddsHTSeq_combined$Year)
ddsHTSeq_combined$Year <- factor(ddsHTSeq_combined$Year,levels = c("2020","2021"))

ADD REPLY • link 6 months ago rohitsatyam102 ▴ 20

0

Entering edit mode

I might have missed out one detail which might actually be relevant now that I revisited my code. Previously using old reference genome GRCMv28 I never performed in-silico ribodepletion step using ribodetector but in my current code I did and I realized that control samples have more rRNA content as compared to knockouts i.e. only 34-48% of total reads are non-rRNA reads. I thought it might not be relevant initially because I never carried out this additional step last year and the results we obtained were matching our hypothesis but following up recent papers that carry out this step, we thought removal of such reads should be harmless. But I am not so sure now what's the consensus in the cancer RNAseq community about this?

ADD REPLY • link 6 months ago rohitsatyam102 ▴ 20

0

Entering edit mode

In order to seek answer as to why the same sample didn't segregate as clearly using grcm28 vs as they do now when using grcmv33, I went back to check if the genes majorly contributing to both PC1 and PC2 are same or different. I start by plotting the genes that have more influence on the PCs using PCAtools::biplot. The 5 top loadings for each PCs are shown below

enter image description here

As we can see they look identical. From these loadings I then retrieved top 500 genes and then intersected them and I see that a set of 40 genes are different in both old and new deseq2 object. If version is relevant I am using DESeq2_1.40.2

enter image description here

I also tried to see if these 40 genes in new_500_genes are part of new annotation but only found 3 out of it to be new ruling out that newly detected genes might be influencing the PCS. Besides, when I look at the gene description obtained from both old and new PCA loadings, none of them are ribosomal RNA related gene ruling out the effect of ribodepletion step.

This exploration might not explain why including Year as a co-variate in design formula results in so low DEGs but I was curious if there was an effect due to ribodepletion step.

ADD REPLY • link 6 months ago rohitsatyam102 ▴ 20

0

Entering edit mode

Are the new samples balanced in respect with the conditions of control and knockout? I can imagine that if the new samples are all (or mostly) of one of two conditions, they would soak up a lot of variance. I also don't get how your PCA logically links with the fact that there is the new "year" variable. Could you explain your logic again?

ADD REPLY • link 6 months ago Luca • 0

0

Entering edit mode

I don't understand what do you mean when you ask Are the new samples balanced but if you see the PCA plot, you can observe that we have control and knockout sequenced together in same year (a pair in 2020 and another in 2021). If you see GRCMv33 PCA plot you can see that the pair of samples (control and Knockout) from year 2020 segregates on PC2 (if you draw a horizontal line at 0 intercept on Y -axis). This segregation by year tells me that the "Year" variable is an interesting Covariate and I should account for it in my Deseq2 design formula.

ADD REPLY • link 6 months ago rohitsatyam102 ▴ 20