DESeq2: Sequencing depth as confounding factor after size factor normalization
1
0
Entering edit mode
ipso • 0
@ipso-9761
Last seen 3.1 years ago

Hi,

I am performing RNA-Seq analysis on a bulk dataset comprising 4 different groups (a, b, c, d) with 5-6 biological replicates per group. When performing QC plotting, I find that PCA and clustering of vst-transformed values identifies some samples as quite different from the others:

# Read dds

# Variance stabilizing transformation
dds.transform <- varianceStabilizingTransformation(dds, blind=TRUE)

# Plotting PCA
p.pca <- plotPCA(dds.transform,
intgroup=c("group"), ntop=500)
nudge <- position_nudge(y=5, x=5)
p.pca + geom_label(aes(label = name), position = nudge)

# Plotting heatmap + clustering
df <- data.frame(group = colData(dds.transform)[,"group"],
row.names = colnames(dds.transform))
select.200 = order(rowMeans(counts(dds, normalized=TRUE)),
decreasing=TRUE)[1:200]
pheatmap(assay(dds.transform)[select.200,], cluster_rows=F, show_rownames=F,
cluster_cols=T, annotation_col=df)

Interestingly the samples are the ones with the highest amount of reads (and largely belonging to the same batch). Looking at the heatmaps it also is apparent that they cluster differently because of generally higher reads:

I thought the size factor normalization should take care of this? Removing the batch effect using limma's removeBatchEffect() is also not really helping (makes it worse even):

# Remove batch effect
assay(dds.transform) <- removeBatchEffect(assay(dds.transform), batch=dds.transform\$batch)

Is this kind of variation to be expected or do I need to take care of this? Both intra- and intergroup differences theoretically should not be that big. What I am eventually interested in are mainly the differentially expressed genes.

Thank you!
Best,
Joe

> sessionInfo()
R version 3.5.1 (2018-07-02)
Platform: x86_64-apple-darwin15.6.0 (64-bit)
Running under: macOS High Sierra 10.13.6

Matrix products: default
BLAS: /System/Library/Frameworks/Accelerate.framework/Versions/A/Frameworks/vecLib.framework/Versions/A/libBLAS.dylib
LAPACK: /Library/Frameworks/R.framework/Versions/3.5/Resources/lib/libRlapack.dylib

locale:
[1] de_DE.UTF-8/de_DE.UTF-8/de_DE.UTF-8/C/de_DE.UTF-8/de_DE.UTF-8

attached base packages:
[1] parallel  stats4    stats     graphics  grDevices utils     datasets  methods   base

other attached packages:
[1] limma_3.36.2                vsn_3.48.1                  RColorBrewer_1.1-2          gridExtra_2.3
[5] pheatmap_1.0.10             gplots_3.0.1                ggplot2_3.0.0               DESeq2_1.20.0
[9] SummarizedExperiment_1.10.1 DelayedArray_0.6.5          BiocParallel_1.14.2         matrixStats_0.54.0
[13] Biobase_2.40.0              GenomicRanges_1.32.6        GenomeInfoDb_1.16.0         IRanges_2.14.10
[17] S4Vectors_0.18.3            BiocGenerics_0.26.0   
0
Entering edit mode
@mikelove
Last seen 14 hours ago
United States

What kind of RNA-seq is this by the way? I see you have .5 million reads for most samples.

Can you plot just raw counts from b1 vs b6, c1 vs c6, d1 vs d2? I'm wondering if something else is different about those samples beyond x4 higher depth.

0
Entering edit mode

Hi Michael,

It's data from hematopoietic stem cells. The dataset is already some years old so the depth is really not that great. Still I need to reanalyse it as it has some unique treatment.

I wasn't really sure what you meant by plotting the raw counts so I attached some basic density plots of the count distribution (log2).

0
Entering edit mode

Sorry, I meant plot(b1, b6, log="xy"). Scaling normalization would help you when these plots of replicates mostly indicate a shift (a multiplicative factor on the raw count scale).

0
Entering edit mode

Of course, here you go:

For me, c1 vs. c6 and d1 vs. d2 look somewhat linearly correlated. b1 vs. b6 however looks in fact quite weird.

0
Entering edit mode

All of these look highly variable for replicates. This isn't a sequencing depth issue I think, but some other quality control problem. I'd look into if something was special about these high depth samples.

What do you see with e.g. b5 vs b6?

0
Entering edit mode

You were right. The variability is much less when comparing between the libraries with lower depth. It seems that b1, c1 and d1 are highly different from everything else. b2 to a lesser extend also. a2, c2 and d5 maybe as well, but I believe their deviation should be still acceptable.

What could be the reason for this high variation?

### d samples

0
Entering edit mode

I don’t know. Usually the answer lies not on the computer but in talking to the experimentalists.