Question: DESeq2: Sequencing depth as confounding factor after size factor normalization
gravatar for ipso
14 months ago by
ipso0 wrote:


I am performing RNA-Seq analysis on a bulk dataset comprising 4 different groups (a, b, c, d) with 5-6 biological replicates per group. When performing QC plotting, I find that PCA and clustering of vst-transformed values identifies some samples as quite different from the others:

# Read dds
dds <- readRDS(dds.path)

# Variance stabilizing transformation
dds.transform <- varianceStabilizingTransformation(dds, blind=TRUE)

# Plotting PCA
p.pca <- plotPCA(dds.transform,
                 intgroup=c("group"), ntop=500)
nudge <- position_nudge(y=5, x=5)
p.pca + geom_label(aes(label = name), position = nudge)

# Plotting heatmap + clustering
df <- data.frame(group = colData(dds.transform)[,"group"],
                 row.names = colnames(dds.transform))
select.200 = order(rowMeans(counts(dds, normalized=TRUE)),
pheatmap(assay(dds.transform)[select.200,], cluster_rows=F, show_rownames=F,
         cluster_cols=T, annotation_col=df)

Interestingly the samples are the ones with the highest amount of reads (and largely belonging to the same batch). Looking at the heatmaps it also is apparent that they cluster differently because of generally higher reads:

I thought the size factor normalization should take care of this? Removing the batch effect using limma's removeBatchEffect() is also not really helping (makes it worse even):

# Remove batch effect
assay(dds.transform) <- removeBatchEffect(assay(dds.transform), batch=dds.transform$batch)

Is this kind of variation to be expected or do I need to take care of this? Both intra- and intergroup differences theoretically should not be that big. What I am eventually interested in are mainly the differentially expressed genes.

Thank you!

> sessionInfo()
R version 3.5.1 (2018-07-02)
Platform: x86_64-apple-darwin15.6.0 (64-bit)
Running under: macOS High Sierra 10.13.6

Matrix products: default
BLAS: /System/Library/Frameworks/Accelerate.framework/Versions/A/Frameworks/vecLib.framework/Versions/A/libBLAS.dylib
LAPACK: /Library/Frameworks/R.framework/Versions/3.5/Resources/lib/libRlapack.dylib

[1] de_DE.UTF-8/de_DE.UTF-8/de_DE.UTF-8/C/de_DE.UTF-8/de_DE.UTF-8

attached base packages:
[1] parallel  stats4    stats     graphics  grDevices utils     datasets  methods   base     

other attached packages:
 [1] limma_3.36.2                vsn_3.48.1                  RColorBrewer_1.1-2          gridExtra_2.3              
 [5] pheatmap_1.0.10             gplots_3.0.1                ggplot2_3.0.0               DESeq2_1.20.0              
 [9] SummarizedExperiment_1.10.1 DelayedArray_0.6.5          BiocParallel_1.14.2         matrixStats_0.54.0         
[13] Biobase_2.40.0              GenomicRanges_1.32.6        GenomeInfoDb_1.16.0         IRanges_2.14.10            
[17] S4Vectors_0.18.3            BiocGenerics_0.26.0   
ADD COMMENTlink modified 14 months ago by Michael Love26k • written 14 months ago by ipso0
Answer: DESeq2: Sequencing depth as confounding factor after size factor normalization
gravatar for Michael Love
14 months ago by
Michael Love26k
United States
Michael Love26k wrote:

What kind of RNA-seq is this by the way? I see you have .5 million reads for most samples.

Can you plot just raw counts from b1 vs b6, c1 vs c6, d1 vs d2? I'm wondering if something else is different about those samples beyond x4 higher depth.

ADD COMMENTlink written 14 months ago by Michael Love26k

Hi Michael,

It's data from hematopoietic stem cells. The dataset is already some years old so the depth is really not that great. Still I need to reanalyse it as it has some unique treatment.

I wasn't really sure what you meant by plotting the raw counts so I attached some basic density plots of the count distribution (log2). 

ADD REPLYlink written 14 months ago by ipso0

Sorry, I meant plot(b1, b6, log="xy"). Scaling normalization would help you when these plots of replicates mostly indicate a shift (a multiplicative factor on the raw count scale).

ADD REPLYlink modified 14 months ago • written 14 months ago by Michael Love26k

Of course, here you go:


For me, c1 vs. c6 and d1 vs. d2 look somewhat linearly correlated. b1 vs. b6 however looks in fact quite weird. 

ADD REPLYlink modified 14 months ago • written 14 months ago by ipso0

All of these look highly variable for replicates. This isn't a sequencing depth issue I think, but some other quality control problem. I'd look into if something was special about these high depth samples.

What do you see with e.g. b5 vs b6?

ADD REPLYlink written 14 months ago by Michael Love26k

You were right. The variability is much less when comparing between the libraries with lower depth. It seems that b1, c1 and d1 are highly different from everything else. b2 to a lesser extend also. a2, c2 and d5 maybe as well, but I believe their deviation should be still acceptable. 

What could be the reason for this high variation?

a samples

b samples

c samples

d samples

ADD REPLYlink modified 14 months ago • written 14 months ago by ipso0

I don’t know. Usually the answer lies not on the computer but in talking to the experimentalists.

ADD REPLYlink written 14 months ago by Michael Love26k
Please log in to add an answer.


Use of this site constitutes acceptance of our User Agreement and Privacy Policy.
Powered by Biostar version 16.09
Traffic: 200 users visited in the last hour