The first principal component of variance stabilised transformed RNA-seq data for 350 whole blood samples shows some correlation with total read count for sample, and also with library concentration for sample.
The PCA analysis is performed on read abundance summarized to gene data processed by vst() in DESeq2, which I understand is correcting for sequence depth. So why does this correlation remain? Is it indicative of a problem with the sequencing, or expected for clinical samples (because RNA concentration might relate to real variation eg in how unwell patients are), or have I misunderstood what DESeq2 is doing in the vst function?
The correlation remains when low count outliers are removed.
Full disclosure, I asked a related question on biostars, sorry if this is not the right place to post.
se <- tximeta(coldata = coldata) gse <- summarizeToGene(se) total_counts_by_sample <- round( colSums(assay(gse)) / 1e6, 1 ) # make DESeq object dds <- DESeqDataSet(gse, design = ~ x1 ) # filter zero count genes keep <- rowSums(counts(dds)) > 1 dds <- dds[keep,] # variance stabilising transform vsd <- vst(dds, blind = FALSE) rv <- rowVars(assay(vsd)) select <- order(rv, decreasing = TRUE)[seq_len(min(500, length(rv)))] pca <- prcomp(t(assay(vsd)[select, ])) # PC1 ~ total read count (millions) stack(pca$x[,1]) %>% dplyr::rename(pc1 = values) %>% left_join( stack(total_counts_by_sample) %>% dplyr::rename(tot_counts = values), by = "ind" ) %>% ggplot( aes(tot_counts, pc1) ) + geom_point() + geom_smooth() + stat_cor() # PC1 ~ library concentration (metadata) stack(pca$x[,1]) %>% dplyr::rename(pc1 = values) %>% left_join( coldata %>% transmute( ind = names, library_conc = library_conc_q), by = "ind" ) %>% ggplot( aes(library_conc, pc1) ) + geom_point() + geom_smooth() + stat_cor()