Question

PCA for QC should include subset of top variable genes or full set of genes in RNAseq data?

0

Entering edit mode

amnahsiddiqa ▴ 10

@amnahsiddiqa-15854

Last seen 9 months ago

United States

PCA for QC in RNA seq data is to detect sample level outliers and batch effects etc.

I am well aware that plotPCA of DESeq2 uses topn genes for doing PCA.

My naive question from whole community is that do we need to use all genes or top highly variable genes for doing PCA for QC of RNA seq data or any highthrougput omics data.

Now, I have around 200 samples from paired ened rna seq data and

# Read count data
ft <- "./data/rsem_counts.tsv"
readCount <- read.table(file = ft, header = TRUE, row.names = 1, stringsAsFactors = FALSE, check.names = FALSE)
head(colnames(readCount))

# Read metadata
metadata_inputfile <- "./data/metadata_counts.tsv"
metadata <- read.table(file = metadata_inputfile, header = TRUE, stringsAsFactors = FALSE, check.names = FALSE)
row.names(metadata)=metadata$filenames
metadata$filenames==colnames(readCount)

metadata$visit<-as.factor(metadata$visit)
metadata$response_status<-as.factor(metadata$response_status)


dds <- DESeqDataSetFromMatrix(readCount, 
                              colData=metadata, 
                              ~visit+response_status)
# Check the available columns in colData(dds)
colnames(colData(dds))
colData(dds)
dds 

vsd <- vst(dds, blind=FALSE)

## PCA for topn genes by default that DESEQ is using n=500
DESeq2::plotPCA(vsd,intgroup="Run", n=1000)
DESeq2::plotPCA(vsd,intgroup="Run", n=10000)

?plotPCA

Now looking at top 500 genes there is a strong some sort of factor (probably hidden batch effect?) but nothing in pca with 10000 genes(or whatever number of filtered genes i had after removing lowly expressed genes).

10k genes pca : https://drive.google.com/file/d/1LVKCJ-OJPmucjsfRa5mw-zc4s1w3NUw4/view?usp=sharing

1k genes pca : https://drive.google.com/file/d/1oU5yNxKPLCu8dEjkwEtisxBJtJYv0dZR/view?usp=sharing

I am very confused that I should go for sva here or not. any help in what ways probably i can track or move forward to check this sort of effect is real or not is appreciated.

ANd also what is teh best practice to do pca for QC at sample l;evel? by using topn genes or all gens?

Thanks much.

BatchQC RNASeqData sva BatchEffect • 1.1k views

ADD COMMENT • link updated 10 months ago by ATpoint ★ 4.0k • written 10 months ago by amnahsiddiqa ▴ 10

score 2 · Answer 1 · 2023-06-29

2

Entering edit mode

ATpoint ★ 4.0k

@atpoint-13662

Last seen 23 hours ago

Germany

You use top genes because batch effects affect a subset of genes, not all genes. If all genes were affected pretty much equally then we could simply normalize them away with a scaling factor. Using top genes is motivated by the fact that relevant batch effects are causing strong variability, hence you diagnose with the most variable genes.

ADD COMMENT • link 10 months ago ATpoint ★ 4.0k

0

Entering edit mode

Thanks AT- probably it will be helpful for other people like me to know or have a reference for what could be the factors and why only subset of genes could be affected during sequencing. If you have any pointers I will appreciate that too.

ADD REPLY • link 10 months ago amnahsiddiqa ▴ 10

0

Entering edit mode

That is the definition of batch effects, unwanted technical variation. Again, if all genes were affected equally it would be removed by standard normalization and you would not diagnose it.

ADD REPLY • link 10 months ago ATpoint ★ 4.0k