Question: Approaches to group the RNA-Seq samples with large heterogeneity
0
3.5 years ago by
g.atla0
g.atla0 wrote:

I have a rna-seq data set with a paired design ( 30 tissues with treated and untreated condition ). I am trying to classify the samples in to different subsets as there is lot heterogeneity in the data i.e. a subset of samples might behave differently to the treatment, few might not respond at all etc as its a primary human cells data.

I have tried clustering/PCA analysis ( with in DESeq2) to look how the samples are clustered based on normalised or variancestabilized read counts but the results looks like mixed. I could not clearly identify the subset of samples.

I would like to take a different approach. As it is a paired design, I can calculate the fold-change of normalised gene expression values between each pair of samples and use the fold-change matrix to do a clustering/PCA analysis to identify the outlier samples. This might give me an idea about the samples which behave in a similar manner to treatment effects.

I would like to get more  suggestions on this.

1. What is the best way to calculate fold-change values between each pair. Applying the fold-change formula to the normalised/variancestabilized read counts would be enough ? Or loading each pair of samples into edgeR and DESeq2 to calculate fold-changes would be a good idea ?

2. Any pointers to papers or statistical methods where they have done this sort of analysis to deal with heterogeneity in large data sets is really appreciated.

3. Any other suggestions or statistical methods would be useful.

Thanks.

modified 3.5 years ago by Michael Love25k • written 3.5 years ago by g.atla0
Answer: Approaches to group the RNA-Seq samples with large heterogeneity
1
3.5 years ago by
Michael Love25k
United States
Michael Love25k wrote:

Can you show the plot based on normalized or VST data, scrubbing out any sample identifiers? (FAQ: Image into posts)

"results looks like mixed. I could not clearly identify the subset of samples."

This could mean that there are not clearly defined subsets of samples, no?

1) I would first produce a matrix of fold changes treated/untreated from normalized data plus a pseudocount. You can use normTransform() in DESeq2. The larger the pseudocount, the more shrinkage of fold changes from low counts. You can then try PCA on this matrix. The PCA plot function in DESeq2 is easy to customize:

https://github.com/Bioconductor-mirror/DESeq2/blob/master/R/plots.R#L162-L201

Remember, after log transformation, subtraction corresponds to division in the original space of counts.

Thanks for the response.

#create the paired design.
subjects=factor(c(rep(1:30, each=2)))
treat <- as.factor(rep(c("treat","untreat"),30))
design <- model.matrix(~subjects+treat)

colData <- data.frame(colnames(x),subjects=subjects, treat=treat, row.names=1)

dds <- DESeqDataSetFromMatrix(countData = x, colData = colData, design = ~ subjects + treat)
design(dds) <- formula(~ subjects + treat)
dds <- DESeq(dds)
resdds <- results(dds)
resdds=resdds[order(resdds$padj),] summary(rests,0.05) out of 22348 with nonzero total read count adjusted p-value < 0.05 LFC > 0 (up) : 4983, 22% LFC < 0 (down) : 4880, 22% outliers [1] : 0, 0% low counts [2] : 0, 0%  This looks a bit odd as around 45% genes are shown to be differentially regulated. vsd <- varianceStabilizingTransformation(dds) # PCA Plot of VSD with default parameters ( ntop=500) Here I could see the different groups but this does not reflect in DE genes, as shown below. de <- rownames(resdds[na.omit(resdds)$padj<0.05,])

#PCA plot of VSD of Differentially expressed genes at 0.05 FDR. (vsd[de,])
# create the Fold change matrix
norm <- assay(normTransform(dds))
i <- seq.int(1L,60,by = 2L)

#substract the every treat from corresponding untreated samples. This creates a matrix of 22348 genes and  30 columns.
norm.fc <- norm[,i]-norm[,i+1]
#I would also like to know if my approach for calculating FC is correct.
#Create a DESeq2 object for plotting.
subjects=factor(c(rep(1:30)))
colData <- data.frame(colnames(norm.fc),subjects=subjects, row.names=1)
se <- SummarizedExperiment(norm.fc,colData=DataFrame(colData))
data <- plotPCA(DESeqTransform(se), intgroup=c("subjects"), returnData=TRUE)

#Plot a PCA for FC matrix.


1

hi,

Scanning you code, it looks correct. There's nothing here to make me think that the 45% DEG is wrong, or indicating an error in analysis. It could be that you have 60 samples, and so have good power, there are true differences, and you are testing a null hypothesis of no difference across treatment. I would encourage you to investigate the top genes visually using plotCounts or other methods. Note that, when you have many samples, if you are interested only in large differences, you can use the lfcThreshold argument.

The PCA of all the data looks like the samples were prepared in at least two batches. Is this correct? The 'subject' part of the design should account for the batch difference as long as the pairs were always in the same batch. Is this the case?

I need to check thoroughly the confounding factors. The pairs were always in the same batch i.e processed together. Though , according to my knowledge, the paired-design takes care of batch effects, is there something I could to to remove the batch effects or improve the analysis  ?

Yes, "The 'subject' part of the design should account for the batch difference as long as the pairs were always in the same batch"

To update, The PCA plot divides the samples to two groups and the confounding factor here is Gender. The samples from Males and Females separates out into two groups.

Hi,

I am facing the same problem as g.atla last updated. My PCA instead of grouping the samples by the condition (case / control) is grouping by Gender. So, I have decided to use the following command:

dds <- DESeqDataSetFromMatrix(countData = readcounts, colData = colData, design = ~ gender + condition)


With this, I am obtaining PC1: 62% variance and PC2: 13% variance. Would it be OK?

It's typical that PCA would group by sex, and also typical to include it in the design as you have done.

The percent variance on the PCA plot will not change according to the design formula. It's unsupervised.

Thanks Michael. So, this would be all I can do to control for gender right?

Yes this is the typical way to control for a variable while testing another.

I have a new question Michael related to PCA plot:

When I use:

DESeq.ds<-DESeqDataSetFromMatrix(readcounts, colData=coldata, design=~ sex + batch + condition)

DESeq.ds<-estimateSizeFactors(DESeq.ds)
sizeFactors(DESeq.ds)

vsd <- varianceStabilizingTransformation(DESeq.ds, blind=TRUE)

plotPCA(vsd, intgroup=c("condition"))

I obtain the following PCA where the variance in PC1 and PC2 are very low. Figure attached (PLOT_1).

But when I use this other code:

DESeq.ds<-DESeqDataSetFromMatrix(readcounts, colData=coldata, design=~ sex + batch + condition)

DESeq.ds<-estimateSizeFactors(DESeq.ds)
sizeFactors(DESeq.ds)

se <- SummarizedExperiment(log2(counts(DESeq.ds, normalized=TRUE) + 1),
colData=colData(DESeq.ds))

plotPCA( DESeqTransform( se ) )

The result is totally different and also the variance. (Figure attached, PLOT_2) Are both valid?

From the first plot (low variances), does it mean that something is going wrong? As you told me in the previous message, modifying the dessign does not change the percent variance of the PCA plot. So, what should I do?

Thanks so much.

The second plot isn’t a proper variance stabilization and is not a great method therefore (see our paper).

So, would it suppose a problem having so low variance as in the first plot? My dessign is

~ sex + batch + condition   but if I add the age as a covariate the percentage variance of the PCA plot should be the same? What can I do with the data to obtain an increased variance?

Thanks a lot Michael.