Question: Paired design, normalisation and batch effects.
0
3.3 years ago by
g.atla0
g.atla0 wrote:

Dear All,

I would like to clarify certain doubts and have opinions on the rna-seq data analysis I have been doing.

I have RNA-Seq data from paired experiment i.e I have treated and untreated data for the same tissue which is sequenced in the same batch. I have data from 40 different biological samples ( 40 treated and 40 untreated ) and more to come. This data tends to be heterogenous, so I am trying to have higher number of replicates to identify true biological signal.

Now, as this is a paired design, there is no need to worry about the batch effects. But I need to do some exploratory data analysis to see if all the samples shows similar response or if there are any subset of samples that behave differently ( due to various reasons like degradation, stress etc ) . So to identify different responses in the large panel of samples, I am doing clustering analysis along with PCA. Here the problem with the clustering if I take the normalised counts is that when the counts of treat in a sample closely matches with the untreated of another sample, they both (treat and untreated) tend to cluster together. But I want to cluster the samples based on the response to treatment i.e I don't care about the counts but I care if the expression has changed in the sample w.r.t treatment. So I have started to work on the log2 fold change ( calculated from normTransform() of DESeq2 ) of each pair such that the log2 fold change represents the change in expression with in each sample and this log2 fold changes can be compared among samples such that the change in expression of genes could tell me which samples respond to treatment. So instead of working on normalised counts, I am working on a matrix of log2 fold changes of 40 biological replicates.

I do not know if its the best way to do exploratory data analysis on pair wise log2 fold change matrix. The log2 fold changes tend to be very small.  Are there any better ways to identify the samples that show a different response to treatment ? I am using the default normalisation for DE analysis. Should I consider any other normalisation methods for paired design ? Any inputs are greatly appreciated.

sva deseq2 rna-seq eda • 718 views
modified 3.3 years ago by Bernd Klaus560 • written 3.3 years ago by g.atla0

Do you have replicates for the tissues, or is it a single pair per tissue?

I have updated the post. I have 40 biological replicates ( from different patients ) of the same tissue, with treated and untreated data. I need to check which samples showing similar response and if there are any non responders and samples that are responding differently ( due to stress or the the age/gender/BMI of patients etc ).

Answer: Paired design, normalisation and batch effects.
1
3.3 years ago by
Bernd Klaus560
Germany
Bernd Klaus560 wrote:

Hi,

what I find useful in paired two-group designs, is too remove the patient effect and then produce a PCA plot again.

Assuming you did an rlog transform, you can simply use the removeBatchEffect() function from limma:

edata <- assay(rld)
edata <- removeBatchEffect(edata, batch = patient),
design =  model.matrix( ~ condition))

This can change the picture dramatically. As an example, see the following two plots:

1.) Before

http://b-klaus.de/paired_data.svg

2.) After

http://b-klaus.de/paired_data_cleaned.svg

Where we can see that the "before / after" labels for patient 207 have probably been switched.

Note that you should not use the cleaned (i.e. sample effect removed) data  for downstream DE analysis, but rather add the patient factor in this case as a blocking variable.

Bernd

The major concern here is not to remove batch effects for DE analysis. The major goal here is to understand the different responses among the samples i.e do all samples show similar behaviour in gene expression changes before and after treatment or are there any subset of samples that show different response.

PCA considers only the to 'n' genes ( eg. n=500 ) with highest variance across samples, but to understand different responses, I need to use all the genes with certain expression levels and then do EDA like gene/sample clustering.

Sorry if my title is misleading.

thank you very much for the clarification!

In this case, you can just use a clustering method of your choice on the cleaned data or your log-FCs if you want to see which samples cluster together, can't you? I personally have found gaussian mixtures work well, i.e. as in mclust or Rmixmod

http://www.stat.washington.edu/mclust/

http://www.mixmod.org/

If you want to find clusters of genes, you might want to take a look at "coseq":

http://biorxiv.org/content/early/2016/07/24/065607