Clustering RNASeq samples from different libraries using DESeq2
1
0
Entering edit mode
@jane-merlevede-5019
Last seen 3.6 years ago

Hello everybody,

I have two RNASeq experiments that I would like to combine for clustering purpose. Unfortunately, my 2 experiments are rather different!

Experiment 1: 19 primary tumors, 2 conditions: 13 and 6 tumors per condition Paired-ends, 75bp, Next Seq 500 TruSeq Stranded mRNA

Experiment 2: 12 tumor cell lines, the two same conditions: 6 and 6 cell lines per condition Paired-ends, 100bp, Illumina HiSeq4000 TruSeq total RNA Stranded

I need to merge these two experiments since 4 of the cell lines derived from 4 primary tumors. The question is: Will a tumor and its cell line cluster together?

When performing an ascending hierarchical clustering on the sample-to-sample distances with DESeq2 after merging the samples of the two experiments into one study, my samples cluster by library type with both rlog and vsd transformations.
It is not surprising since a lot of genes should have an expression value when using ribodepletion whereas they won't be captured by polyA selection.

I would simply like to eliminate genes that have 0 counts in polyA experiment and counts beyond a certain threshold for at least one sample of the ribodepletion experiment.

I started to check the number of lowly expressed genes in both experiments. It seems that there is no less expressed genes in the experiment with polyA than in the experiment with ribodepletion: for example, 7360 genes have normalized counts below 20 for each 19 samples (polyA) and 9027 genes have normalized counts below 20 for each 12 samples (ribo).
6047 genes have raw counts below 10 for each 19 samples (polyA) and 4473 genes have raw counts below 10 for each 12 samples (ribo). I expected the contrary.

The "ribo" experiment was a bit more sequenced: 77 10^6 raw reads (60-102) for polyA vs 93 10^6 raw reads (73-130) for ribodepletion per strand. The number of (pairs of) reads falling into exons of genes is similar in both experiments: 44 10^6 (30-56) vs 45 10^6 (33-70).

Should I exclude genes not captured by both libraries before sizeFactors estimation? When looking at the log2 of raw and normalized reads counts for the samples of the two experiments, I can still distinguish the two experiments. Might the removal of genes with different expression capture improve sample normalisation?

Do some of you had such problem and eventually a nice solution for exploiting such data?
Jane

clustering DESeq2 • 1.7k views
2
Entering edit mode
@mikelove
Last seen 14 hours ago
United States

hi Jane,

Size factor normalization alone won't help to remove the differences due to protocol. There are many different ways that protocol affects the observed counts, not just by a single multiplicative factor for each sample. Also I wouldn't expect that removing the 0 counts for protocol would do it either.

For testing, you would want to include the protocol in the design formula, which would allow for a shift in the log2 counts for each gene. This is done simply by adding "protocol" to the design formula, you don't alter the counts in any way.

For clustering and EDA, you could do the same thing: remove a shift in the transformed counts across protocol for each gene. This can be done by running removeBatchEffect from the limma package on the transformed counts:

mat <- assay(vsd)
library(limma)
mat2 <- removeBatchEffect(mat, vsd\$protocol)

0
Entering edit mode

My first goal was to check that ribodepletion captured more "genes" than polyA as it is expected. I was not able to verify this with the tests I ran.

I do not figure out how a correction for batch effect could be performed in my case if the genes whose expression is library dependent are not removed, since the batch is not shared across the experiments. It is one library per experiment. So, how is it possible to distinguish library from sample type effects if none primary tumors has been sequenced on riboZero and none cell line has been sequenced on polyA?

I will look carefully at removeBatchEffect(), the info on this function could help. From what I quickly tested, it mixed indeed the samples of the 2 experiments. That's a progress!

Jane

0
Entering edit mode

I'm not sure if this is the source of confusion, but I'm suggesting using this function and providing it the protocol (a factor variable taking values: poly-A or ribo depleted), but nothing to do with experimental batches. So ignoring the name of the function, we're just using it to remove shifts associated with protocol for the purpose of visualizing residual differences in the transformed data.

0
Entering edit mode

Oh sorry, I see the problem now. There is confounding between condition of interest and protocol. I don't know of a solution to help you here. There are so many, very large differences in the normalized counts or expression estimates due to protocol.

0
Entering edit mode

Yes, the protocol is confusing here. I can sum up the 2 experiments like this:

Cond1      Cond2

batch1:   6 ribo       6 ribo     --> cell lines

batch2:   6 polyA   13 polyA --> primary tumors

My question is: will the cell lines cluster with their primary tumors? (There are 2 cell lines related to primary tumours in Cond1 and 2 cell lines related to primary tumours in Cond2).

Correcting for batch effect means correcting differences between tumours and cell lines. These differences should help for the clustering...

0
Entering edit mode

I don't think there is a way to computationally remove the substantial differences in normalized counts or expression estimates due to protocol (ribosomal vs polyA) so that you can compare cell lines to primary tumors.

The confounding means, you are left with differences that are a sum of the cell line vs primary tumor difference and the ribo vs polyA differences, and there is not a way to disentangle these.

0
Entering edit mode

The only way I see is to eliminate most of the library effects, which is mainly, to me, due to genes captured by only one library.

To work on a set of genes whose expression is not library dependent, polyadenylated transcripts seem to be good candidates, since they are supposed to be captured by both methods.

But how to distinguish them from the others?

0
Entering edit mode

So, I've been investigating this recently, and I can tell you there are many more protocol effects (polyA vs ribosomal) than just whether or not it was captured. The protocol has a strong influence on the count that you observe.