Question

Clustering and bootstrapping of RNASeq samples

1

Entering edit mode

Jane Merlevede ▴ 90

@jane-merlevede-5019

Last seen 5.6 years ago

Dear all,

I try to perform clustering and bootstrapping of 22 samples based on their gene expression. These samples were sequenced by RNASeq, polyA+ selection. I have several questions concerning what I did and what I could try.

1/ I do not find a lot of bootstrapping of RNASeq samples in the literature. Is it done routinely and not published or for some reasons, bootstrapping is not useful for RNASeq samples?

2/ Should we keep the "zero lines", lines with only 0 counts for all samples, for clustering purpose? In my case, I examined 63568 genes (Gencod annotation). I filtered the "0 lines" and genes with very few counts, thus I keep 2 genes sets with either 39311 or 19533 genes.

3/ What do you think about PVclust package for this task?

4/ What do you use for aggregation link and "distance measurement"? I use the average aggregation and measure the correlation.

5/ Until now, I considered log10(normalized_counts+1), using normalized counts from edgeR and DESeq2. The results are very similar, I will keep on with edgeR. Do you think it can be interesting to use rlog or VSD transformation, as recommended in the DESeq2 vignette, on the normalized expression values of egdeR?

6/ I read that normalized counts from edgeR are not recommended to work with, but rather the cpm values. I got very similar results using normalized counts of DESeq2 and edgeR. I feel it is fine to use them for this purpose. What do you think?

If you could provide some help for some of these questions, it would help me a lot.

Thank you in advance.

clustering pvclust vsd edgeR rnaseq • 3.0k views

ADD COMMENT • link written 7.9 years ago by Jane Merlevede ▴ 90

score 1 · Answer 1 · 2016-05-24

I don't really know what you mean by bootstrapping. Are you referring to the generation of new libraries by resampling reads? Or are you referring to a permutation test, where you shuffle counts between libraries for each gene? The latter is more obvious, but there's several good reasons why this is not done; see A: Differential expression analysis: Permutation Test for details.
No. Rows with all zeroes provide no information for anything, and should be removed.
Never used it, though I have seen it used in the occasional paper. It's a bit strange because - from what I understand, at least - the "approximately unbiased p-values" that is reported by pvclust will approach unity as the evidence for the cluster becomes stronger. This is counter to the conventional notion of p-values, where the value becomes smaller with increasing evidence.
I assume you're talking about hierarchical clustering here. There's a lot of different ways to do it, none of which are obviously superior to any other. I prefer to use Euclidean distance, as it tends to be more sensitive to subtle differences than correlations; and the ward.D2 option in hclust, as it gives me nice branch lengths that can be easily cut. Note that if you use correlations, you need to take some care when converting it into a distance; see https://arxiv.org/pdf/1208.3145.pdf.
edgeR doesn't report normalised counts, so I don't know where you got that. The concept of normalised counts has always been a bit iffy to us, because once you normalise, the values aren't really counts anymore. We prefer to consider them as normalised expression values, which is what cpm will give you. Anyway, for high-abundance genes, we find that the log-transformation does a pretty good job of stabilising the variance (see the voom paper), so we don't do anything extra.
See my answer to 5. It's mostly a matter of notation, because once people have "normalised counts", they think that they can use them in the same manner as the original counts. That's a risky assumption to make, depending on what you want to do. I don't know how you got normalised counts out of edgeR, so I can only recommend that you use the output of cpm.