Clustering and bootstrapping of RNASeq samples
Entering edit mode
Last seen 3.0 years ago

Dear all,

I try to perform clustering and bootstrapping of 22 samples based on their gene expression. These samples were sequenced by RNASeq, polyA+ selection. I have several questions concerning what I did and what I could try.

1/ I do not find a lot of bootstrapping of RNASeq samples in the literature. Is it done routinely and not published or for some reasons, bootstrapping is not useful for RNASeq samples?

2/ Should we keep the "zero lines", lines with only 0 counts for all samples, for clustering purpose? In my case, I examined 63568 genes (Gencod annotation). I filtered the "0 lines" and genes with very few counts, thus I keep 2 genes sets with either 39311 or 19533 genes.

3/ What do you think about PVclust package for this task?

4/ What do you use for aggregation link and "distance measurement"? I use the average aggregation and measure the correlation.

5/ Until now, I considered log10(normalized_counts+1), using normalized counts from edgeR and DESeq2. The results are very similar, I will keep on with edgeR. Do you think it can be interesting to use rlog or VSD transformation, as recommended in the DESeq2 vignette, on the normalized expression values of egdeR?

6/ I read that normalized counts from edgeR are not recommended to work with, but rather the cpm values. I got very similar results using normalized counts of DESeq2 and edgeR. I feel it is fine to use them for this purpose. What do you think?

If you could provide some help for some of these questions, it would help me a lot.

Thank you in advance.

clustering pvclust vsd edgeR rnaseq • 1.7k views
Entering edit mode
Aaron Lun ★ 27k
Last seen 15 hours ago
The city by the bay
  1. I don't really know what you mean by bootstrapping. Are you referring to the generation of new libraries by resampling reads? Or are you referring to a permutation test, where you shuffle counts between libraries for each gene? The latter is more obvious, but there's several good reasons why this is not done; see A: Differential expression analysis: Permutation Test for details. 
  2. No. Rows with all zeroes provide no information for anything, and should be removed.
  3. Never used it, though I have seen it used in the occasional paper. It's a bit strange because - from what I understand, at least - the "approximately unbiased p-values" that is reported by pvclust will approach unity as the evidence for the cluster becomes stronger. This is counter to the conventional notion of p-values, where the value becomes smaller with increasing evidence.
  4. I assume you're talking about hierarchical clustering here. There's a lot of different ways to do it, none of which are obviously superior to any other. I prefer to use Euclidean distance, as it tends to be more sensitive to subtle differences than correlations; and the ward.D2 option in hclust, as it gives me nice branch lengths that can be easily cut. Note that if you use correlations, you need to take some care when converting it into a distance; see
  5. edgeR doesn't report normalised counts, so I don't know where you got that. The concept of normalised counts has always been a bit iffy to us, because once you normalise, the values aren't really counts anymore. We prefer to consider them as normalised expression values, which is what cpm will give you. Anyway, for high-abundance genes, we find that the log-transformation does a pretty good job of stabilising the variance (see the voom paper), so we don't do anything extra.
  6. See my answer to 5. It's mostly a matter of notation, because once people have "normalised counts", they think that they can use them in the same manner as the original counts. That's a risky assumption to make, depending on what you want to do. I don't know how you got normalised counts out of edgeR, so I can only recommend that you use the output of cpm.
Entering edit mode

I think Jane is referring to the bootstrap probability values which are generated by pvclust, next to the approximately unbiased p-values. See

I have used pvclust for microarray data once, because a peer-reviewer wanted to see the bootstrap p-values in my dendrograms.

Entering edit mode

Ah. Well, drawing from my hazy memories of undergraduate phylogenetics, I think bootstrapping assumes that the features that you're sampling from (to generate your bootstrap replicates) are independent. This would probably not be the case for genes in RNA-seq data, where you get dependencies due to pathways and regulation and whatnot.

Entering edit mode

Yes, I meant probability values generated by pvclust.

Thank you b.nota

Entering edit mode

Thank you a lot for your responses Aaron.

I add some precisions:

1. B.nota answered correctly.

4. Yes, sorry, I talk about hierarchical clustering.

5. I got what I call normalized counts from edgeR like this:

y <- calcNormFactors(y)
scale <- y$samples$lib.size* y$samples$norm.factors
normCounts <- round(t(t(y$counts)/scale)*mean(scale)) #Normalized counts


From the literature, I have the impression that bootstrapping was more used with microarray (as b.nota was asked to do) than with RNASeq. There are probably not more dependencies on microarray than with RNASeq (except there are more analysed features). I hope to get more feedbacks from people with large experience in clustering and PVClust users.


Login before adding your answer.

Traffic: 342 users visited in the last hour
Help About
Access RSS

Use of this site constitutes acceptance of our User Agreement and Privacy Policy.

Powered by the version 2.3.6