Dear all,
I am trying to cluster 22 samples based on their gene expression and to assess the clusters by bootstrapping. These samples were sequenced by RNA-seq with polyA+ selection. I have several questions about what I did and what I could try.
1/ I do not find much bootstrapping of RNA-seq samples in the literature. Is it done routinely but not published, or is there some reason why bootstrapping is not useful for RNA-seq samples?
2/ Should we keep the "zero lines" (lines with 0 counts in all samples) for clustering? In my case, I examined 63568 genes (GENCODE annotation). I filtered out the "zero lines" and the genes with very few counts, which leaves me with two gene sets of 39311 and 19533 genes.
3/ What do you think about the pvclust package for this task?
4/ What do you use as linkage method and distance measure? I use average linkage with a correlation-based distance.
5/ So far, I have used log10(normalized_counts+1), with normalized counts from edgeR and from DESeq2. The results are very similar, so I will continue with edgeR. Do you think it would be worthwhile to apply the rlog or variance-stabilizing transformation (VST), as recommended in the DESeq2 vignette, to the normalized expression values from edgeR?
6/ I read that working with normalized counts from edgeR is not recommended, and that cpm values should be used instead. I got very similar results with the normalized counts from DESeq2 and from edgeR, so I feel it is fine to use them for this purpose. What do you think?
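For concreteness, the filtering in 2/ and the log transform in 5/ can be sketched roughly as follows. This is a plain-Python illustration on made-up toy counts, not the edgeR or DESeq2 code used for the real data. The function names and the `min_total` threshold are hypothetical, and the cpm here uses raw library sizes, without the normalization factors that edgeR would apply.

```python
import math

def drop_low_count_genes(counts, min_total=1):
    """Keep genes whose summed count over all samples is at least min_total.

    min_total=1 removes only the "zero lines"; a larger (hypothetical)
    threshold also removes genes with very few counts.
    """
    return {gene: row for gene, row in counts.items() if sum(row) >= min_total}

def counts_per_million(counts):
    """Scale each sample (column) by its raw library size, times one million.

    Assumption: library sizes are taken from the table passed in, with no
    TMM or other normalization factors.
    """
    n_samples = len(next(iter(counts.values())))
    lib_sizes = [sum(row[j] for row in counts.values()) for j in range(n_samples)]
    return {
        gene: [1e6 * row[j] / lib_sizes[j] for j in range(n_samples)]
        for gene, row in counts.items()
    }

def log10_plus_one(table):
    """Apply log10(x + 1) to every value."""
    return {gene: [math.log10(v + 1) for v in row] for gene, row in table.items()}

# Toy data: three genes, three samples (invented for illustration).
toy = {
    "geneA": [0, 0, 0],     # a "zero line"
    "geneB": [10, 20, 30],
    "geneC": [90, 80, 70],
}
kept = drop_low_count_genes(toy)
logged = log10_plus_one(counts_per_million(kept))
```

Note that in this sketch the library sizes are computed after filtering; whether to compute them before or after is a separate choice.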
If you could provide some help for some of these questions, it would help me a lot.
Thank you in advance.
I think Jane is referring to the bootstrap probability values which are generated by pvclust, next to the approximately unbiased p-values. See http://www.sigmath.es.osaka-u.ac.jp/shimo-lab/prog/pvclust/
I have used pvclust for microarray data once, because a peer-reviewer wanted to see the bootstrap p-values in my dendrograms.
Ah. Well, drawing from my hazy memories of undergraduate phylogenetics, I think bootstrapping assumes that the features that you're sampling from (to generate your bootstrap replicates) are independent. This would probably not be the case for genes in RNA-seq data, where you get dependencies due to pathways and regulation and whatnot.
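To make the resampling concrete, here is a minimal sketch of what bootstrapping genes means for a sample dendrogram. It is plain Python on toy data, and only an illustration under stated assumptions. It uses 1 - Pearson correlation as the distance and average linkage, as mentioned in the question. It computes only the raw bootstrap proportion for each cluster, not the approximately unbiased p-values from pvclust. All function names are hypothetical.

```python
import math
import random

def pearson(x, y):
    """Pearson correlation; returns 0.0 if either vector is constant."""
    n = len(x)
    mx, my = sum(x) / n, sum(y) / n
    sxy = sum((a - mx) * (b - my) for a, b in zip(x, y))
    sxx = sum((a - mx) ** 2 for a in x)
    syy = sum((b - my) ** 2 for b in y)
    if sxx == 0 or syy == 0:
        return 0.0
    return sxy / math.sqrt(sxx * syy)

def average_linkage_clusters(columns):
    """Cluster samples and return every merged cluster as a frozenset of indices."""
    n = len(columns)
    d = {(i, j): 1 - pearson(columns[i], columns[j])
         for i in range(n) for j in range(i + 1, n)}

    def dist(a, b):
        # Average linkage: mean pairwise distance between members of a and b.
        total = sum(d[(min(i, j), max(i, j))] for i in a for j in b)
        return total / (len(a) * len(b))

    active = [frozenset([i]) for i in range(n)]
    found = set()
    while len(active) > 1:
        pairs = [(a, b) for k, a in enumerate(active) for b in active[k + 1:]]
        a, b = min(pairs, key=lambda p: dist(*p))
        merged = a | b
        active = [c for c in active if c not in (a, b)] + [merged]
        found.add(merged)
    return found

def bootstrap_support(matrix, n_boot=100, seed=0):
    """matrix: list of genes (rows), each a list of per-sample values.

    Genes are resampled with replacement, which treats them as independent.
    Returns, for each cluster of the original tree, the fraction of
    bootstrap trees that contain the same cluster.
    """
    rng = random.Random(seed)
    n_genes = len(matrix)

    def to_columns(rows):
        return [list(col) for col in zip(*rows)]

    reference = average_linkage_clusters(to_columns(matrix))
    hits = dict.fromkeys(reference, 0)
    for _ in range(n_boot):
        resampled = [matrix[rng.randrange(n_genes)] for _ in range(n_genes)]
        boot = average_linkage_clusters(to_columns(resampled))
        for cluster in reference:
            if cluster in boot:
                hits[cluster] += 1
    return {cluster: hits[cluster] / n_boot for cluster in reference}

# Toy expression matrix: 6 genes x 4 samples (invented for illustration).
toy = [
    [10, 11, 1, 2],
    [9, 10, 2, 1],
    [1, 2, 10, 9],
    [2, 1, 11, 10],
    [5, 5, 5, 6],
    [8, 9, 3, 2],
]
support = bootstrap_support(toy, n_boot=100, seed=0)
```

The line that draws `resampled` is where the independence assumption enters: each gene is drawn as if it carried separate evidence, so co-regulated genes can inflate the apparent support.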
Yes, I meant probability values generated by pvclust.
Thank you b.nota
Thank you very much for your responses, Aaron.
Some clarifications:
1. b.nota answered correctly.
4. Yes, sorry, I am talking about hierarchical clustering.
5. I got what I call normalized counts from edgeR like this:
From the literature, I have the impression that bootstrapping was used more with microarrays (as b.nota was asked to do) than with RNA-seq. There are probably no more dependencies between features in microarray data than in RNA-seq data (except that there are more analysed features). I hope to get more feedback from people with extensive experience in clustering and from pvclust users.