I am trying to perform clustering with bootstrapping on 22 samples based on their gene expression. The samples were sequenced by RNA-Seq with polyA+ selection. I have several questions about what I have done so far and what I could try next.
1/ I do not find much bootstrapping of RNA-Seq samples in the literature. Is it done routinely but not published, or is there a reason why bootstrapping is not useful for RNA-Seq samples?
2/ Should we keep the "zero lines" (genes with 0 counts in all samples) for clustering purposes? In my case, I examined 63568 genes (GENCODE annotation). I filtered out the "zero lines" and the genes with very few counts, which gives me two gene sets, with either 39311 or 19533 genes.
3/ What do you think of the pvclust package for this task?
4/ Which linkage method and distance measure do you use? I use average linkage with a correlation-based distance.
5/ So far, I have used log10(normalized_counts+1), with normalized counts from edgeR and DESeq2. The results are very similar, so I will continue with edgeR. Do you think it would be worthwhile to apply the rlog or VST transformation, as recommended in the DESeq2 vignette, to the normalized expression values from edgeR?
6/ I read that working with normalized counts from edgeR is not recommended, and that cpm values should be used instead. However, I got very similar results with the normalized counts from DESeq2 and from edgeR, so I feel it is fine to use them for this purpose. What do you think?
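For concreteness, the workflow described in questions 2 to 6 can be sketched as below. This is a minimal Python illustration on toy data, not the R code behind the analysis, and it rests on several assumptions: the function names, the toy counts and the `min_total` threshold are hypothetical; CPM is computed from raw library sizes (edgeR's `cpm()` would use effective library sizes that include the normalization factors); and the bootstrap is an ordinary resampling of genes with replacement, not the multiscale bootstrap that pvclust uses for its AU p-values.

```python
import math
import random
from itertools import combinations


def filter_genes(counts, min_total=0):
    """Drop genes (rows) whose total count across samples is <= min_total."""
    return [row for row in counts if sum(row) > min_total]


def log_cpm(counts):
    """log10(CPM + 1) per sample, using raw library sizes (column sums)."""
    lib = [sum(col) for col in zip(*counts)]
    return [[math.log10(c / n * 1e6 + 1) for c, n in zip(row, lib)]
            for row in counts]


def corr_dist(x, y):
    """1 - Pearson correlation; constant vectors are treated as r = 0."""
    mx, my = sum(x) / len(x), sum(y) / len(y)
    sxy = sum((a - mx) * (b - my) for a, b in zip(x, y))
    sxx = sum((a - mx) ** 2 for a in x)
    syy = sum((b - my) ** 2 for b in y)
    if sxx == 0 or syy == 0:
        return 1.0
    return 1.0 - sxy / math.sqrt(sxx * syy)


def average_linkage(expr):
    """Cluster samples (columns of expr).

    Returns the merges in order as (members_a, members_b, height),
    where members are tuples of column indices.
    """
    cols = list(zip(*expr))
    d = {(i, j): corr_dist(cols[i], cols[j])
         for i, j in combinations(range(len(cols)), 2)}
    clusters = [(i,) for i in range(len(cols))]
    merges = []

    def link(a, b):
        # Mean of all pairwise sample distances between clusters a and b.
        total = sum(d[min(i, j), max(i, j)] for i in a for j in b)
        return total / (len(a) * len(b))

    while len(clusters) > 1:
        a, b = min(combinations(clusters, 2), key=lambda p: link(*p))
        merges.append((a, b, link(a, b)))
        clusters = [c for c in clusters if c not in (a, b)] + [a + b]
    return merges


def bootstrap_support(expr, group, n_boot=100, seed=0):
    """Fraction of gene-resampled replicates in which `group` forms a cluster."""
    rng = random.Random(seed)
    target = frozenset(group)
    hits = 0
    for _ in range(n_boot):
        boot = [rng.choice(expr) for _ in expr]  # resample genes with replacement
        found = {frozenset(a + b) for a, b, _ in average_linkage(boot)}
        hits += target in found
    return hits / n_boot


# Toy raw counts: rows are genes, columns are 4 samples (hypothetical values).
raw = [
    [0, 0, 0, 0],
    [120, 130, 15, 10],
    [30, 25, 200, 220],
    [500, 480, 90, 100],
    [8, 12, 60, 75],
    [300, 310, 305, 290],
    [45, 50, 400, 380],
]
expr = log_cpm(filter_genes(raw))
merges = average_linkage(expr)
support = bootstrap_support(expr, group=(0, 1))
```

Here `merges` lists the joins in order with their heights, and `support` is the fraction of bootstrap replicates in which samples 0 and 1 end up together in one cluster. Raising `min_total` in `filter_genes` would also drop the genes with very few counts.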
Any help with some of these questions would be much appreciated.
Thank you in advance.