I'm using edgeR (in the Trinity-RNA Seq Pipeline) to do a differential expression analysis and when I try to make a heatmap I got the following error:
> gene_cor = NULL
> gene_dist = dist(data, method='euclidean')
> if (nrow(data) <= 1) { message('Too few genes to generate heatmap'); quit(status=0); }
> hc_genes = hclust(gene_dist, method='complete')
Error in hclust(gene_dist, method = "complete") :
size cannot be NA nor exceed 65536
Execution halted
In this case I found over 80k DEGs and I can't get make it work. However if I re-run all but with more strict parameter I get less DEGs and the error don't show up.
This is only tangentially related to Bioconductor, and certainly has nothing to do with Bioc packages, so you should probably be asking about this on R-help (r-help@r-project.org).
That said, if you are aligning against a de novo transcriptome, you should note that Trinity turns everything into a transcript, so there are likely to be lots of 'transcripts' that are probably not representative of anything real, that you need to filter out prior to doing any comparisons. There are lots of ways to do this; one way that has been recommended on this site before (by Ryan Thompson, IIRC) that I sort of like is to plot the distribution of the rowSums of the logCPM values. This is usually a bimodal distribution, and it usually seems pretty safe to select some cutoff that excludes the hump on the left and keeps the hump on the right.
That should get you down to a tractable number of genes to compare.
James has suggested that you filter out low expressed transcripts. That is very sensible but I doubt it will be sufficient in itself and you should have already done this sort of filtering anyway before the DE analysis. I would suggest you use edgeR's glmTreat() function to choose a smaller number of transcripts with larger fold changes changes. You can't see tens of thousands of genes on a heatmap anyway. It's far more sensible to plot a carefully chosen subset of transcripts. Otherwise what are you doing the heatmap for?
For starters, this isn't an edgeR question. It's hclust (from the stats package) that's throwing the error.
As for the error, it pretty much explains itself. You have too many genes and hclust doesn't want to deal with the resulting distance matrix. Maybe Rclusterpp might be able to handle it without using up too much memory.
I should also point out that 80K is a lot of DEGs. One can only pity the poor soul who has to make sense of it. There's a reason why most people just do DE analyses on gene counts if a reference genome is available.
Thanks for the quick response! I will try that!
James has suggested that you filter out low expressed transcripts. That is very sensible but I doubt it will be sufficient in itself and you should have already done this sort of filtering anyway before the DE analysis. I would suggest you use edgeR's glmTreat() function to choose a smaller number of transcripts with larger fold changes changes. You can't see tens of thousands of genes on a heatmap anyway. It's far more sensible to plot a carefully chosen subset of transcripts. Otherwise what are you doing the heatmap for?