I created a PCA plot for our RNAseq count dataset following the instructions in the vignette, using the rlog transformation. The plot was generated, but I got this warning message when I called the rlog function:
In sparseTest(counts(object, normalized = TRUE), 0.9, 100, 0.1) :
the rlog assumes that data is close to a negative binomial distribution, an assumption
which is sometimes not compatible with datasets where many genes have many zero counts
despite a few very large counts.
In this data, for 15.9% of genes with a sum of normalized counts above 100, it was the case
that a single sample's normalized count made up more than 90% of the sum over all samples.
the threshold for this warning is 10% of genes. See plotSparsity(dds) for a visualization of this.
We recommend instead using the varianceStabilizingTransformation or shifted log (see vignette).
So if I understand this correctly, for 15.9% of the genes whose normalized counts sum to more than 100, a single sample has a very large count that accounts for over 90% of that sum, and the warning is raised because this exceeds the 10% threshold.
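To make sure I am reading the warning correctly, here is a minimal sketch of the check as I understand it from the message text alone. It is written in plain Python rather than R, it is not DESeq2's actual sparseTest code, and the function name, parameter names, and the toy count matrix are all made up for illustration.

```python
def sparsity_fraction(norm_counts, min_sum=100, max_share=0.9):
    """Fraction of genes (among those with row sum > min_sum) where a
    single sample contributes more than max_share of the row sum.

    norm_counts: list of rows, one per gene, each a list of
    per-sample normalized counts. Hypothetical helper, not a DESeq2 API.
    """
    eligible = 0   # genes with sum of normalized counts above min_sum
    dominated = 0  # eligible genes dominated by one sample
    for row in norm_counts:
        total = sum(row)
        if total <= min_sum:
            continue
        eligible += 1
        if max(row) / total > max_share:
            dominated += 1
    return dominated / eligible if eligible else 0.0


# Toy data (invented): 5 genes x 12 samples.
toy_counts = [
    [0, 0, 0, 0, 0, 950, 0, 0, 10, 0, 0, 0],                # one sample dominates
    [80, 90, 100, 110, 95, 85, 100, 90, 105, 95, 100, 90],  # balanced
    [1, 0, 2, 0, 0, 1, 0, 0, 3, 0, 0, 0],                   # sum <= 100, ignored
    [0, 0, 500, 0, 0, 0, 0, 0, 0, 0, 0, 20],                # one sample dominates
    [200, 180, 210, 190, 205, 195, 200, 185, 215, 190, 200, 210],  # balanced
]

fraction = sparsity_fraction(toy_counts)
warn = fraction > 0.1  # 0.1 mirrors the "10% of genes" threshold in the warning
print(fraction, warn)
```

In my data the corresponding fraction is 15.9%, which is why the warning fires.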
However, I am not sure whether this matters for a PCA. I ran the PCA on both the rlog-transformed (rld) and variance-stabilized (vsd) data, and the two plots look very different. Could you help me understand which method is recommended in this case?
I have RNAseq count data generated with HTSeq. There are 6 replicates each in the control and affected groups, and I am interested in the genes that are differentially expressed between the 2 groups. I am doing the PCA mainly as a quality assessment step, to see whether there are any outlier samples in the set. For my heatmaps, I use the vsd-transformed data.