Dear Community,
i would like to ask a more general question about data transformation methodologies, regarding raw RNA-Seq data for exploratory data analysis, and not for DE expression. In detail, i have identified a small 38 gene signature in a specific type of cancer/TCGA dataset, which shows some interesting results about survival, etc. I want next to test this signature to different types of TCGA datasets, to inspect the pattern of the expression of the genes, and also from clustering to compare the survival estimates of any resulted clusters. For my current downloaded TCGA dataset that i would like to test, i have raw HTSEQ counts for 371 cancer samples. Thus:
1) Which type of transformation or transformations should i follow ? For instance the simple log2(counts +1), which might have a negative impact on very low counts ? Or a more "robust" approach coulbe be implemented ? For example, variance stabilizing transformation from DESeq2 would be "enough ? Or i could alternatively try the cpm transformation from edgeR, although they present some distinct characteristics ?
2) If i would like to use the cpm transformation, i should apply it on the raw counts ? For example:
logCPM.counts <- cpm(dt, prior.count=2, log=TRUE)
dt must be a matrix of raw counts, or should be a DGElist object after TMM normalization in order to account also for sequencing depth ?
Thank you in advance,
Efstathios-Iason
Dear Aaron,
thank you for your valuable comment about the extra benefit of implementing also the TMM normalization for the downstream EDA analyses. I will also read in detail the relative paper. One small comment i would like to add:
for the relative construction of the DGElist object, i could just sapply the raw counts ?
count.dat <- DGEList(counts=x)
as i don't have any important phenotype information, and i want to see only how the cancer samples are clustered and separated based on the signature ?
Yes, that is fine,
calcNormFactors
andcpm
don't care about the groupings.Aaron, sorry to return again for this matter, but unfortunately i noticed various negative values in genes, not in all samples but with various frequencies:
even when i subsetted my matrix to the needed 38 genes, from the following histogram of the relative log2 cpm values i still get negative values in various genes:
https://www.dropbox.com/s/ckbv30uo93ns7n4/Histogram.38genes.jpeg?dl=0
Thus, in your opinion this could oppose a problem and i should increase the prior.count from above ? for instance making it 5 or 6 ?
Or i should move directly to heatmap creation and clustering, and even with row scaling ?
Why are negative values a problem?
Hello Ryan, i had an initial "naive" thought if with increasing the prior.count argument, could reduce the variability of the logCPM values for genes with low counts, but perhaps this could shrunk the values close to zero ? My concern, was for clustering and heatmap creation, if the negative values could be a problem.
Negative logCPM values are no problem at all. Why would they be?
You can certainly increase the prior.count to say 5, as Aaron has told you above. Any value in the range 2-5 will usually perform perfectly well. Whether you have negative values or not is, however, of no importance at all.
Dear Gordon, thank you for your comment-actually i was confused in the beginning regarding the appropriate interpetation of what logCPM values actually represent-thus, from my understanding-they measure the "overall expression level" of the each transcript-so any log2 cpm value below < 1 would be negative-
overall, i will try to just increase the prior.count to values like 5, as you and Aaron suggested.
You can increase the prior count to 5 if you want, but you might still get negative logCPM values. This can still happen even for large prior counts when one of the library sizes is much smaller than the others. The specific prior count you use is however unlikely to make much difference to your ultimate analysis.
Now got the complete point-thank you for your extra comment.
Dear Aaron,
sorry to return again on this post, but i would like to add a very important question concering a published paper in which you are the author with title : "It's DE-licious: A Recipe for Differential Expression Analyses of RNA-seq Experiments Using Quasi-Likelihood Methods in edgeR" (https://link.springer.com/protocol/10.1007%2F978-1-4939-3578-9_19).
So, in the specific part of the above pipeline, in section 3.4 :
"Smaller CPM thresholds are usually appropriate for larger libraries. As a general rule, a good threshold can be chosen by identifying the CPM that corresponds to a count of 10, which in this case is about 0.5: "
A) This type of filtering, could be also applied in this scenario i have described for the evaluation of the gene signature ? in the following context ?
B) If my above approach is valid, in order for the filter to be more generalized also in other datasets with an unsupervised way, should i also reduce the number of
and use something lower as 5 instead of 10 ? as my notion is to make a basic filtering to unexpressed genes, in order to improve normalization and transformation, and then subset to the gene signature of interest, as described above-
Thank you in advance,
Efstathios
This has nothing to do with the original question. Make a new post.
Ok Aaron gotcha