Question: Robust transformation of raw RNA-seq counts for exploratory data analysis and hierarchical clustering
0
6 months ago by
svlachavas610
Greece/Athens/National Hellenic Research Foundation
svlachavas610 wrote:

Dear Community,

I would like to ask a more general question about data transformation methodologies for raw RNA-Seq data, intended for exploratory data analysis rather than for differential expression. In detail, I have identified a small 38-gene signature in a specific type of cancer (a TCGA dataset), which shows some interesting results about survival, etc. Next, I want to test this signature on different TCGA datasets, to inspect the expression pattern of the genes and, after clustering, to compare the survival estimates of any resulting clusters. For the TCGA dataset that I have currently downloaded and would like to test, I have raw HTSeq counts for 371 cancer samples. Thus:

1) Which type of transformation or transformations should I use? For instance the simple log2(counts + 1), which might have a negative impact on very low counts? Or could a more "robust" approach be implemented? For example, would the variance stabilizing transformation from DESeq2 be enough? Or could I alternatively try the cpm transformation from edgeR, although the two approaches have some distinct characteristics?

2) If I would like to use the cpm transformation, should I apply it on the raw counts? For example:

logCPM.counts <- cpm(dt, prior.count=2, log=TRUE)

Must dt be a matrix of raw counts, or should it be a DGEList object after TMM normalization, in order to also account for sequencing depth?

Efstathios-Iason

modified 6 months ago by Wolfgang Huber13k • written 6 months ago by svlachavas610
2
6 months ago by
Aaron Lun21k
Cambridge, United Kingdom
Aaron Lun21k wrote:

I don't see a problem with using cpm with log=TRUE and a large prior.count (3-5). The log-transformation provides some measure of variance stabilisation for count data, with the added bonus that differences between log-values directly represent log-fold changes (which is what we're usually interested in anyway). The large prior count shrinks log-differences towards zero to reduce the influence of small counts.
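To make the shrinkage effect concrete, here is a minimal base-R sketch. It assumes two samples with equal library sizes and simply adds a constant prior to both counts, which is a simplification of what cpm does (edgeR also scales the prior by library size). The counts and the helper name `shrunk_lfc` are made up for illustration.

```r
# Hypothetical counts for two genes (one low, one high) in two samples.
# Both genes have a true 4-fold difference, i.e. a log2 fold change of 2.
n1 <- c(low = 1, high = 1000)
n2 <- c(low = 4, high = 4000)

# Log2 difference after adding a constant prior count to both samples.
# Simplification: equal library sizes, so no library-size scaling is needed.
shrunk_lfc <- function(n1, n2, prior) log2((n2 + prior) / (n1 + prior))

# Rows are genes, columns are prior counts.
lfc <- sapply(c(0.5, 2, 5), function(p) shrunk_lfc(n1, n2, p))
colnames(lfc) <- c("prior0.5", "prior2", "prior5")
print(round(lfc, 2))
```

For the low-count gene the log-difference moves towards zero as the prior grows, while the high-count gene stays close to 2.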

If you apply cpm on the raw counts, it will automatically use the column sums as the library sizes. If you apply cpm on a DGEList after TMM normalization, it will use the effective library sizes, i.e., the product of the library size and normalization factor. Both approaches account for sequencing depth, but the latter also accounts for composition biases - check out the TMM paper. It doesn't take much effort, so just do it.
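The difference between the two library-size definitions can be sketched in base R without calling edgeR. The counts and the normalization factors here are invented; in a real analysis the factors would come from calcNormFactors rather than being typed in by hand.

```r
# Hypothetical raw counts: 3 genes x 2 samples.
counts <- matrix(c(10, 200, 790,
                   30, 500, 1470),
                 nrow = 3,
                 dimnames = list(c("g1", "g2", "g3"), c("s1", "s2")))

lib.size <- colSums(counts)                # column sums, used for a plain count matrix
norm.factors <- c(s1 = 1.1, s2 = 1 / 1.1)  # made-up stand-ins for TMM factors
eff.lib.size <- lib.size * norm.factors    # effective library sizes

# Divide each column (sample) by its library size and scale to one million.
cpm.raw <- t(t(counts) / lib.size) * 1e6
cpm.eff <- t(t(counts) / eff.lib.size) * 1e6

print(round(cpm.raw))
print(round(cpm.eff))
```

A sample with a normalization factor above 1 gets a larger effective library size and hence slightly lower CPM values, and vice versa.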

Dear Aaron,

Thank you for your valuable comment about the extra benefit of also applying TMM normalization for the downstream EDA. I will also read the relevant paper in detail. One small question I would like to add:

for the construction of the DGEList object, could I just supply the raw counts?

count.dat <- DGEList(counts=x)

as I don't have any important phenotype information, and I only want to see how the cancer samples cluster and separate based on the signature?

1

Yes, that is fine, calcNormFactors and cpm don't care about the groupings.

Aaron, sorry to return to this matter again, but unfortunately I noticed negative values for various genes, not in all samples but with varying frequencies:

# a small initial filtering from the R package TCGAbiolinks
lihc_filt <- TCGAanalyze_Filtering(tabDF = assay(lihc.exp),
                                   method = "quantile", qnt.cut = 0.25)

y <- DGEList(counts = lihc_filt)

logCPM.counts <- cpm(y, prior.count=2, log=TRUE)

TCGA-DD-AAE4-01A-11R-A41C-07 TCGA-BD-A3EP-01A-11R-A22L-07
TSPAN6                      5.8497451                     6.809613
TNMD                       -3.3093104                    -3.884246
TCGA-DD-AAW1-01A-11R-A41C-07 TCGA-5R-AAAM-01A-12R-A41C-07
TSPAN6                       8.219309                    6.2252642
TNMD                        -4.035423                   -4.5546560
TCGA-DD-A4NO-01A-11R-A28V-07 TCGA-G3-A3CK-01A-11R-A213-07
TSPAN6                       7.408457                    5.8443492
TNMD                        -4.009086                   -2.1019193
TCGA-DD-A4NS-01A-11R-A311-07 TCGA-CC-A7II-01A-11R-A33J-07
TSPAN6                       5.945970                     6.495122
TNMD                        -3.969267                    -4.554656

Even when I subset my matrix to the 38 genes of interest, the following histogram of the log2 CPM values still shows negative values for various genes:

https://www.dropbox.com/s/ckbv30uo93ns7n4/Histogram.38genes.jpeg?dl=0

Thus, in your opinion, could this pose a problem, and should I increase the prior.count from above, for instance to 5 or 6?

Or should I move directly to heatmap creation and clustering, even with row scaling?

Why are negative values a problem?

Hello Ryan, I had an initial "naive" thought that increasing the prior.count argument could reduce the variability of the logCPM values for genes with low counts, but perhaps this could also shrink the values towards zero? My concern was whether the negative values could be a problem for clustering and heatmap creation.

1

Negative logCPM values are no problem at all. Why would they be?

You can certainly increase the prior.count to say 5, as Aaron has told you above. Any value in the range 2-5 will usually perform perfectly well. Whether you have negative values or not is, however, of no importance at all.
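One way to see why the sign of the values does not matter for clustering is that Euclidean distances and row-scaled values are unchanged when a constant is added to every entry. The sketch below uses simulated numbers rather than real logCPM values, and average linkage is an arbitrary choice made for the illustration.

```r
set.seed(1)
# Simulated log-CPM-like values: 5 genes x 6 samples, many of them negative.
logcpm <- matrix(rnorm(30, mean = -1, sd = 2), nrow = 5,
                 dimnames = list(paste0("gene", 1:5), paste0("s", 1:6)))
shifted <- logcpm + 10   # the same data shifted upward by a constant

# Euclidean distances between samples are unchanged by the shift.
d1 <- dist(t(logcpm))
d2 <- dist(t(shifted))

# The row-scaled (z-score) matrix often used for heatmaps is unchanged too.
z1 <- t(scale(t(logcpm)))
z2 <- t(scale(t(shifted)))

# Hierarchical clustering of the samples therefore gives the same tree.
h1 <- hclust(d1, method = "average")
h2 <- hclust(d2, method = "average")
print(all.equal(as.vector(d1), as.vector(d2)))
```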

Dear Gordon, thank you for your comment. Actually, I was confused in the beginning about the appropriate interpretation of what logCPM values represent. From my understanding, they measure the "overall expression level" of each transcript, so any CPM value below 1 would give a negative log2 CPM.

Overall, I will try to just increase the prior.count to a value like 5, as you and Aaron suggested.

1

You can increase the prior count to 5 if you want, but you might still get negative logCPM values. This can still happen even for large prior counts when one of the library sizes is much smaller than the others. The specific prior count you use is however unlikely to make much difference to your ultimate analysis.
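A small numerical sketch of why negative values persist: a CPM below 1 always has a negative log2. The helper `log_cpm` below follows my understanding of how edgeR documents the prior count (scaled in proportion to library size, with twice the scaled prior added to the library size); treat that formula as an assumption rather than the package's own code. The library sizes are hypothetical.

```r
# Sketch of log2-CPM with a library-size-scaled prior count.
# Assumption: this mirrors the documented behaviour of cpm(log=TRUE);
# it is not edgeR's implementation.
log_cpm <- function(count, lib.size, all.lib.sizes, prior.count) {
  prior.scaled <- prior.count * lib.size / mean(all.lib.sizes)
  log2((count + prior.scaled) / (lib.size + 2 * prior.scaled) * 1e6)
}

lib.sizes <- c(20e6, 25e6, 30e6)   # hypothetical library sizes

# A gene with zero reads in the first sample, for several prior counts.
zero_gene <- sapply(c(2, 5, 10), function(p)
  log_cpm(0, lib.sizes[1], lib.sizes, p))
names(zero_gene) <- c("prior2", "prior5", "prior10")
print(round(zero_gene, 2))

# Reads needed in a 20-million-read library to reach CPM = 1 (log2-CPM = 0).
reads_for_cpm1 <- 1 * 20e6 / 1e6
print(reads_for_cpm1)
```

With libraries of tens of millions of reads, a gene needs a double-digit count just to reach a logCPM of zero, so negative values for lowly expressed genes are expected regardless of the prior.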

Now I get the complete point. Thank you for your extra comment.

Dear Aaron,

Sorry to return to this post again, but I would like to add an important question concerning a published paper of which you are an author, titled "It's DE-licious: A Recipe for Differential Expression Analyses of RNA-seq Experiments Using Quasi-Likelihood Methods in edgeR" (https://link.springer.com/protocol/10.1007%2F978-1-4939-3578-9_19).

Specifically, section 3.4 of that pipeline says:

"Smaller CPM thresholds are usually appropriate for larger libraries. As a general rule, a good threshold can be chosen by identifying the CPM that corresponds to a count of 10, which in this case is about 0.5: "

cpm(10, mean(y$samples$lib.size))
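The arithmetic behind that rule can be written out in base R. The library sizes are hypothetical and chosen only so that the average is about 21 million reads; the variable names are mine.

```r
# Hypothetical library sizes (total reads per sample).
lib.size <- c(18e6, 21e6, 24e6)

# CPM that a raw count of 10 corresponds to in a library of average size.
min.count <- 10
cpm.threshold <- min.count / mean(lib.size) * 1e6
print(round(cpm.threshold, 2))

# The same count maps to a smaller CPM threshold in larger libraries.
thresholds <- min.count / c(small = 10e6, large = 40e6) * 1e6
print(thresholds)
```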

A) Could this type of filtering also be applied in the scenario I have described for the evaluation of the gene signature, in the following context?

y <- DGEList(counts = lihc.exp) # original annotated raw counts

cpm.filter <- as.numeric(cpm(10, mean(y$samples$lib.size))) # cpm() returns a 1x1 matrix here, so convert it to a scalar

expressed <- rowSums(cpm(y) > cpm.filter) >= N/2 # where N is the total number of samples

y2 <- y[expressed, , keep.lib.sizes=FALSE]

y2 <- calcNormFactors(y2,method="TMM")

logCPM.counts <- cpm(y2, prior.count=5, log=TRUE)....

B) If my above approach is valid, in order for the filter to generalize to other datasets in an unsupervised way, should I also reduce the count used in

cpm.filter <- cpm(10, mean(y$samples$lib.size)) ?

and use something lower, such as 5 instead of 10? My idea is to do a basic filtering of unexpressed genes, in order to improve normalization and transformation, and then subset to the gene signature of interest, as described above.

Efstathios

1

This has nothing to do with the original question. Make a new post.

Ok Aaron gotcha

0
6 months ago by
EMBL European Molecular Biology Laboratory
Wolfgang Huber13k wrote:

Interesting thread and good discussion. I'd like to make two additional remarks:

1.) Aaron said (emphases from me) "The log-transformation provides some measure of variance stabilisation for count data, with the added bonus that differences between log-values directly represent log-fold changes (which is what we're usually interested in anyway)." But:

• log(n_1 + c) - log(n_2 + c), where c is the prior count, is not a log fold change between the counts n_1 and n_2.
• Approximately, and in particular for large counts, it is true that c becomes negligible and this quantity approaches log(n_1) - log(n_2) = log (n_1 / n_2). However, that is equally true for DESeq2's VST, since this transformation is approximately the same as the logarithm (base 2) for large counts.

Thus, differences between log-transformed values after prior count addition are as good or bad to interpret as differences between variance stabilization transformed values.
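The relation between the two quantities can be written out explicitly. This is just the algebra behind the two bullet points, using the same symbols n_1, n_2 and c, with base-2 logarithms assumed:

```latex
\log_2(n_1 + c) - \log_2(n_2 + c)
  = \log_2\frac{n_1}{n_2}
  + \log_2\frac{1 + c/n_1}{1 + c/n_2},
\qquad n_1, n_2 > 0 .
```

The second term vanishes as n_1 and n_2 grow relative to c, and for n_1 > n_2 it is negative, which is the shrinkage of the difference towards zero at low counts.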

2.) There seems to be some anxiety in this thread about the right choice of the offset c. This is exactly what a variance stabilizing transformation (VST) automates, by choosing the parameter so as to stabilize the variance as well as possible.

The VST is a more principled, more explicit alternative to the 'pseudocounts' or 'prior counts' fudge.