Robust transformation of raw RNA-seq counts for exploratory data analysis and hierarchical clustering
2
0
Entering edit mode
svlachavas ▴ 840
@svlachavas-7225
Last seen 13 months ago
Germany/Heidelberg/German Cancer Resear…

Dear Community,

i would like to ask a more general question about data transformation methodologies, regarding raw RNA-Seq data for exploratory data analysis, and not for DE expression. In detail, i have identified a small 38 gene signature in a specific type of cancer/TCGA dataset, which shows some interesting results about survival, etc. I want next to test this signature to different types of TCGA datasets, to inspect the pattern of the expression of the genes, and also from clustering to compare the survival estimates of any resulted clusters. For my current downloaded TCGA dataset that i would like to test, i have raw HTSEQ counts for 371 cancer samples. Thus:

1) Which type of transformation or transformations should i follow ? For instance the simple log2(counts +1), which might have a negative impact on very low counts ? Or a more "robust" approach coulbe be implemented ? For example, variance stabilizing transformation from DESeq2 would be "enough ? Or i could alternatively try the cpm transformation from edgeR, although they present some distinct characteristics ?

2) If i would like to use the cpm transformation, i should apply it on the raw counts ? For example:

logCPM.counts <- cpm(dt, prior.count=2, log=TRUE)

dt must be a matrix of raw counts, or should be a DGElist object after TMM normalization in order to account also for sequencing depth ?

Thank you in advance,

Efstathios-Iason

rnaseq analysis variancestabilizingtransformation normalization cpm edger • 3.4k views
ADD COMMENT
2
Entering edit mode
Aaron Lun ★ 28k
@alun
Last seen 1 hour ago
The city by the bay

I don't see a problem with using cpm with log=TRUE and a large prior.count (3-5). The log-transformation provides some measure of variance stabilisation for count data, with the added bonus that differences between log-values directly represent log-fold changes (which is what we're usually interested in anyway). The large prior count shrinks log-differences towards zero to reduce the influence of small counts.

If you can apply cpm on the raw counts, it will automatically use the column sums as the library sizes. If you apply cpm on a DGEList after TMM normalization, it will use the effective library sizes, i.e., the product of the library size and normalization factor. Both approaches account for sequencing depth, but the latter also accounts for composition biases - check out the TMM paper. It doesn't take much effort, so just do it.

ADD COMMENT
0
Entering edit mode

Dear Aaron,

thank you for your valuable comment about the extra benefit of implementing also the TMM normalization for the downstream EDA analyses. I will also read in detail the relative paper. One small comment i would like to add:

for the relative construction of the DGElist object, i could just sapply the raw counts ?

count.dat <- DGEList(counts=x)

as i don't have any important phenotype information, and i want to see only how the cancer samples are clustered and separated based on the signature ?

ADD REPLY
1
Entering edit mode

Yes, that is fine, calcNormFactors and cpm don't care about the groupings.

ADD REPLY
0
Entering edit mode

Aaron, sorry to return again for this matter, but unfortunately i noticed various negative values in genes, not in all samples but with various frequencies:

lihc_filt <- TCGAanalyze_Filtering(tabDF = assay(lihc.exp),
method = "quantile",qnt.cut =  0.25) # a small initial filtering from the R package TCGAbiolinks

y <- DGEList(counts = lihc_filt)

logCPM.counts <- cpm(y, prior.count=2, log=TRUE)
head(logCPM.counts)

        TCGA-DD-AAE4-01A-11R-A41C-07 TCGA-BD-A3EP-01A-11R-A22L-07
TSPAN6                      5.8497451                     6.809613
TNMD                       -3.3093104                    -3.884246
         TCGA-DD-AAW1-01A-11R-A41C-07 TCGA-5R-AAAM-01A-12R-A41C-07
TSPAN6                       8.219309                    6.2252642
TNMD                        -4.035423                   -4.5546560
         TCGA-DD-A4NO-01A-11R-A28V-07 TCGA-G3-A3CK-01A-11R-A213-07
TSPAN6                       7.408457                    5.8443492
TNMD                        -4.009086                   -2.1019193
         TCGA-DD-A4NS-01A-11R-A311-07 TCGA-CC-A7II-01A-11R-A33J-07
TSPAN6                       5.945970                     6.495122
TNMD                        -3.969267                    -4.554656

even when i subsetted my matrix to the needed 38 genes, from the following histogram of the relative log2 cpm values i still get negative values in various genes:

https://www.dropbox.com/s/ckbv30uo93ns7n4/Histogram.38genes.jpeg?dl=0

Thus, in your opinion this could oppose a problem and i should increase the prior.count from above ? for instance making it 5 or 6 ?

Or i should move directly to heatmap creation and clustering, and even with row scaling ?

ADD REPLY
0
Entering edit mode

Why are negative values a problem?

ADD REPLY
0
Entering edit mode

Hello Ryan, i had an initial "naive" thought if with increasing the prior.count argument, could reduce the variability of the logCPM values for genes with low counts, but perhaps this could shrunk the values close to zero ? My concern, was for clustering and heatmap creation, if the negative values could be a problem.

ADD REPLY
1
Entering edit mode

Negative logCPM values are no problem at all. Why would they be?

You can certainly increase the prior.count to say 5, as Aaron has told you above. Any value in the range 2-5 will usually perform perfectly well. Whether you have negative values or not is, however, of no importance at all.

ADD REPLY
0
Entering edit mode

Dear Gordon, thank you for your comment-actually i was confused in the beginning regarding the appropriate interpetation of what logCPM values actually represent-thus, from my understanding-they measure the "overall expression level" of the each transcript-so any log2 cpm value below < 1 would be negative-

overall, i will try to just increase the prior.count to values like 5, as you and Aaron suggested.

ADD REPLY
1
Entering edit mode

You can increase the prior count to 5 if you want, but you might still get negative logCPM values. This can still happen even for large prior counts when one of the library sizes is much smaller than the others. The specific prior count you use is however unlikely to make much difference to your ultimate analysis.

ADD REPLY
0
Entering edit mode

Now got the complete point-thank you for your extra comment.

ADD REPLY
0
Entering edit mode

Dear Aaron,

sorry to return again on this post, but i would like to add a very important question concering a published paper in which you are the author with title : "It's DE-licious: A Recipe for Differential Expression Analyses of RNA-seq Experiments Using Quasi-Likelihood Methods in edgeR" (https://link.springer.com/protocol/10.1007%2F978-1-4939-3578-9_19).

So, in the specific part of the above pipeline, in section 3.4 :

"Smaller CPM thresholds are usually appropriate for larger libraries. As a general rule, a good threshold can be chosen by identifying the CPM that corresponds to a count of 10, which in this case is about 0.5: " 

cpm(10, mean(y$samples$lib.size))

A) This type of filtering, could be also applied in this scenario i have described for the evaluation of the gene signature ? in the following context ? 

y <- DGEList(counts = lihc.exp) # original annotated raw counts

cpm.filter <- cpm(10, mean(y$samples$lib.size))

expressed <- rowSums (cpm(y) > cpm.filter) >=N/2 # where N the total number of samples

y2 <- y[expressed, , keep.lib.sizes=FALSE]

y2 <- calcNormFactors(y2,method="TMM")

logCPM.counts <- cpm(y2, prior.count=5, log=TRUE)....

B) If my above approach is valid, in order for the filter to be more generalized also in other datasets with an unsupervised way, should i also reduce the number of 

cpm.filter <- cpm(10, mean(y$samples$lib.size)) ? 

and use something lower as 5 instead of 10 ? as my notion is to make a basic filtering to unexpressed genes, in order to improve normalization and transformation, and then subset to the gene signature of interest, as described above-

Thank you in advance,

Efstathios

 

ADD REPLY
1
Entering edit mode

This has nothing to do with the original question. Make a new post.

ADD REPLY
0
Entering edit mode

Ok Aaron gotcha

ADD REPLY
0
Entering edit mode
@wolfgang-huber-3550
Last seen 3 months ago
EMBL European Molecular Biology Laborat…

Interesting thread and good discussion. I'd like to make two additional remarks:

1.) Aaron said (emphases from me) "The log-transformation provides some measure of variance stabilisation for count data, with the added bonus that differences between log-values directly represent log-fold changes (which is what we're usually interested in anyway)." But:

  • log(n_1 + c) - log(n_2 + c), where c is the prior count, is not a log fold change between the counts n_1 and n_2.
  • Approximately, and in particular for large counts, it is true that c becomes negligible and this quantity approaches log(n_1) - log(n_2) = log (n_1 / n_2). However, that is equally true for DESeq2's VST, since this transformation is approximately the same as the logarithm (base 2) for large counts.

Thus, differences between log-transformed values after prior count addition are as good or bad to interpret as differences between variance stabilization transformed values.

2.) There seems to be some anxiety in this thread about the right choice of the offset c. This is exactly what a variance stabilizing transformation (VST) automates, by the criterion of choosing the parameter such as to stabilize the variance as well as possible.

The VST is a more principled, more explicit alternative to the 'pseudocounts' or 'prior counts' fudge.

 

ADD COMMENT

Login before adding your answer.

Traffic: 773 users visited in the last hour
Help About
FAQ
Access RSS
API
Stats

Use of this site constitutes acceptance of our User Agreement and Privacy Policy.

Powered by the version 2.3.6