Question

Heatmap only including DE genes?

0

Entering edit mode

Raito92 ▴ 60

@raito92-20399

Last seen 3.5 years ago

Italy

Good morning to everyone, I'm trying to generate heatmap to show the results of DE genes found at the end of my analysis (performed with the workflows RNASeqGeneEdgeRQL and rnaseqgene which rely on edgeR quasi-likelihood and DESEQ2 respectively). In the first case, I'm using the function coolmap, in the second pheatmap, the results look like this:

https://i.ibb.co/yXD6T7w/HeatMap1.png

https://i.ibb.co/MVfhwgb/HeatMap2.png

The heatmap apparently includes ALL transcripts I'm working on, not just the DE (on which I applied a threshold of logFC > 1.5) comparing a single condition.

Maybe I'm missing something.

Is it possible to generate a heatmap including only the restricted list of genes obtained after performing logFC filering? Would it make sense at all or it wouldn't be possible to make comparisons and clusterize genes like in a heatmap? Thanks in advance.

Thanks in advance

heatmap coolmap pheatmap • 4.5k views

ADD COMMENT • link updated 5.7 years ago by Gordon Smyth 53k • written 5.7 years ago by Raito92 ▴ 60

score 1 · Answer 1 · 2020-04-15

1

Entering edit mode

Kevin Blighe ★ 4.0k

@kevin

Last seen 8 days ago

The Cave, 181 Longwood Avenue, Boston, …

Hi,

This is not quite an issue with any Bioconductor package and is more a matter of general coding in R. You simply need to subset your input data matrix to have only those genes that you want to include for the purpose of clustering and heatmap generation. This can be done directly or indirectly in many different ways, for example, via subset(), match(), which(), and others. There are undoubtedly many tutorials and code snippets online about this.

Keep in mind that clustering using the entire unfiltered dataset is generally referred to as 'unsupervised clustering'. On the other hand, when you 'supervise' the clustering by selecting only those genes that are, for example, statistically significantly differentially expressed, it is referred to as 'supervised clustering'.

Kevin

ADD COMMENT • link 5.7 years ago Kevin Blighe ★ 4.0k

0

Entering edit mode

Hello, thanks for your answer!

You simply need to subset your input data matrix to have only those genes that you want to include for the purpose of clustering and heatmap generation

I guessed that, but what confused me is that I would have expected the pipeline itself to work on DE transcripts as a subset and not apparently them all...

This can be done directly or indirectly in many different ways, for example, via subset(), match(), which(), and others. There are undoubtedly many tutorials and code snippets online about this.

Thanks for the suggestion, I'll check them out! Would you suggest a specific tutorial/function in my case?

ADD REPLY • link 5.7 years ago Raito92 ▴ 60

score 1 · Answer 2 · 2020-04-15

1

Entering edit mode

Gordon Smyth 53k

@gordon-smyth

Last seen 13 hours ago

WEHI, Melbourne, Australia

The RNASeqGeneEdgeRQL workflow shows you how to make the heatmap only of DE genes.

Note that coolmap simply plots a matrix of expression values. If you give it 50 rows/genes of data, then it will plot 50 genes. If you give it data for 300 genes, it will plot 300 genes. You have to choose the genes yourself -- the function has no way of sensing which of your genes are significantly DE. If the plot shows different genes to what you expected, then you have input the wrong data to it, i.e., you've made a programming error.

ADD COMMENT • link 5.7 years ago Gordon Smyth 53k

0

Entering edit mode

Hello, thanks for your answer!

I guessed that, even though what I can’t understand is…

If my glmTreat function outputs (while using summary.de), by applying a given threshold, let’s say, 70 up-regulated and 30 down-regulated transcripts (and several thousands of unsigned ones), why does the heatmap function coolmap (which takes in input the object ‘tr’ which contains the result of the function glmTreat) shows more than 100 genes? (I asked to show the first 150... in different heatmaps with 50 transcripts each one).

Thanks in advance if you clarify this...

ADD REPLY • link 5.7 years ago Raito92 ▴ 60

0

Entering edit mode

coolmap does not take tr as input.

I am find your questions hard to understand for a number of reasons:

The heatmaps you show display only a handful of genes and of them appear to be DE. So it appears you have already done what you say you want to do. I don't understand why you would claim that the heatmaps show all genes.
The purpose of this forum is to help you with code usage but you don't show any code, so it is impossible for us to tell you what you have done wrong, if indeed you have done anything wrong.
The RnaSeqGeneEdgeRQL workflow already seems to answer your question in detail. I'm not sure why you don't simply follow the workflow.
If you want help with the a function (like coolmap) then the first thing to do would be to read the function help page (help("coolmap")). The help page would answer any questions you have about input to coolmap and what it displays.

ADD REPLY • link 5.7 years ago Gordon Smyth 53k

0

Entering edit mode

Hello again, I'll try to be clearer this time by providing the exact code I used and answering point-by-point

coolmap does not take tr as input.

Not directly, but it seems to use the object tr to order the result by ... Pvalue? I'm referring to the o <- ....line.

logCPM <- cpm(y, prior.count=2, log=TRUE)
rownames(logCPM) <- y$genes$Symbol
colnames(logCPM) <- paste(y$samples)
o <- order(tr$table$PValue)
logCPM <- logCPM[o[51:100],]
coolmap(logCPM, margins=c(7,7), lhei=c(1,6), lwid=c(1,3))

The heatmaps you show display only a handful of genes and of them appear to be DE. So it appears you have already done what you say you want to do. I don't understand why you would claim that the heatmaps show all genes.

The RnaSeqGeneEdgeRQL workflow already seems to answer your question in detail. I'm not sure why you don't simply follow the workflow.

Maybe it's just me, but there is something that I don't understand well, even though I've tried to follow the workflow.

The purpose of this forum is to help you with code usage but you don't show any code, so it is impossible for us to tell you what you have done wrong, if indeed you have done anything wrong.

As I said, I'll try to be clearer this time! I'm copying here the exact code I used, adapted from the workflow for my analysis, comparing the two conditions Marzo.Carica and Marzo.Scarica and using for now lfc = 1 as a test. The last line shows the number of DE genes found.

B.LvsP <- makeContrasts(Marzo.Carica-Marzo.Scarica, levels=design)
tr <- glmTreat(fit, contrast=B.LvsP, lfc=1)
is.de <- decideTestsDGE(tr, adjust.method="none")
summaryis.de)

Number of DE genes found:

       1*Marzo.Carica -1*Marzo.Scarica
Down                               161
NotSig                           36302
Up                                 106

So, my question is... I should expect my heatmap to show ONLY these 267 differential genes and then give me a sort of error, if I try to show more genes, shouldn't it? This is what I can't figure out, because this doesn't seem to be the case.

I've tried to use the following code to show 50 genes per heatmap (otherwise the names wouldn't be readable) and show heatmaps showing genes from 1 to 300. Don't mind that I chose norm.factors to appear, it's just because the name of the samples are too long and impossible to tell apart and only a temporary change.

logCPM <- cpm(y, prior.count=2, log=TRUE)
rownames(logCPM) <- y$genes$Symbol
colnames(logCPM) <- paste(y$samples$norm.factors)
o <- order(tr$table$PValue)
logCPM <- logCPM[o[1:50],]
coolmap(logCPM, margins=c(7,7), lhei=c(1,6), lwid=c(1,3))

logCPM <- cpm(y, prior.count=2, log=TRUE)
rownames(logCPM) <- y$genes$Symbol
colnames(logCPM) <- paste(y$samples$norm.factors)
o <- order(tr$table$PValue)
logCPM <- logCPM[o[51:100],]
coolmap(logCPM, margins=c(7,7), lhei=c(1,6), lwid=c(1,3))

logCPM <- cpm(y, prior.count=2, log=TRUE)
rownames(logCPM) <- y$genes$Symbol
colnames(logCPM) <- paste(y$samples$norm.factors)
o <- order(tr$table$PValue)
logCPM <- logCPM[o[101:150],]
coolmap(logCPM, margins=c(7,7), lhei=c(1,6), lwid=c(1,3))

logCPM <- cpm(y, prior.count=2, log=TRUE)
rownames(logCPM) <- y$genes$Symbol
colnames(logCPM) <- paste(y$samples$norm.factors)
o <- order(tr$table$PValue)
logCPM <- logCPM[o[151:200],]
coolmap(logCPM, margins=c(7,7), lhei=c(1,6), lwid=c(1,3))

logCPM <- cpm(y, prior.count=2, log=TRUE)
rownames(logCPM) <- y$genes$Symbol
colnames(logCPM) <- paste(y$samples$norm.factors)
o <- order(tr$table$PValue)
logCPM <- logCPM[o[201:250],]
coolmap(logCPM, margins=c(7,7), lhei=c(1,6), lwid=c(1,3))

logCPM <- cpm(y, prior.count=2, log=TRUE)
rownames(logCPM) <- y$genes$Symbol
colnames(logCPM) <- paste(y$samples$norm.factors)
o <- order(tr$table$PValue)
logCPM <- logCPM[o[251:300],]
coolmap(logCPM, margins=c(7,7), lhei=c(1,6), lwid=c(1,3))

I wouldn't have expected the last heatmap to be shown completely, but 50 genes are there even though only 267 DE ones are found by decideTestsDGE... not 300...

Thanks in advance, maybe there is something obvious I'm missing (maybe the heatmap shows all genes but ordered by Pvalue?)

ADD REPLY • link 5.7 years ago Raito92 ▴ 60

1

Entering edit mode

is.de <- decideTestsDGE(tr, adjust.method="none")

You should always use a p-value adjustment method.

--------------------------------------

So, my question is... I should expect my heatmap to show ONLY these 267 differential genes and then give me a sort of error, if I try to show more genes, shouldn't it? This is what I can't figure out, because this doesn't seem to be the case.

No, you should not expect that. You are seeing 267 there because the default threshold values for decideTestsDGE() are p.value=0.05 and lfc=0. However, when you run this following line of code, you are accessing the entire vector of p-values, i.e., those above and below p=0.05:

o <- order(tr$table$PValue)

If you want a heatmap of genes with nominal p<0.05, then start by doing something like this:

subset(tr$table, PValue < 0.05)

..and then proceed from there.

Make sure that you take time to view the values of variables in your workspace so that you can better understand what is happening.

ADD REPLY • link 5.7 years ago Kevin Blighe ★ 4.0k

2

Entering edit mode

Kevin, thanks for replying, but the code given by OP actually does plot only 50 genes. The heatmap code is completely independent of the decideTests call.

OP's claims make no sense to me. Note that he has not claimed to see 267 genes (as you assume in your comment) but rather he claims to see 300 or more. The heatmap picture shown in the question shows about 60 genes, so it doesn't match either the code or the claims.

ADD REPLY • link 5.7 years ago Gordon Smyth 53k

0

Entering edit mode

That's right, I forgot to repost the new heatmaps as obtained from the code.

https://i.ibb.co/W5Lt8N5/1.png

https://i.ibb.co/CQrbKFf/2.png

https://i.ibb.co/cXBx9WB/3.png

https://i.ibb.co/0QcgKpG/4.png

https://i.ibb.co/YTL2r6M/5.png

https://i.ibb.co/BqpwssW/6.png

So apparently the heatmaps won't just show the 267 DE genes found by summaryis.de), but as shown in 6.png (ranging from 251 to 300) there are more. That's what I needed clarifications about ^^' Maybe it just works on the whole gene set and orders them by Pvalue?

ADD REPLY • link 5.7 years ago Raito92 ▴ 60

0

Entering edit mode

It is important to get the initial question correct and write it in such a way such that any answer will fully address the points made [in your question]. When threads continue in this lengthy fashion, information becomes 'diluted' and it makes it more difficult for future users to find what they need. Can you take some time to go over all of your data and code, and read our replies here? Thanks / Grazie mile.

ADD REPLY • link 5.7 years ago Kevin Blighe ★ 4.0k

0

Entering edit mode

is.de <- decideTestsDGE(tr, adjust.method="none")

Thanks, I'll try with BH probably.

However, when you run this following line of code, you are accessing the entire vector of p-values, i.e., those above and below p=0.05:

I see!

If you want a heatmap of genes with nominal p<0.05, then start by doing something like this:

subset(tr$table, PValue < 0.05) ..and then proceed from there.

This is exactly what I wanted to know, I'll try with this different subset and see how it works, thanks!

ADD REPLY • link 5.7 years ago Raito92 ▴ 60