Question

Clustering differentially expressed genes in response to multiple treatments (using edgeR)

0

Entering edit mode

eggrandio • 0

@eggrandio-14403

Last seen 3.6 years ago

United States

Hello,

I am analyzing RNA-seq data of arabidopsis samples in response to three treatments (let's call them WT, RT and DT). These treatments have some components in common, so I want to find genes that respond specifically to the unique components in each treatment.

In summary, I have performed the Differential Expression analysis of the three treatments versus the non-treated sample (NT), and I would like to cluster the DE genes in groups of genes that respond similarly to the three treatments and genes that only respond in one of the treatments. I cannot distinguish clearly these groups of genes, and I wanted to make sure that I am not missing something. I can "find" these genes by doing manual filtering (see at the end), but I would like to make a figure to visualize them.

This is the first time I am conducting such analysis, so I will describe what I have done step by step in case I have made any mistake. You can skip directly to step 7 for the clustering analysis. I have included more detailed questions in bold.

The steps I am not sure about are, z-scoring the RPKM data, and clustering methods.

1- Alignment & feature counts (I used galaxy.org):

First, I ran FastQC and Trimmomatic to remove bad quality reads. Then I run HISAT2 to align the reads and for the sake of simplicity, I ran featurecounts to count the reads assigned to each gene (I have done also the counts per transcript in a separate analysis).

2- Creating DGE object using edgeR:

library(edgeR)
library(org.At.tair.db)
library(tidyverse)
library(zFPKM)
library(ComplexHeatmap)

input="./data/Feature_count.txt"
feature.length="./data/Feature_length.txt"

counts = read.delim(input, row.names = 1) #read output from FeatureCounts
GeneLength = (read.delim(feature.length, row.names = 1)) #read file with gene length of genes to calculate RPKM

#I have five replicates for each treatment:
head(counts[1:5])
           NT1  NT2   NT3  NT4  NT5
AT1G01010   48   51    97   24   55
AT1G01020  204  230   172  180  227
AT1G01030   91   89    59   79  115
AT1G01040 2897 2954  2915 2113 2762
AT1G01050 1011 1030  1140 1018 1137
AT1G01060 8784 9078 10806 8453 9749

#Create the treatment groups (NT - non-treated, W, R and D are the three different treatments):
snames = colnames(counts)
treatment = as.factor(substr(snames, 1, 2)) #remove replicate number
treatment = factor(treatment, levels=c("NT","WT","RT","DT"))

#Create Differential Gene Expression List Object (DGEList) object:
d0 = DGEList(counts = counts,
             genes = data.frame(Length=GeneLength),
             group = treatment)
d0 = calcNormFactors(d0)

3- Filter DGE object to remove genes with low read count:

RPKM_cutoff = 1

#get index of genes below threshold
drop = d0 %>% rpkm %>% rowMeans %>% `<`(RPKM_cutoff) %>% which

#Remove genes with low read counts
d = d0[-drop, ,keep.lib.sizes=FALSE]

d = calcNormFactors(d)

4- MDS plot of samples. WT and RT group together, while DT and NT seem to be more different. DT is the treatment I am most interested in.

5- Compare all samples against control (non treated).

design = model.matrix(~treatment)

d = estimateDisp(d, design, robust=TRUE)

fit = glmQLFit(d, design, robust=TRUE)

qlf = glmQLFTest(fit, coef=2:4) #compares all the samples to the intercept (control)

AllvsCtrl = topTags(qlf, n = Inf) #unfiltered data

6- Select DE genes with FDR<0.05 and log2(Fold change) greater than 1 in at least one comparison:

FDRts = 0.05 #set FDR threshold

Filtered_byFDR = topTags(qlf, n = Inf, p.value = FDRts)
FCts = log2(2) #set log2(Fold change) threshold

tmp = AllvsCtrl$table %>% select(c(logFC.treatmentWT,logFC.treatmentRT,logFC.treatmentDT)) %>%
  as.matrix %>% abs %>% rowMax %>% `>`(FCts) %>% which #Select only genes with abs(logFC)>FCts

Filtered_AllvsCtrl = Filtered_byFDR[tmp,]

head(Filtered_AllvsCtrl$table)
          Length logFC.treatmentWT logFC.treatmentRT logFC.treatmentDT   logCPM        F       PValue
AT1G21240   2422        2.96674871         3.7371969          8.059693 4.035865 576.7402 3.437391e-19
AT3G15536    495        1.33337783         2.7300966          5.819152 2.281665 473.0633 1.508623e-18
AT1G09932   1443        1.88438553         3.3242943          5.707507 3.360015 385.8037 1.093145e-17
AT4G00700   3611        1.13613808         2.6300577          6.300830 4.405263 396.3470 2.928449e-17
AT2G38860   2127       -0.07110813        -0.1787093          3.025977 5.621593 360.9900 2.998005e-17
AT3G13950   1206        1.78454256         2.4608951          6.459634 2.503970 338.3309 3.899023e-17
                   FDR
AT1G21240 5.570980e-15
AT3G15536 1.222513e-14
AT1G09932 5.905536e-14
AT4G00700 9.717734e-14
AT2G38860 9.717734e-14
AT3G13950 1.053191e-13

length(rownames(Filtered_AllvsCtrl$table))
[1] 2705

7- Plot data heatamp to cluster genes by expression pattern (here is where I cannot isolate the gene groups I want): Should I use Z-scored RPKM values? Should I use the log2(FC)?

#Convert RPKM to Z-scores:

RPKMd = as.data.frame(rpkm(d))

colnames(RPKMd) = treatment

Should I average RPKMs between replicates before doing Z-score or after?

RPKMd = sapply(split.default(RPKMd, names(RPKMd)), rowMeans)

zRPKMd = RPKMd %>% as.data.frame %>% zFPKM %>% as.matrix

#Select z-scored RPKM values of genes with DE in at least one treatment:
Hmat = zRPKMd[rownames(zRPKMd) %in% rownames(Filtered_AllvsCtrl$table),]

7a - From what I have been reading, pearson's correlation should be used to cluster genes with a similar expression pattern. When I do so, I see this weird clustering:

Heatmap(Hmat,
        clustering_distance_rows = "pearson",
        show_row_names = F,
        heatmap_legend_param = list(title = "Z-score"),
        column_title = "Treatments",
        row_title = "DE Genes")

This is how the heatmap looks like using the ComplexHeatmap default distance and clustering methods ("euclidean" distance, "complete" clustering):

Heatmap(Hmat,
        show_row_names = F,
        row_split = 15,
        heatmap_legend_param = list(title = "Z-score"),
        column_title = "Treatments",
        row_title = "DE Genes")

Clusters 6 and 8 (from top) are close to what I am looking for: genes with high expression in DT, and low in WT,RT and NT). Cluster 10 seems to have the opposite pattern, alsto interesting for me.

I think I can work with those groups of genes, but the visualization is not as clear as I would like (maybe I should change the color scale?).

I could also find these kind of genes "manually" doing something like this, but I would like to have a figure to show them, as I also want to include further analyses done on these groups in the same plot.

data = Filtered_AllvsCtrl$table

onlyDT = data[which((data$logFC.treatmentWT < 1) &
             (data$logFC.treatmentRT < 1) &
             (data$logFC.treatmentDT > 1)),]

head(onlyDT)
          Length logFC.treatmentWT logFC.treatmentRT logFC.treatmentDT   logCPM        F       PValue
AT2G38860   2127       -0.07110813        -0.1787093          3.025977 5.621593 360.9900 2.998005e-17
AT3G25020   3121        0.37136843         0.4851968          2.618082 4.437666 288.8188 1.796895e-16
AT1G78290   1868       -0.79963552        -0.8454600          2.135336 4.624130 278.2638 2.608339e-16
AT2G37110   1173        0.37447283         0.9296454          1.626979 6.245539 244.4434 8.942511e-16
AT2G14610    902       -0.56183051        -1.0595904          7.413948 4.322275 306.0732 1.075079e-15
AT1G76970   2476        0.07736142         0.8145889          2.161111 4.203167 234.4904 1.332456e-15
                   FDR
AT2G38860 9.717734e-14
AT3G25020 2.426856e-13
AT1G78290 3.019525e-13
AT2G37110 6.587785e-13
AT2G14610 7.575569e-13
AT1G76970 8.305812e-13

Any help with the analysis workflow or visualization would be welcome!

Thanks!

edger complexheatmap R clustering • 3.6k views

ADD COMMENT • link updated 5.7 years ago by Gordon Smyth 53k • written 5.7 years ago by eggrandio • 0

1

Entering edit mode

Cross-posted on Biostars: https://www.biostars.org/p/435628/

ADD REPLY • link 5.7 years ago Kevin Blighe ★ 4.0k

score 2 · Answer 1 · 2020-04-30

It is much better to make heatmaps of logCPM values (or logRPKMs, which will give the same plot) rather then z-scored RPKM values. See for example the edgeR workflow example:

https://bioconductor.org/packages/release/workflows/vignettes/RnaSeqGeneEdgeRQL/inst/doc/edgeRQL.html

There are two reasons for this. First, the heatmap values should be on the log-scale so that the same fold-change always comes up in the same color. Second the z-scores for the heatmap should be computed by row, not by column as the zFPKM function is designed to do. Note that most heatmap functions will compute z-scores automatically so you should not need to do that yourself.

For example, to make a heatmap of your top 100 genes:

logCPM <- cpm(d, log=TRUE)
TopGenes <- rownames(Filtered_AllvsCtrl))[1:100]
coolmap( logCPM[TopGenes, ] )