Hello,
I am analyzing RNA-seq data of arabidopsis samples in response to three treatments (let's call them WT, RT and DT). These treatments have some components in common, so I want to find genes that respond specifically to the unique components in each treatment.
In summary, I have performed the Differential Expression analysis of the three treatments versus the non-treated sample (NT), and I would like to cluster the DE genes in groups of genes that respond similarly to the three treatments and genes that only respond in one of the treatments. I cannot distinguish clearly these groups of genes, and I wanted to make sure that I am not missing something. I can "find" these genes by doing manual filtering (see at the end), but I would like to make a figure to visualize them.
This is the first time I am conducting such analysis, so I will describe what I have done step by step in case I have made any mistake. You can skip directly to step 7 for the clustering analysis. I have included more detailed questions in bold.
The steps I am not sure about are, z-scoring the RPKM data, and clustering methods.
1- Alignment & feature counts (I used galaxy.org):
First, I ran FastQC and Trimmomatic to remove bad quality reads. Then I run HISAT2 to align the reads and for the sake of simplicity, I ran featurecounts to count the reads assigned to each gene (I have done also the counts per transcript in a separate analysis).
2- Creating DGE object using edgeR:
library(edgeR)
library(org.At.tair.db)
library(tidyverse)
library(zFPKM)
library(ComplexHeatmap)
input="./data/Feature_count.txt"
feature.length="./data/Feature_length.txt"
counts = read.delim(input, row.names = 1) #read output from FeatureCounts
GeneLength = (read.delim(feature.length, row.names = 1)) #read file with gene length of genes to calculate RPKM
#I have five replicates for each treatment:
head(counts[1:5])
NT1 NT2 NT3 NT4 NT5
AT1G01010 48 51 97 24 55
AT1G01020 204 230 172 180 227
AT1G01030 91 89 59 79 115
AT1G01040 2897 2954 2915 2113 2762
AT1G01050 1011 1030 1140 1018 1137
AT1G01060 8784 9078 10806 8453 9749
#Create the treatment groups (NT - non-treated, W, R and D are the three different treatments):
snames = colnames(counts)
treatment = as.factor(substr(snames, 1, 2)) #remove replicate number
treatment = factor(treatment, levels=c("NT","WT","RT","DT"))
#Create Differential Gene Expression List Object (DGEList) object:
d0 = DGEList(counts = counts,
genes = data.frame(Length=GeneLength),
group = treatment)
d0 = calcNormFactors(d0)
3- Filter DGE object to remove genes with low read count:
RPKM_cutoff = 1
#get index of genes below threshold
drop = d0 %>% rpkm %>% rowMeans %>% `<`(RPKM_cutoff) %>% which
#Remove genes with low read counts
d = d0[-drop, ,keep.lib.sizes=FALSE]
d = calcNormFactors(d)
4- MDS plot of samples. WT and RT group together, while DT and NT seem to be more different. DT is the treatment I am most interested in.
5- Compare all samples against control (non treated).
design = model.matrix(~treatment)
d = estimateDisp(d, design, robust=TRUE)
fit = glmQLFit(d, design, robust=TRUE)
qlf = glmQLFTest(fit, coef=2:4) #compares all the samples to the intercept (control)
AllvsCtrl = topTags(qlf, n = Inf) #unfiltered data
6- Select DE genes with FDR<0.05 and log2(Fold change) greater than 1 in at least one comparison:
FDRts = 0.05 #set FDR threshold
Filtered_byFDR = topTags(qlf, n = Inf, p.value = FDRts)
FCts = log2(2) #set log2(Fold change) threshold
tmp = AllvsCtrl$table %>% select(c(logFC.treatmentWT,logFC.treatmentRT,logFC.treatmentDT)) %>%
as.matrix %>% abs %>% rowMax %>% `>`(FCts) %>% which #Select only genes with abs(logFC)>FCts
Filtered_AllvsCtrl = Filtered_byFDR[tmp,]
head(Filtered_AllvsCtrl$table)
Length logFC.treatmentWT logFC.treatmentRT logFC.treatmentDT logCPM F PValue
AT1G21240 2422 2.96674871 3.7371969 8.059693 4.035865 576.7402 3.437391e-19
AT3G15536 495 1.33337783 2.7300966 5.819152 2.281665 473.0633 1.508623e-18
AT1G09932 1443 1.88438553 3.3242943 5.707507 3.360015 385.8037 1.093145e-17
AT4G00700 3611 1.13613808 2.6300577 6.300830 4.405263 396.3470 2.928449e-17
AT2G38860 2127 -0.07110813 -0.1787093 3.025977 5.621593 360.9900 2.998005e-17
AT3G13950 1206 1.78454256 2.4608951 6.459634 2.503970 338.3309 3.899023e-17
FDR
AT1G21240 5.570980e-15
AT3G15536 1.222513e-14
AT1G09932 5.905536e-14
AT4G00700 9.717734e-14
AT2G38860 9.717734e-14
AT3G13950 1.053191e-13
length(rownames(Filtered_AllvsCtrl$table))
[1] 2705
7- Plot data heatamp to cluster genes by expression pattern (here is where I cannot isolate the gene groups I want): Should I use Z-scored RPKM values? Should I use the log2(FC)?
#Convert RPKM to Z-scores:
RPKMd = as.data.frame(rpkm(d))
colnames(RPKMd) = treatment
Should I average RPKMs between replicates before doing Z-score or after?
RPKMd = sapply(split.default(RPKMd, names(RPKMd)), rowMeans)
zRPKMd = RPKMd %>% as.data.frame %>% zFPKM %>% as.matrix
#Select z-scored RPKM values of genes with DE in at least one treatment:
Hmat = zRPKMd[rownames(zRPKMd) %in% rownames(Filtered_AllvsCtrl$table),]
7a - From what I have been reading, pearson's correlation should be used to cluster genes with a similar expression pattern. When I do so, I see this weird clustering:
Heatmap(Hmat,
clustering_distance_rows = "pearson",
show_row_names = F,
heatmap_legend_param = list(title = "Z-score"),
column_title = "Treatments",
row_title = "DE Genes")
This is how the heatmap looks like using the ComplexHeatmap default distance and clustering methods ("euclidean" distance, "complete" clustering):
Heatmap(Hmat,
show_row_names = F,
row_split = 15,
heatmap_legend_param = list(title = "Z-score"),
column_title = "Treatments",
row_title = "DE Genes")
Clusters 6 and 8 (from top) are close to what I am looking for: genes with high expression in DT, and low in WT,RT and NT). Cluster 10 seems to have the opposite pattern, alsto interesting for me.
I think I can work with those groups of genes, but the visualization is not as clear as I would like (maybe I should change the color scale?).
I could also find these kind of genes "manually" doing something like this, but I would like to have a figure to show them, as I also want to include further analyses done on these groups in the same plot.
data = Filtered_AllvsCtrl$table
onlyDT = data[which((data$logFC.treatmentWT < 1) &
(data$logFC.treatmentRT < 1) &
(data$logFC.treatmentDT > 1)),]
head(onlyDT)
Length logFC.treatmentWT logFC.treatmentRT logFC.treatmentDT logCPM F PValue
AT2G38860 2127 -0.07110813 -0.1787093 3.025977 5.621593 360.9900 2.998005e-17
AT3G25020 3121 0.37136843 0.4851968 2.618082 4.437666 288.8188 1.796895e-16
AT1G78290 1868 -0.79963552 -0.8454600 2.135336 4.624130 278.2638 2.608339e-16
AT2G37110 1173 0.37447283 0.9296454 1.626979 6.245539 244.4434 8.942511e-16
AT2G14610 902 -0.56183051 -1.0595904 7.413948 4.322275 306.0732 1.075079e-15
AT1G76970 2476 0.07736142 0.8145889 2.161111 4.203167 234.4904 1.332456e-15
FDR
AT2G38860 9.717734e-14
AT3G25020 2.426856e-13
AT1G78290 3.019525e-13
AT2G37110 6.587785e-13
AT2G14610 7.575569e-13
AT1G76970 8.305812e-13
Any help with the analysis workflow or visualization would be welcome!
Thanks!
Cross-posted on Biostars: https://www.biostars.org/p/435628/