Hi, I am a master student in biomedicine at Lund University in Sweden. I am using edgeR for differential expression of genes. I am a beginner in the use of edgeR.
setwd("C:/Users/Analysis")
library("edgeR")
library(tidyr)
data <- read.csv("C:/Users/Analysis/Data/Case_Control_Phenotype_Features.csv")
group <- as.factor(data$T2D) #group include individuals with T2D (YES) and non-diabetics (NO)
age <-as.numeric(data$Age_exact)
sex <-as.factor(data$Sex)
BMI <-as.numeric(gsub(",",".",data$BMI))
cd8t <- as.numeric(data$CD8T)
cd4t <- as.numeric(data$CD4T)
nk <- as.numeric(data$NK)
bcell <- as.numeric(data$Bcell)
mono <- as.numeric(data$Mono)
gran <- as.numeric(data$Gran)
# gene counts is a csv file obtained from FeatureCounts: first column contains the ENSids (from row 2), column two has in row 1 the sample ID and the rest are the counts for each ENSid.
x <- read.csv("C:/Users/Analysis_CC/Data/case_control_gene_counts.csv") # number of case 91 and number of control 66
head(x)
length(x) #154
length(group) #153
y <- DGEList(counts=x,group=group) # Setting first column of `counts` as gene annotation, I got this message when I run the script. Doing that I have the same lenght 153, for both x and y.
head(y)
# count per millions: like or bigger than 1 and 73 is the 80% of the case
keep <- rowSums(cpm(y)>=1) >= 73
table(keep)
#Filtration
y <- y[keep, , keep.lib.sizes=FALSE]
head(y)
#Recalculate library sizes
y$samples$lib.size <- colSums(y$counts)
y$samples
# Normalization
y <- calcNormFactors(y)
y$samples
# Adjusting the cpm after the recalculation of the library size and the normalization
c<-cpm(y)
write.csv(c,file="C:/Users/Analysis/Results/.../testedgeRtmmoutputN_par_1_73.csv")
#To get logcpm
logcpm <- cpm(y, prior.count=2, log=TRUE)
write.csv(logcpm,file="C:/Users/Analysis/Results/.../logcpm_par.csv")
#Estimate the dispersion to model the biological and technical variability in your data
design <- model.matrix(~group+sex+age+BMI+cd8t+cd4t+nk+bcell+mono+gran)
y <- estimateDisp(y, design)
# Perform exact testing for differential expression (how do I differenziate case from control)
et <- exactTest(y)
# Differential Expression Analysis:
write.table(topTags(et, 65000L), file = "C:/Users/Analysis/Results/outputfile.csv")
results <- read.table("C:/Users/Analysis/Results/outputfile.csv")
results$FDR<-p.adjust(results$PValue,"fdr")
write.table(results$FDR, "C:/Users/Analysis/Results/outputfile_FDR.csv")
I think that what I wrote in the line: keep <- rowSums(cpm(y)>=1) >= 73
is not what I want to do.
I want to filtarte the genes that has at least 1 cpm. I also what to keep 80% of the case (in the group T2D the case have the value YES and are 91 samples ) to use in the test exactTest(y). Control are 66 samples.
My question is: how the case is identified in the group? (with a number like case 1 and control 0?) Are the case are assigned? How can I do in my code, to filter only 80% of the case to compare with the control when I perform the exactTest?
I hope it is clear what I am asking. Best regards Francesca