Hi all,
I am a high school student interning in a bioinformatics laboratory this summer, and I'm new to R, RNA-seq, etc. I'm analyzing data from an experiment looking at gene expression changes between two conditions over the course of three timepoints. I attempted to use DESeq2's LRT option in order to identify genes with condition-specific temporal expression changes, but only had one significant result. When another member of my lab used LRT, he received several hundred significant results. I've pasted both of our scripts (and relevant outputs) below -- can anyone see what might have caused such a drastic difference in our results?
For reference, counts.tsv is the gene expression matrix, and coldata.tsv is a matrix with samples as rows, Condition & Time as columns, and individual matrix elements are a sample's condition or collection timepoint. Coldata.tsv is identical to the matrix created using cbind() in the second script.
My script & outputs:
# Data Selection & Prep
cts <- as.matrix(read.csv(file.choose(), sep = "\t", row.names = "gene_id")) # select counts.tsv
coldata <- read.csv(file.choose(), sep = "\t", row.names = "Samples") # select coldata.tsv
Time <- factor(coldata$Time, levels = c("3","7","14"))
Condition <- factor(coldata$Condition, levels = c("A", "B"))
all(rownames(coldata) == colnames(cts))
*[1] TRUE*
# LRT
library(DESeq2)
dds <- DESeqDataSetFromMatrix(countData = cts, colData = coldata, design = ~ Condition + Time + Condition:Time)
*Warning message:*
In DESeqDataSet(se, design = design, ignoreRank) :
some variables in design formula are characters, converting to factors
dds_lrt <- DESeq(dds, test = "LRT", reduced = ~ Condition + Time)
estimating size factors
estimating dispersions
gene-wise dispersion estimates
mean-dispersion relationship
final dispersion estimates
fitting model and testing
# Extracting Results
res_lrt <- results(dds_lrt)
res_lrt <- res_lrt[order(res_lrt$pvalue), ]
res_lrt
log2 fold change (MLE): ConditionB.Time7
LRT p-value: '~ Condition + Time + Condition:Time' vs '~ Condition + Time'
DataFrame with 55291 rows and 6 columns
baseMean log2FoldChange lfcSE stat pvalue padj
<numeric> <numeric> <numeric> <numeric> <numeric> <numeric>
Dhh 305.439 -0.318403 0.220665 28.5176 6.41922e-07 0.0234160
Id1 328.252 0.522730 0.351085 24.5935 4.56646e-06 0.0832876
Pdgfra 256.663 1.174865 0.628269 16.9216 2.11605e-04 1.0000000
Aldh1a7 102.727 -0.110197 0.517990 16.8907 2.14897e-04 1.0000000
Ahrr 655.521 -0.205311 0.287271 16.8163 2.23040e-04 1.0000000
... ... ... ... ... ... ...
Gm21748 0 NA NA NA NA NA
mt-Ty 0 NA NA NA NA NA
mt-Td 0 NA NA NA NA NA
mt-Tk 0 NA NA NA NA NA
mt-Th 0 NA NA NA NA NA
sig_lrt <- subset(res_l, baseMean>20 & padj<0.05)
sig_lrt
baseMean log2FoldChange lfcSE stat pvalue padj
Dhh 305.4386 -0.3184035 0.2206646 28.5176 6.419222e-07 0.02341604
Another lab member's script & outputs:
library(DESeq2)
counts<-read.table("counts.tsv")
design<- cbind(condition=c('A','A','A','A','A','A','A','A','A','B','B','B','B','B','B','B','B','B'), samples=c("A30.R_S2","A43.L_S13","A45.R_S16","A32.R_S4","A33.L_S8","A43.RL_S14","A32.L_S5","A35.L_S11","A46.R_S17","A30.N_S1","A30.RL_S3","A33.RR_S10","A32.RL_S6","A33.RL_S9","A44.R_S15","A32.RR_S7","A35.RL_S12","A46.L_S18"), time=c('3','3','3','7','7','7','14','14','14','3','3','3','7','7','7','14','14','14'))
full_model <- ~condition + time + condition:time
reduced_model <- ~condition + time
dds<-DESeqDataSetFromMatrix(countData=counts,colData=design, design= ~ condition + time + condition:time)
dds_lrt_time <- DESeq(dds, test="LRT", reduced = ~ condition + time)
res<-results(dds_lrt_time)
sig <- subset(res, baseMean>20 & padj < 0.05 )
log2 fold change (MLE): conditionB.time7
LRT p-value: '~ condition + time + condition:time' vs '~ condition + time'
DataFrame with 342 rows and 6 columns
baseMean log2FoldChange lfcSE stat pvalue padj
<numeric> <numeric> <numeric> <numeric> <numeric> <numeric>
Xkr4 44.2752 0.951242 0.521092 27.89740 8.75299e-07 7.09958e-05
Gm19938 32.8723 -2.692142 0.982106 12.29223 2.14178e-03 1.82805e-02
Sntg1 83.0165 0.629027 0.339907 9.57564 8.33058e-03 4.65789e-02
Eya1 40.5863 -0.668471 0.537290 9.47682 8.75254e-03 4.79500e-02
Gm29216 223.6502 -0.650962 0.362693 9.93755 6.95166e-03 4.11904e-02
... ... ... ... ... ... ...
Eif2s3y 509.392 -1.962028 2.566384 18.8918 7.90119e-05 1.79711e-03
Uty 165.199 -2.873323 3.045134 14.5634 6.88029e-04 8.24535e-03
Ddx3y 500.055 -2.019335 2.229747 16.1066 3.18052e-04 4.75764e-03
Gm47283 198.531 -1.234294 0.374524 36.3764 1.26170e-08 3.34519e-06
mt-Rnr2 42367.112 -0.627189 0.283350 11.4852 3.20650e-03 2.40670e-02
Any help on this would be greatly appreciated! The other lab member and I are both new to NGS analyses, so we are uncertain of how to proceed. We hope that some more experienced bioinformaticians might be able to guide us on the right path, as we are also the first people in our lab to perform these types of analyses. Thank you!
Kind regards, Luke Briody
Cross-posted: https://www.biostars.org/p/458040/