I am doing RNA-seq analysis of differential gene expression from mare's endometrial biopsies. I do not have any treatments though, and this is the reason I am having a hard time trying to figure out the design I should use as all examples I found so far had treatments + time points.
I have the first time point of 0h which is my control (endometrial biopsy from a mare) then 24h, 48h and 72h. All I want to do is to understand what happens over the time points, without any treatment. I have 8 animals and for each animal I have the 4 timepoints. My metadata has "sample_id" column with my bam files,"timepoint" column and "horse_id" column.
My coding looks like this:
txdb <- makeTxDbFromGFF("Equus_caballus.EquCab2.89.gtf", format="gtf") ebg <- exonsBy(txdb, by="gene") se <- summarizeOverlaps(features=ebg, reads=bamfiles, mode="Union", singleEnd=FALSE, ignore.strand=TRUE, fragments=TRUE ) dds <- DESeqDataSet(se, design = ~ timepoint) dds <- DESeq(dds) res_0hand72h_LFC1 <- results(dds, lfcThreshold = 1, contrast=c("timepoint", "control_0h", "timepoint_72h")) res_0hand24h_LFC1 <- results(dds, lfcThreshold = 1, contrast=c("timepoint", "control_0h", "timepoint_24h")) res_24hand72h_LFC1 <- results(dds, lfcThreshold = 1, contrast=c("timepoint", "timepoint_24h", "timepoint_72h")) res_48hand72h_LFC1 <- results(dds, lfcThreshold = 1, contrast=c("timepoint", "timepoint_48h", "timepoint_72h")) res_24hand48h_LFC1 <- results(dds, lfcThreshold = 1, contrast=c("timepoint", "timepoint_24h", "timepoint_48h")) table(res_0hand72h_LFC1$padj < 0.1) table(res_0hand24h_LFC1$padj < 0.1) table(res_24hand72h_LFC1$padj < 0.1) table(res_48hand72h_LFC1$padj < 0.1) table(res_24hand48h_LFC1$padj < 0.1)
Running the above I got results comparing two time points at a time (false rate discovery set for 0.1% with a >2 fold change). Results look ok to me. E.g. When comparing control (0h) to 72h time point a total of 22542 genes were expressed, 824 of which were shown to be statistically significantly differentially up expressed and 785 shown to be statistically significantly differentially down expressed, both at a false discovery rate (FDR) of 0.1% with a fold change greater or equal to 2.
BUT I wanted to have one result comparing all time points, but I am not sure if that's doable/reasonable?
first doubt about my design:
I am not sure whether I need to add my horse_id to the design, maybe
dds <- DESeqDataSet(se, design = ~ treatment + horse_id)
dds <- DESeqDataSet(se, design = ~ treatment + horse_id + treatment:horse_id)
I do not know whether or not to use the interaction between treatment and horse_id.
second doubt about the time series analysis:
Should I use nbimLRT() function to test the significance of multiple coefficients at once?
ddsLRT <- DESeqDataSet(se, design = ~ treatment*horse_id) design(ddsLRT) <- formula(~ treatment*horse_id) ddsLRT <- estimateSizeFactors(ddsLRT) ddsLRT <- estimateDispersions(ddsLRT) ddsLRT <- nbinomLRT(ddsLRT, reduced = formula(~ treatment)) resLRT <- results(ddsLRT) table(resLRT$padj < 0.1)
and the result was FALSE 22542 genes, meaning that none of the expressed genes were statistically significantly expressed, which is a results different from the pairwise analysis of time points separatedely.
I hope I made myself clear and I really look forward to receiving any help/thoughts on this.