Hello,
I've been running in to some troubles with my data analysis with DESeq2, and one thing I decided to confirm was whether I was checking the difference in my results when I used an input count table for DESeqDataSetFromMatrix or when I followed the guidelines for DESeqDataSetFromHTSeqCount. In the former example I have ~1000 genes that are significant, whereas in the latter I have only 1. I am surprised considering that in theory I would think this is virtually identical input, although I am guessing it is an error on my part.
The same sample files were used as direct input into DESeqDataSetFromHTSeqCount that were used to generate the count table used in DESeqDataSetFromMatrix. I have also spot checked some of these files, and the numbers within the individual file match what is in the geneCounts data.frame.
I have entered the code below. Any help would be greatly appreciated. Thank you in advance!
directory <- "/ebio//abt3_projects//Christensenella_Project//CM12//CM12_RNA-seq//WAT_v1//HTSeq//20171006_ALL" sampleFiles <- grep("htseq_count",list.files(directory),value=TRUE) sampleTable <- read.table('/ebio//abt3_projects//Christensenella_Project//CM12//CM12_RNA-seq//WAT_v1//CM12_WAT_Design_v11.txt',header=TRUE,sep="\t") sample_table2=data.frame(sampleName = sampleTable$SampleID, fileName = sampleFiles, condition = sampleTable$Condition) ddsHTSeq <- DESeqDataSetFromHTSeqCount(sampleTable = sample_table2, directory = directory, design= ~ condition) contrasts = c('condition', 'Blautia', 'Cminuta') dds <- DESeq(dds) res <- results(dds, contrast=contrasts) summary(res) out of 37999 with nonzero total read count adjusted p-value < 0.1 LFC > 0 (up) : 0, 0% LFC < 0 (down) : 1, 0.0026% outliers [1] : 804, 2.1% low counts [2] : 2, 0.0053% (mean count < 0) [1] see 'cooksCutoff' argument of ?results [2] see 'independentFiltering' argument of ?results
The input to generate the count table for DEseqDataSetfromMatrix is below
Cmin1=read.table("/ebio/abt3_projects/Christensenella_Project/CM12/CM12_RNA-seq/WAT_v1/HTSeq/20171006_ALL/S1_htseq_count_V1.txt",sep="\t",header=FALSE) Cmin2=read.table("/ebio/abt3_projects/Christensenella_Project/CM12/CM12_RNA-seq/WAT_v1/HTSeq/20171006_ALL/S5_htseq_count_V1.txt", sep="\t",header=FALSE) Cmin3=read.table("/ebio/abt3_projects/Christensenella_Project/CM12/CM12_RNA-seq/WAT_v1/HTSeq/20171006_ALL/S10_htseq_count_V1.txt",sep="\t",header=FALSE) Cmin4=read.table("/ebio/abt3_projects/Christensenella_Project/CM12/CM12_RNA-seq/WAT_v1/HTSeq/20171006_ALL/S15_htseq_count_V1.txt",sep="\t",header=FALSE) Cmin5=read.table("/ebio/abt3_projects/Christensenella_Project/CM12/CM12_RNA-seq/WAT_v1/HTSeq/20171006_ALL/S20_htseq_count_V1.txt",sep="\t",header=FALSE) Bhyd1=read.table("/ebio/abt3_projects/Christensenella_Project/CM12/CM12_RNA-seq/WAT_v1/HTSeq/20171006_ALL/S13_htseq_count_V1.txt",sep="\t",header=FALSE) Bhyd2=read.table("/ebio/abt3_projects/Christensenella_Project/CM12/CM12_RNA-seq/WAT_v1/HTSeq/20171006_ALL/S19_htseq_count_V1.txt",sep="\t",header=FALSE) Bhyd3=read.table("/ebio/abt3_projects/Christensenella_Project/CM12/CM12_RNA-seq/WAT_v1/HTSeq/20171006_ALL/S2_htseq_count_V1.txt",sep="\t",header=FALSE) Bhyd4=read.table("/ebio/abt3_projects/Christensenella_Project/CM12/CM12_RNA-seq/WAT_v1/HTSeq/20171006_ALL/S4_htseq_count_V1.txt",sep="\t",header=FALSE) Bhyd5=read.table("/ebio/abt3_projects/Christensenella_Project/CM12/CM12_RNA-seq/WAT_v1/HTSeq/20171006_ALL/S8_htseq_count_V1.txt",sep="\t",header=FALSE) geneCounts=data.frame(Cmin1[,2],Cmin2[,2],Cmin3[,2],Cmin4[,2],Cmin5[,2],Bhyd1[,2],Bhyd2[,2],Bhyd3[,2],Bhyd4[,2],Bhyd5[,2],Bhyd6[,2],Bhyd7[,2],Bhyd8[,2],Bhyd9[,2],Bhyd10[,2],BiAssoc1[,2],BiAssoc2[,2],BiAssoc3[,2],BiAssoc4[,2],BiAssoc5[,2],BiAssoc6[,2],BiAssoc7[,2],BiAssoc8[,2],BiAssoc9[,2],GF1[,2],GF2[,2]) row.names(geneCounts)=Cmin1[,1] sizeGeneCounts=dim(geneCounts) geneCounts = geneCounts[1:(sizeGeneCounts[1]-5),] condition=c(rep("C_minuta",5),rep("B_hydrogenotrophica",10),rep("BiAssociated",9),rep("Germfree",2)) sampleNames=c("Cmin1","Cmin2","Cmin3","Cmin4","Cmin5","Bhyd1","Bhyd2","Bhyd3","Bhyd4","Bhyd5","Bhyd6","Bhyd7","Bhyd8","Bhyd9","Bhyd10","BiAssoc1","BiAssoc2","BiAssoc3","BiAssoc4","BiAssoc5","BiAssoc6","BiAssoc7","BiAssoc8","BiAssoc9","GF1","GF2") colnames(geneCounts) = sampleNames write.table(geneCounts, "geneCounts_20171009_V1.txt", sep="\t",quote=FALSE)
Input for DESeqDataSetFromMatrix
count_file='/ebio//abt3_projects//Christensenella_Project/CM12/CM12_RNA-seq//WAT_v1//HTSeq//20171006_ALL/geneCounts_20171009_V1.txt' meta_file='/ebio//abt3_projects/Christensenella_Project/CM12/CM12_RNA-seq/WAT_v1/CM12_WAT_Design_v11.txt' meta = read.delim(meta_file, sep='\t') counts = read.delim(count_file, sep='\t') # ordering rownames(meta) = meta$SampleID meta = meta[colnames(counts),] dds_old_2 = DESeqDataSetFromMatrix(countData=counts, colData=meta, design=~Condition) contrasts = c('Condition', 'Blautia', 'Cminuta') dds_old_2 <- DESeq(dds_old_2) res_old_2 <- results(dds_old_2, contrast=contrasts) summary(res_old_2) out of 37999 with nonzero total read count adjusted p-value < 0.1 LFC > 0 (up) : 145, 0.38% LFC < 0 (down) : 932, 2.5% outliers [1] : 142, 0.37% low counts [2] : 13714, 36% (mean count < 2) [1] see 'cooksCutoff' argument of ?results [2] see 'independentFiltering' argument of ?results
SessionInfo() R version 3.4.1 (2017-06-30) Platform: x86_64-pc-linux-gnu (64-bit) Running under: Ubuntu 16.04.3 LTS Matrix products: default BLAS: /ebio/abt3_projects/Christensenella_Project/anaconda3/envs/R_DEseq2/lib/R/lib/libRblas.so LAPACK: /ebio/abt3_projects/Christensenella_Project/anaconda3/envs/R_DEseq2/lib/R/lib/libRlapack.so locale: [1] LC_CTYPE=C.UTF-8 LC_NUMERIC=C LC_TIME=C.UTF-8 [4] LC_COLLATE=C.UTF-8 LC_MONETARY=C.UTF-8 LC_MESSAGES=C.UTF-8 [7] LC_PAPER=C.UTF-8 LC_NAME=C LC_ADDRESS=C [10] LC_TELEPHONE=C LC_MEASUREMENT=C.UTF-8 LC_IDENTIFICATION=C attached base packages: [1] parallel stats4 stats graphics grDevices utils datasets [8] methods base other attached packages: [1] BiocParallel_1.12.0 DESeq2_1.18.1 [3] SummarizedExperiment_1.8.0 DelayedArray_0.4.1 [5] matrixStats_0.52.2 Biobase_2.38.0 [7] GenomicRanges_1.30.0 GenomeInfoDb_1.14.0 [9] IRanges_2.12.0 S4Vectors_0.16.0 [11] BiocGenerics_0.24.0 ggplot2_2.2.1 [13] tidyr_0.7.2 dplyr_0.7.4 loaded via a namespace (and not attached): [1] bit64_0.9-7 jsonlite_1.5 splines_3.4.1 [4] Formula_1.2-2 assertthat_0.2.0 latticeExtra_0.6-28 [7] blob_1.1.0 GenomeInfoDbData_0.99.1 RSQLite_2.0 [10] backports_1.1.1 lattice_0.20-35 glue_1.2.0 [13] uuid_0.1-2 digest_0.6.12 RColorBrewer_1.1-2 [16] XVector_0.18.0 checkmate_1.8.5 colorspace_1.3-2 [19] htmltools_0.3.6 Matrix_1.2-12 plyr_1.8.4 [22] XML_3.98-1.9 pkgconfig_2.0.1 genefilter_1.60.0 [25] zlibbioc_1.24.0 purrr_0.2.4 xtable_1.8-2 [28] scales_0.5.0 tibble_1.3.4 htmlTable_1.9 [31] annotate_1.56.1 repr_0.12.0 nnet_7.3-12 [34] lazyeval_0.2.1 survival_2.41-3 magrittr_1.5 [37] crayon_1.3.4 memoise_1.1.0 evaluate_0.10.1 [40] foreign_0.8-69 tools_3.4.1 data.table_1.10.4-3 [43] stringr_1.2.0 locfit_1.5-9.1 munsell_0.4.3 [46] cluster_2.0.6 AnnotationDbi_1.40.0 bindrcpp_0.2 [49] compiler_3.4.1 rlang_0.1.4 grid_3.4.1 [52] RCurl_1.95-4.8 pbdZMQ_0.2-6 IRkernel_0.7.1 [55] htmlwidgets_0.9 bitops_1.0-6 base64enc_0.1-3 [58] gtable_0.2.0 DBI_0.7 R6_2.2.2 [61] gridExtra_2.3 knitr_1.17 bit_1.1-12 [64] bindr_0.1 Hmisc_4.0-3 stringi_1.1.6 [67] IRdisplay_0.4.4 Rcpp_0.12.14 geneplotter_1.56.0 [70] rpart_4.1-11 acepack_1.4.1
Hi Michael,
I ran this, and they are clearly not identical to one another. At first glance the order of the columns is different, as well as some of the values.
After a more thorough look at the input sample information, I can now see that the discrepancy is that in my input sample_table_2 the file name does not match the correct file name, and this is what caused the error.
Apologies for having missed this before posting, and thank you so much for your response!