Hello! I've generated .bam gene count files using HTseq and am now trying to use DESeq2 to conduct differential expression analysis. I have ~90 samples with 4 different sample variables. I don't understand how to fix the error message I'm getting (see below)
This is the code I'm using:
library(DESeq2)
directory <- "/path/to/RNAseq"
sampleFiles <- grep("genecounts",list.files(directory),value=TRUE)
#I made columns for each different variable
sampleMb <- sub("(.*mb|gf).*","\\1",sampleFiles)
sampleRegion <- sub(".*(ileum|colon|jej|duo).*","\\1",sampleFiles)
sampleTreatment <-sub(".*(lr|rg|S).*","\\1",sampleFiles)
sampleTreatment <- sub("S","NA",sampleTreatment)
sampleType <- sub(".*(crypt|p0|p5|organoids).*","\\1",sampleFiles)
sampleTable <- data.frame(sampleName = sampleFiles,
fileName = sampleFiles,
Microbiome = sampleMb,
Region = sampleRegion,
Treatment = sampleTreatment,
Origin = sampleType)
sampleTable$Microbiome <- factor(sampleTable$Microbiome)
sampleTable$Region <- factor(sampleTable$Region)
sampleTable$Treatment <- factor(sampleTable$Treatment)
sampleTable$Origin <- factor(sampleTable$Origin)
#tried to input my files and table into DESeq
ddsHTSeq <- DESeqDataSetFromHTSeqCount(sampleTable = sampleTable,
directory = directory,
design= ~ condition)
ddsHTSeq
I get this error message:
Error in DESeqDataSetFromHTSeqCount(sampleTable = sampleTable, directory = directory, : Gene IDs (first column) differ between files.
Command used to generate the .bam files:
htseq-count -o $newfile -f bam -r pos $file /path/to/<annotationsfile>.gtf
When I run Rsamtools view on one of my samples, I also get an error message:
samtools view trimmed-spf4-jej-crypt_S69_genecounts.bam | head -n 5
[main_samview] fail to read the header from "trimmed-spf4-jej-crypt_S69_genecounts.bam"
Do my files need to have headers? If so, how would I add headers, since HTseq doesn't have an option to do that?
tl;dr Basically, what do I need to do to get DESeqDataSetFromHTSeqCount to work on my files? I don't know the first thing about any of this, so sorry if this is a dumb question!
Session info:
R version 4.0.0 (2020-04-24)
Platform: x86_64-apple-darwin17.0 (64-bit)
Running under: macOS Catalina 10.15.4
Matrix products: default
BLAS: /System/Library/Frameworks/Accelerate.framework/Versions/A/Frameworks/vecLib.framework/Versions/A/libBLAS.dylib
LAPACK: /Library/Frameworks/R.framework/Versions/4.0/Resources/lib/libRlapack.dylib
locale:
[1] en_US.UTF-8/en_US.UTF-8/en_US.UTF-8/C/en_US.UTF-8/en_US.UTF-8
attached base packages:
[1] parallel stats4 stats graphics grDevices utils datasets methods base
other attached packages:
[1] Rsamtools_2.4.0 Biostrings_2.56.0 XVector_0.28.0
[4] magrittr_1.5 DESeq2_1.28.0 SummarizedExperiment_1.18.1
[7] DelayedArray_0.14.0 matrixStats_0.56.0 Biobase_2.48.0
[10] GenomicRanges_1.40.0 GenomeInfoDb_1.24.0 IRanges_2.22.1
[13] S4Vectors_0.26.0 BiocGenerics_0.34.0
loaded via a namespace (and not attached):
[1] Rcpp_1.0.4.6 locfit_1.5-9.4 lattice_0.20-41
[4] digest_0.6.25 R6_2.4.1 RSQLite_2.2.0
[7] evaluate_0.14 ggplot2_3.3.0 pillar_1.4.4
[10] zlibbioc_1.34.0 rlang_0.4.6 rstudioapi_0.11
[13] annotate_1.66.0 blob_1.2.1 Matrix_1.2-18
[16] rmarkdown_2.1 splines_4.0.0 BiocParallel_1.22.0
[19] geneplotter_1.66.0 stringr_1.4.0 RCurl_1.98-1.2
[22] bit_1.1-15.2 munsell_0.5.0 compiler_4.0.0
[25] xfun_0.13 pkgconfig_2.0.3 base64enc_0.1-3
[28] htmltools_0.4.0 tibble_3.0.1 GenomeInfoDbData_1.2.3
[31] XML_3.99-0.3 crayon_1.3.4 bitops_1.0-6
[34] grid_4.0.0 jsonlite_1.6.1 xtable_1.8-4
I used STAR to generate the alignment files (sorted by position and in .bam format) and set the output files from HTseq to also be .bam files. I didn't set the original hyphenated file names, but I'll definitely keep that in mind!
Here are the commands I used to generate the .bam files that I then fed to HTseq (there are 8 input files because each sample was split into 4 lanes and had fwd/rev files from each lane):
.
If it matters, I ran the raw .fastq files through fastp for trimming before running them in STAR.
Thank you so much for your help!
Thanks! Did you actually check the logs to see if STAR finished the alignment? I am also still puzzled why you feel that the purpose of HTseq is to output another BAM file? The main purpose of
htseq-count
is to perform read count abundance and output a text file that contains your transcripts in your GTF and the number of reads overlapping each, broadly speaking.Perhaps the confusion is with the -o option that you are using with
htseq-count
:For what you are aiming to do, all that you need is:
, where Aligned.sortedByCoord.out.bam is the alignment file produced by STAR.
I see my mistake now--I was working from the htseq count documentation page, which said that the usage was: htseq-count [options] <alignment_files> <gff_file> . As someone with no familiarity with coding who has been learning all of this by googling, I didn't realize that > outputfile.txt was what I needed. I thought the -o option was just generically for specifying the output file, and the documentation for -o said it could be either a sam or bam file output. The documentation page for the DESeq command also didn't point me towards what file format DESeqDataSetFromHTSeqCount takes as its input.
Once I ran htseq count according to your suggestion, I was able to get DESeqDataSetFromHTSeqCount to work. Thanks so much!
Great - no problem. I agree that the
-o
option is somewhat confusing, though. In most programs, it is used to specify the primary output file, which, for HTseq, would be the counts file.