Question

Error with DESeqDataSetFromHTSeqCount

0

Entering edit mode

tjk30 • 0

@tjk30-23510

Last seen 3.9 years ago

Hello! I've generated .bam gene count files using HTseq and am now trying to use DESeq2 to conduct differential expression analysis. I have ~90 samples with 4 different sample variables. I don't understand how to fix the error message I'm getting (see below)

This is the code I'm using:

library(DESeq2)
directory <- "/path/to/RNAseq"
sampleFiles <- grep("genecounts",list.files(directory),value=TRUE)
#I made columns for each different variable
sampleMb <- sub("(.*mb|gf).*","\\1",sampleFiles)
sampleRegion <- sub(".*(ileum|colon|jej|duo).*","\\1",sampleFiles)
sampleTreatment <-sub(".*(lr|rg|S).*","\\1",sampleFiles)
sampleTreatment <- sub("S","NA",sampleTreatment)
sampleType <- sub(".*(crypt|p0|p5|organoids).*","\\1",sampleFiles)
sampleTable <- data.frame(sampleName = sampleFiles,
                          fileName = sampleFiles,
                          Microbiome = sampleMb,
                          Region = sampleRegion,
                          Treatment = sampleTreatment,
                          Origin = sampleType)
sampleTable$Microbiome <- factor(sampleTable$Microbiome)
sampleTable$Region <- factor(sampleTable$Region)
sampleTable$Treatment <- factor(sampleTable$Treatment)
sampleTable$Origin <- factor(sampleTable$Origin)
#tried to input my files and table into DESeq
ddsHTSeq <- DESeqDataSetFromHTSeqCount(sampleTable = sampleTable,
                                       directory = directory,
                                       design= ~ condition)
ddsHTSeq

I get this error message:

Error in DESeqDataSetFromHTSeqCount(sampleTable = sampleTable, directory = directory, : Gene IDs (first column) differ between files.

Command used to generate the .bam files:

htseq-count  -o $newfile -f bam -r pos $file /path/to/<annotationsfile>.gtf

When I run Rsamtools view on one of my samples, I also get an error message:

samtools view trimmed-spf4-jej-crypt_S69_genecounts.bam | head -n 5
[main_samview] fail to read the header from "trimmed-spf4-jej-crypt_S69_genecounts.bam"

Do my files need to have headers? If so, how would I add headers, since HTseq doesn't have an option to do that?

tl;dr Basically, what do I need to do to get DESeqDataSetFromHTSeqCount to work on my files? I don't know the first thing about any of this, so sorry if this is a dumb question!

Session info:

R version 4.0.0 (2020-04-24)
Platform: x86_64-apple-darwin17.0 (64-bit)
Running under: macOS Catalina 10.15.4

Matrix products: default
BLAS:   /System/Library/Frameworks/Accelerate.framework/Versions/A/Frameworks/vecLib.framework/Versions/A/libBLAS.dylib
LAPACK: /Library/Frameworks/R.framework/Versions/4.0/Resources/lib/libRlapack.dylib

locale:
[1] en_US.UTF-8/en_US.UTF-8/en_US.UTF-8/C/en_US.UTF-8/en_US.UTF-8

attached base packages:
[1] parallel  stats4    stats     graphics  grDevices utils     datasets  methods   base     

other attached packages:
 [1] Rsamtools_2.4.0             Biostrings_2.56.0           XVector_0.28.0             
 [4] magrittr_1.5                DESeq2_1.28.0               SummarizedExperiment_1.18.1
 [7] DelayedArray_0.14.0         matrixStats_0.56.0          Biobase_2.48.0             
[10] GenomicRanges_1.40.0        GenomeInfoDb_1.24.0         IRanges_2.22.1             
[13] S4Vectors_0.26.0            BiocGenerics_0.34.0        

loaded via a namespace (and not attached):
 [1] Rcpp_1.0.4.6           locfit_1.5-9.4         lattice_0.20-41       
 [4] digest_0.6.25          R6_2.4.1               RSQLite_2.2.0         
 [7] evaluate_0.14          ggplot2_3.3.0          pillar_1.4.4          
[10] zlibbioc_1.34.0        rlang_0.4.6            rstudioapi_0.11       
[13] annotate_1.66.0        blob_1.2.1             Matrix_1.2-18         
[16] rmarkdown_2.1          splines_4.0.0          BiocParallel_1.22.0   
[19] geneplotter_1.66.0     stringr_1.4.0          RCurl_1.98-1.2        
[22] bit_1.1-15.2           munsell_0.5.0          compiler_4.0.0        
[25] xfun_0.13              pkgconfig_2.0.3        base64enc_0.1-3       
[28] htmltools_0.4.0        tibble_3.0.1           GenomeInfoDbData_1.2.3
[31] XML_3.99-0.3           crayon_1.3.4           bitops_1.0-6          
[34] grid_4.0.0             jsonlite_1.6.1         xtable_1.8-4

DESeqDataSetFromHTSeqCount • 1.0k views

ADD COMMENT • link updated 3.9 years ago by Kevin Blighe ★ 3.9k • written 3.9 years ago by tjk30 • 0

score 0 · Answer 1 · 2020-05-12

0

Entering edit mode

Kevin Blighe ★ 3.9k

@kevin

Last seen 11 days ago

Republic of Ireland

Hey, it seems to be a problem with your BAM files, and not necessarily DESeq2, Rsamtools, or HTseq. How were your BAMs produced?

The HTseq command that you're using (htseq-count -o $newfile -f bam -r pos $file /path/to/<annotationsfile>.gtf) is not creating the BAMs, by the way - it looks at the information in your BAMs and counts reads over the transcripts defined in your input GTF. These BAMs should have headers, and it is not the role of HTseq to add the headers - these should already exist from the other [currently unknown] program that you used to create the BAMs.

Also, I would not use hyphens in filenames - these can cause problems downstream. An underscore is much better.

Kevin

ADD COMMENT • link 3.9 years ago Kevin Blighe ★ 3.9k

0

Entering edit mode

I used STAR to generate the alignment files (sorted by position and in .bam format) and set the output files from HTseq to also be .bam files. I didn't set the original hyphenated file names, but I'll definitely keep that in mind!

Here are the commands I used to generate the .bam files that I then fed to HTseq (there are 8 input files because each sample was split into 4 lanes and had fwd/rev files from each lane):

STAR  --runMode genomeGenerate \
--runThreadN 8 \
--genomeDir /path/to/gencode/mm10_gencode_starindex \
--sjdbGTFfile /path/to/gencode.vM22.annotation.gtf \
--genomeFastaFiles /path/to/gencode/GRCm38.primary_assembly.genome.fa

.

  STAR --genomeDir /path/to/gencode/mm10_gencode_starindex \
      --outSAMtype BAM SortedByCoordinate \
      --runThreadN 8 --readFilesIn f1.fastq,f2.fastq,f3.fastq,f4.fastq r1.fastq,r2.fastq,r3.fastq,r4.fastq \
      --outFileNamePrefix $sample

If it matters, I ran the raw .fastq files through fastp for trimming before running them in STAR.

Thank you so much for your help!

ADD REPLY • link 3.9 years ago tjk30 • 0

0

Entering edit mode

Thanks! Did you actually check the logs to see if STAR finished the alignment? I am also still puzzled why you feel that the purpose of HTseq is to output another BAM file? The main purpose of htseq-count is to perform read count abundance and output a text file that contains your transcripts in your GTF and the number of reads overlapping each, broadly speaking.

Perhaps the confusion is with the -o option that you are using with htseq-count:

-o <samout>, --samout=<samout>

write out all SAM alignment records into an output SAM file called <samout>, annotating each line with its assignment to a feature or a special counter (as an optional field with tag ‘XF’)

For what you are aiming to do, all that you need is:

htseq-count -f bam -r pos \
  Aligned.sortedByCoord.out.bam /path/to/gencode.vM22.annotation.gtf > counts.txt ;

, where Aligned.sortedByCoord.out.bam is the alignment file produced by STAR.

ADD REPLY • link 3.9 years ago Kevin Blighe ★ 3.9k

0

Entering edit mode

I see my mistake now--I was working from the htseq count documentation page, which said that the usage was: htseq-count [options] <alignment_files> <gff_file> . As someone with no familiarity with coding who has been learning all of this by googling, I didn't realize that > outputfile.txt was what I needed. I thought the -o option was just generically for specifying the output file, and the documentation for -o said it could be either a sam or bam file output. The documentation page for the DESeq command also didn't point me towards what file format DESeqDataSetFromHTSeqCount takes as its input.

Once I ran htseq count according to your suggestion, I was able to get DESeqDataSetFromHTSeqCount to work. Thanks so much!

ADD REPLY • link 3.9 years ago tjk30 • 0

0

Entering edit mode

Great - no problem. I agree that the -o option is somewhat confusing, though. In most programs, it is used to specify the primary output file, which, for HTseq, would be the counts file.

ADD REPLY • link 3.9 years ago Kevin Blighe ★ 3.9k