Hi,
I am running into issues with loading my GTF annotation file into featureCounts. I have used featureCounts previously on this dual-seq dataset to count reads aligned to a bacterial genome but now I want to examine the reads aligned to the human genome. I downloaded my alignment genome and GTF annotation file at the same time and source from NCBI (GCF_000001405.40) and aligned the reads using hisat2. Consistently, featureCounts is failing with this error:
Failed to open the annotation file ~/20241010.GCF_000001405.40.NCBI.RefSeq.gtf, or its format is incorrect, or it contains no 'transcript' features.
I have tried counting exons, transcripts, and genes using the -t flag. I have tried both with an without the -g flag. I have tried removing the commented out lines from the top of my GTF file and editing the GTF as described in this thread using the following code:
cut -d';' -f1 20241010.GCF_000001405.40.NCBI.RefSeq.gtf.gtf > 20241010.GCF_000001405.40.NCBI.RefSeq.edited.gtf
So far nothing has worked. My script is included below as well as the heads of both the original and edited versions of the GTF.
I have used full file paths throughout the script, I've just removed them here as they contain my username.
#!/bin/bash
#SBATCH --job-name=featurecounts.human
#SBATCH -o ~/slurm.out/featurecounts.human.out
#SBATCH -e ~/slurm.out/featurecounts.human.err
# Number of compute nodes
#SBATCH --nodes=1
# Number of tasks per node
#SBATCH --ntasks-per-node=2
# Number of CPUs per task
#SBATCH --cpus-per-task=4
# Request memory
#SBATCH --mem=8G
# Walltime
#SBATCH --time=48:00:00
# Set array
#SBATCH --array=1-16
# Email notifications (comma-separated options: BEGIN,END,FAIL)
#SBATCH --mail-type=BEGIN,FAIL,END
source /optnfs/common/miniconda3/etc/profile.d/conda.sh
filelist=($(~/coinfection.rnaseq.files.txt))
i=${filelist[${SLURM_ARRAY_TASK_ID}]}
conda activate featurecounts
featureCounts \
-a ~/20241010.GCF_000001405.40.NCBI.RefSeq.edited.gtf \
-o ~/featurecounts.human/"$i" \
~/human.aligned.reads/"$i".aligned.reads.sorted.bam \
-O -t transcript -g gene_id -f -d 50 --verbose
[]$ head -n 7 20241010_GCF_000001405.40_GRCh38.p14.NCBI_RefSeq.gtf
#gtf-version 2.2
#!genome-build GRCh38.p14
#!genome-build-accession NCBI_Assembly:GCF_000001405.40
#!annotation-date 08/23/2024
#!annotation-source NCBI RefSeq GCF_000001405.40-RS_2024_08
NC_000001.11 BestRefSeq gene 11874 14409 . + . gene_id "DDX11L1"; transcript_id ""; db_xref "GeneID:100287102"; db_xref "HGNC:HGNC:37102"; description "DEAD/H-box helicase 11 like 1 (pseudogene)"; gbkey "Gene"; gene "DDX11L1"; gene_biotype "transcribed_pseudogene"; pseudo "true";
NC_000001.11 BestRefSeq transcript 11874 14409 . + . gene_id "DDX11L1"; transcript_id "NR_046018.2"; db_xref "GeneID:100287102"; db_xref "GenBank:NR_046018.2"; db_xref "HGNC:HGNC:37102"; gbkey "misc_RNA"; gene "DDX11L1"; product "DEAD/H-box helicase 11 like 1 (pseudogene)"; pseudo "true"; transcript_biotype "transcript";
[]$ head -n 7 20241010_GCF_000001405.40_GRCh38.p14.NCBI_RefSeq.edited.gtf
#gtf-version 2.2
#!genome-build GRCh38.p14
#!genome-build-accession NCBI_Assembly:GCF_000001405.40
#!annotation-date 08/23/2024
#!annotation-source NCBI RefSeq GCF_000001405.40-RS_2024_08
NC_000001.11 BestRefSeq gene 11874 14409 . + . gene_id "DDX11L1"
NC_000001.11 BestRefSeq transcript 11874 14409 . + . gene_id "DDX11L1"
I'm at a bit of a loss and would appreciate any help.
Hi,
I clearly just needed to step away from the computer screen for the weekend. Your line of code made me realize that I actually just had a typo in my code and it wasn't actually pointing to my reference GTF...corrected the typo and everything is working as expected. My bad. Thanks so much for taking the time to reply!