Question

featureCount error - cannot identify gene_id

0

Entering edit mode

seung.ho.steven.choi • 0

@seunghostevenchoi-22494

Last seen 4.6 years ago

Hello all. I am trying to assess differential exon usage using DEXSeq on two datasets - one published in human cells and my own RNA-seq dataset from a mouse cell line. The human cell dataset serves as a positive control to make sure the pipeline is working as expected, with changes in exon usage that I should be able to recapitulate.

My strategy was to use featureCounts (Rsubread) in order to get count data for each exon. I generated the flattened gtf using the python script provided at https://github.com/vivekbhr/SubreadtoDEXSeq for both mouse and human .gtf formatted gene annotations downloaded from ensembl (most compatible with DEXSeq).

I used the following loop script:

module load Python/3.6.4
for X in Homo_sapiens.GRCh38.98.chr Mus_musculus.GRCm38.98.chr
do
python /hpchome/meyerlab/sc486/applications/Subread_to_DEXSeq-master/dexseq_prepare_annotation2.py -f $X.DEXSeq.gtf $X.gtf $X.DEXSeq.gff
done

and got these two "flattened" gtf files as output - one for mouse and one for human.

Mouse - GRCm38.98

sc486@dcc-core-51  /work/sc486/transcriptomes $ head Mus_musculus.GRCm38.98.chr.DEXSeq.gff
1   dexseq_prepare_annotation2.py   aggregate_gene  3073253 3074322 .   +   .   gene_id "ENSMUSG00000102693"
1   dexseq_prepare_annotation2.py   exonic_part 3073253 3074322 .   +   .   gene_id "ENSMUSG00000102693"; transcripts "ENSMUST00000193812"; exonic_part_number "001"
1   dexseq_prepare_annotation2.py   aggregate_gene  3102016 3102125 .   +   .   gene_id "ENSMUSG00000064842"
1   dexseq_prepare_annotation2.py   exonic_part 3102016 3102125 .   +   .   gene_id "ENSMUSG00000064842"; transcripts "ENSMUST00000082908"; exonic_part_number "001"
1   dexseq_prepare_annotation2.py   aggregate_gene  3205901 3671498 .   -   .   gene_id "ENSMUSG00000051951"
1   dexseq_prepare_annotation2.py   exonic_part 3205901 3206522 .   -   .   gene_id "ENSMUSG00000051951"; transcripts "ENSMUST00000162897"; exonic_part_number "001"
1   dexseq_prepare_annotation2.py   exonic_part 3206523 3207317 .   -   .   gene_id "ENSMUSG00000051951"; transcripts "ENSMUST00000159265+ENSMUST00000162897"; exonic_part_number "002"
1   dexseq_prepare_annotation2.py   exonic_part 3213439 3213608 .   -   .   gene_id "ENSMUSG00000051951"; transcripts "ENSMUST00000159265"; exonic_part_number "003"
1   dexseq_prepare_annotation2.py   exonic_part 3213609 3214481 .   -   .   gene_id "ENSMUSG00000051951"; transcripts "ENSMUST00000159265+ENSMUST00000162897"; exonic_part_number "004"
1   dexseq_prepare_annotation2.py   exonic_part 3214482 3215632 .   -   .   gene_id "ENSMUSG00000051951"; transcripts "ENSMUST00000159265+ENSMUST00000070533+ENSMUST00000162897"; exonic_part_number "005"

Human - GRCh38.98

sc486@dcc-core-51  /work/sc486/transcriptomes $ head Homo_sapiens.GRCh38.98.chr.DEXSeq.gff
1   dexseq_prepare_annotation2.py   aggregate_gene  11869   14409   .   +   .   gene_id "ENSG00000223972"
1   dexseq_prepare_annotation2.py   exonic_part 11869   12009   .   +   .   gene_id "ENSG00000223972"; transcripts "ENST00000456328"; exonic_part_number "001"
1   dexseq_prepare_annotation2.py   exonic_part 12010   12057   .   +   .   gene_id "ENSG00000223972"; transcripts "ENST00000456328+ENST00000450305"; exonic_part_number "002"
1   dexseq_prepare_annotation2.py   exonic_part 12058   12178   .   +   .   gene_id "ENSG00000223972"; transcripts "ENST00000456328"; exonic_part_number "003"
1   dexseq_prepare_annotation2.py   exonic_part 12179   12227   .   +   .   gene_id "ENSG00000223972"; transcripts "ENST00000456328+ENST00000450305"; exonic_part_number "004"
1   dexseq_prepare_annotation2.py   exonic_part 12613   12697   .   +   .   gene_id "ENSG00000223972"; transcripts "ENST00000456328+ENST00000450305"; exonic_part_number "005"
1   dexseq_prepare_annotation2.py   exonic_part 12698   12721   .   +   .   gene_id "ENSG00000223972"; transcripts "ENST00000456328"; exonic_part_number "006"
1   dexseq_prepare_annotation2.py   exonic_part 12975   13052   .   +   .   gene_id "ENSG00000223972"; transcripts "ENST00000450305"; exonic_part_number "007"
1   dexseq_prepare_annotation2.py   exonic_part 13221   13374   .   +   .   gene_id "ENSG00000223972"; transcripts "ENST00000456328+ENST00000450305"; exonic_part_number "008"
1   dexseq_prepare_annotation2.py   exonic_part 13375   13452   .   +   .   gene_id "ENSG00000223972"; transcripts "ENST00000456328"; exonic_part_number "009"

The issue arises when I try to use the featureCount function with these gtfs on their respective datasets. Both datasets were processed from fastq to sam (HISAT2) and then to bam (samtools sort function) in the exact same way (same jobs/scripts just with different indices for mouse/human genomes).

However, only the mouse dataset successfully gives back counts. The error that it spits out for human dataset is as follows:

command for featureCounts on the human dataset:

featureCounts -f -O --fracOverlap 0.1 -p -s 1 -T 2 -F GTF -t 'exonic_part' -a Homo_sapiens.GRCh38.98.chr.DEXSeq.gtf -o /work/sc486/Liu2017-hnRNPG_RNAseq/featurecount.output SRR2661793.sorted.bam SRR2661794.sorted.bam SRR2661795.sorted.bam SRR2661796.sorted.bam SRR2661797.sorted.bam SRR2661798.sorted.bam SRR2661799.sorted.bam SRR2661800.sorted.bam SRR2661801.sorted.bam SRR2661802.sorted.bam SRR2661803.sorted.bam SRR2661804.sorted.bam

outputted error:

        ==========     _____ _    _ ____  _____  ______          _____  
        =====         / ____| |  | |  _ \|  __ \|  ____|   /\   |  __ \ 
          =====      | (___ | |  | | |_) | |__) | |__     /  \  | |  | |
            ====      \___ \| |  | |  _ <|  _  /|  __|   / /\ \ | |  | |
              ====    ____) | |__| | |_) | | \ \| |____ / ____ \| |__| |
        ==========   |_____/ \____/|____/|_|  \_\______/_/    \_\_____/
      v1.6.5

//========================== featureCounts setting ===========================\\
||                                                                            ||
||             Input files : 12 BAM files                                     ||
||                           o SRR2661793.sorted.bam                          ||
||                           o SRR2661794.sorted.bam                          ||
||                           o SRR2661795.sorted.bam                          ||
||                           o SRR2661796.sorted.bam                          ||
||                           o SRR2661797.sorted.bam                          ||
||                           o SRR2661798.sorted.bam                          ||
||                           o SRR2661799.sorted.bam                          ||
||                           o SRR2661800.sorted.bam                          ||
||                           o SRR2661801.sorted.bam                          ||
||                           o SRR2661802.sorted.bam                          ||
||                           o SRR2661803.sorted.bam                          ||
||                           o SRR2661804.sorted.bam                          ||
||                                                                            ||
||             Output file : featurecount.output                              ||
||                 Summary : featurecount.output.summary                      ||
||              Annotation : Homo_sapiens.GRCh38.98.chr.DEXSeq.gtf (GTF)      ||
||      Dir for temp files : /work/sc486/Liu2017-hnRNPG_RNAseq                ||
||                                                                            ||
||                 Threads : 2                                                ||
||                   Level : feature level                                    ||
||              Paired-end : yes                                              ||
||      Multimapping reads : not counted                                      ||
|| Multi-overlapping reads : counted                                          ||
||   Min overlapping bases : 1                                                ||
||   Min overlapping frac. : 10.0% to reads                                   ||
||                                                                            ||
||          Chimeric reads : counted                                          ||
||        Both ends mapped : not required                                     ||
||                                                                            ||
\\============================================================================//

//================================= Running ==================================\\
||                                                                            ||
|| Load annotation file Homo_sapiens.GRCh38.98.chr.DEXSeq.gtf ...             ||

ERROR: failed to find the gene identifier attribute in the 9th column of the provided GTF file.
The specified gene identifier attribute is 'gene_id' 
An example of attributes included in your GTF annotation is 'gene_id "ENSG00000179818+ENSG00000244617"; transcripts "ENST00000659305+ENST00000413069+ENST00000664754+ENST00000625788+ENST00000628659+ENST00000630685+ENST00000664014+ENST00000449178+ENST00000628147+ENST00000653289+ENST00000628322+ENST00000596028+ENST00000626515+ENST00000627050+ENST00000626509+ENST00000653606+ENST00000630522+ENST00000419963+ENST00000629485+ENST00000625518+ENST00000629184+ENST00000419542+ENST00000416395+ENST00000596573+ENST00000626842+ENST00000670313+ENST00000664533+ENST00000629605+ENST00000657291+ENST00000630520+ENST00000661703+ENST00000628374+ENST00000626558+ENST00000628667+ENST00000670785+ENST00000658454+ENST00000662425+ENST00000630280+ENST00000653064+ENST00000627327+ENST00000654958+ENST00000625490+ENST00000421255+ENST00000612202+ENST00000458698+ENST00000422515+ENST00000425601+ENST00000614826+ENST00000609543+ENST00000655420+ENST00000435880+ENST00000630377+ENST00000627325+ENST00000625864+ENST00000627189+ENST00000439670+ENST00000628305+ENST00000665838+ENST00000661863+ENST00000420309+ENST00000599673+ENST00000654878+ENST00000667307+ENST00000423402+ENST00000668191+ENST00000594376+ENST00000659586+ENST00000628837+ENST00000669570+ENST00000425333+ENST00000666167+ENST00000600002+ENST00000626609+ENST00000653811+ENST00000458686+ENST00000631107+ENST00000629943+ENST00000429599+ENST00000653596+ENST00000669896+ENST00000452431+ENST00000626162+ENST00000599427+ENST00000629415+ENST00000660404+ENST00000659941+ENST00000657575+ENST00000656326+ENST00000626898+ENST00000630975+ENST00000630746+ENST00000670036+ENST00000629825+ENST00000630214+ENST00000661989+ENST00000626370+ENST00000668711+ENST00000662103+ENST00000629603+ENST00000629836+ENST00000627918+ENST00000628263+ENST00000457076+ENST00000655481+ENST00000654601+ENST00000627155+ENST00000660903+ENST00000667952+ENST00000601396+ENST00000434781+ENST00000629178+ENST00000366234+ENST00000629339+ENST00000601431+ENST00000664419+ENST00000602091+ENST00000669812+ENST00000625625+ENST00000442040+ENST00000665708+ENST00000415222+ENST00000418564+ENST00000444320+ENST00000655096+ENST00000655140+ENST00000437019+ENST00000630759+ENST00000627487+ENST00000629374+ENST00000416506+ENST00000663033+ENST00000597318+ENST00000603347+ENST00000631110+ENST00000655058+ENST00000413791+ENST00000656308+ENST00000665172+ENST00000664958+ENST00000625783+ENST00000664049+ENST00000457770+ENST00000629506+ENST00000657164+ENST00000629084+ENST00000664483+ENST00000626495+ENST00000670786+ENST00000625447+ENST00000629407+ENST00000421843+ENST00000667874+ENST00000654716+ENST00000666074+ENST00000411429+ENST00000654336+ENST00000669583+ENST00000625955+ENST00000659531+ENST00000413436+ENST00000665264+ENST00000630985+ENST00000625215+ENST00000628551+ENST00000664773+ENST00000456161+ENST00000432604+ENST00000662224+ENST00000594548+ENST00000626632+ENST00000629533+ENST00000653114+ENST00000625227+ENST00000670845+ENST00000627074+ENST00000629188+ENST00000656852+ENST00000627398+ENST00000418308+ENST00000626440+EN' 
The program has to terminate.

Both gtf's (mouse/human) that were outputted above and used for featureCount definitely appear to have the string 'gene_id' in the 9th column. So now, I am at a loss as to why one works (mouse) and the other does not (human) on two datasets processed the same exact way with two gtf's generated the same way. Anything I might have missed?

Thank you all in advance for your input.

dexseq rsubread featureCount software error • 2.7k views

ADD COMMENT • link updated 4.6 years ago by Yang Liao ▴ 440 • written 4.6 years ago by seung.ho.steven.choi • 0

score 2 · Answer 1 · 2019-12-04

2

Entering edit mode

Yang Liao ▴ 440

@yang-liao-6075

Last seen 4 days ago

Australia

It looks like some lines in your GTF file are extremely long -- the length exceeded 3000 bytes because of the very long "transcripts" field. On the other hand, featureCounts assumes that no lines in a GTF or an SAF annotation file can be longer than 3000 bytes; featureCounts treats lines longer than 3000 bytes as format errors, hence reporting an error message like that.

A quick fix is to remove the "transcripts" field from the GTF file :

cat Homo_sapiens.GRCh38.98.chr.DEXSeq.gtf  |  sed 's/; transcripts "[^"]\+"//g' > Homo_sapiens.GRCh38.98.chr.DEXSeq-REDUCED.gtf

, then use "Homo_sapiens.GRCh38.98.chr.DEXSeq-REDUCED.gtf" as the annotation file for running featureCounts. Meanwhile, we will fix this issue in featureCounts.

ADD COMMENT • link 4.6 years ago Yang Liao ▴ 440

0

Entering edit mode

Yang Liao: Has the latest version of Rsubread implemented longer lines? I thought the NEWS mentioned this, but I just tried running the exact same workflow as the OP (with the exception that I used GENCODE GTF which is an acceptable gtf file as well for DEXSeq) with the latest version of Rsubread, and I am still getting the same error stated by the OP

> sessionInfo()
R version 3.6.2 (2019-12-12)
Platform: x86_64-pc-linux-gnu (64-bit)
Running under: Ubuntu 16.04.6 LTS

Matrix products: default
BLAS:   /usr/lib/openblas-base/libblas.so.3
LAPACK: /usr/lib/libopenblasp-r0.2.18.so

Random number generation:
 RNG:     Mersenne-Twister 
 Normal:  Inversion 
 Sample:  Rounding 

locale:
 [1] LC_CTYPE=en_US.UTF-8       LC_NUMERIC=C               LC_TIME=en_US.UTF-8        LC_COLLATE=en_US.UTF-8     LC_MONETARY=en_US.UTF-8    LC_MESSAGES=en_US.UTF-8    LC_PAPER=en_US.UTF-8       LC_NAME=C                 
 [9] LC_ADDRESS=C               LC_TELEPHONE=C             LC_MEASUREMENT=en_US.UTF-8 LC_IDENTIFICATION=C       

attached base packages:
[1] stats     graphics  grDevices utils     datasets  methods   base     

other attached packages:
[1] Rsubread_2.0.1

loaded via a namespace (and not attached):
[1] BiocManager_1.30.10 compiler_3.6.2      tools_3.6.2         yaml_2.2.0

ADD REPLY • link 4.4 years ago lshepard ▴ 40

score 0 · Answer 2 · 2019-12-04

0

Entering edit mode

James W. MacDonald 66k

@james-w-macdonald-5106

Last seen 6 hours ago

United States

The error you show says

An example of attributes included in your GTF annotation is 'gene_id "ENSG00000179818+ENSG00000244617";

Where you have two gene_id values for one gene (it's like two genes that are the same thing only maybe not?). I don't know if that violates the GTF format or not, but it appears that subread doesn't like it. You could check your mouse GTF for any doubled-up gene_id values like that, and if there are none, maybe that's the problem.

It shouldn't take too much awk-foo to fix that. If it's just an issue of Ensembl having two genes that really are just one, and they haven't combined them yet, then randomly removing one gene_id should be fine. And if it's something more complicated than that, well do you really want to bother trying to figure out what's up with this random lncRNA, or do you just want to remove it from your GTF file and get on with things?

ADD COMMENT • link 4.6 years ago James W. MacDonald 66k

0

Entering edit mode

Those two genes overlap a teeny tiny bit, and run in the same direction, which I guess is why the flattening software smooshed them together.

ADD REPLY • link 4.6 years ago swbarnes2 ★ 1.4k

0

Entering edit mode

You can avoid this behavior by setting the flag -r=no in the script dexseqprepareannotation.py. With this parameter, these tiny regions that overlap with two different genes are ignored and the genes are not smooshed together.

ADD REPLY • link 4.6 years ago Alejandro Reyes ★ 1.9k

0

Entering edit mode

You can also try flattenGTF function in Rsubread to flatten exons. flattenGTF with method="chop" flattens exons in the same way as DEXSeq does.

The default flattening method in flattenGTF is merge, which combines overlapping exons in the same gene to form a larger continous exon. This is the method we recommended as it increases the power of downstream statistical analysis.

ADD REPLY • link 4.6 years ago Wei Shi ★ 3.6k