Question

featurecounts Rsubread ERROR on exon_id

0

Entering edit mode

edejong • 0

@edejong-20404

Last seen 5.1 years ago

Dear,

Currently, I would like to calculate exon counts using package Rsubread 1.32.0, function featurecounts. Inputs:

Bam file based on paired end RNA sequencing.
Annotation file in GTF format, downloaded from GenCode v27.

More details about Bam: - Created based on uBam. - Aligned with Star - Unmapped reads are kept in/ added to bam - File is validated multiple times with Picard.ValidateSamFile - Duplicates are flagged - Gatk.SplitNCigarReads - Gatk.BaseRecalibrator - Gatk.ApplyBQSR

From my point of view, I should be able to run following command:

featureCounts(
  files=inputfile, 
  annot.ext=annotationfile, 
  isGTFAnnotationFile = TRUE,
  GTF.featureType = "exon", 
  GTF.attrType = "exon_id",
  isPairedEnd = TRUE)

In theory, this would result in counts per exon (defined by GTF.featureType; column 3 of gtf file) and mapped to exon_id as metafeature (defined by GTF.attrType; column 9 of gtf file). However, I receive an ERROR:

ERROR: failed to find the gene identifier attribute in the 9th column of the provided GTF file. The specified gene identifier attribute is 'exonid' An example of attributes included in your GTF annotation is 'geneid "IGH.g@"; transcriptid "IGH.t@"; genename "IGH@";' The program has to terminate.

There is a parameter called useMetaFeatures where I can disable to metafeatures. Provided annotation as part of the results is geneid and not exonid, so I have to provide a metafeature to specify exon_id.

I also tried the commandline package version, namely subread and had the same problem for subread v1.6.4. It is working with subread v1.5.2 with the exact same inputs and parameters as specified above. I was wondering whether the functionality has changed on purpose or that this is a bug?

Thanks in advance,

Ellen de Jong.

> sessionInfo()
R version 3.5.1 (2018-07-02)
Platform: x86_64-pc-linux-gnu (64-bit)
Running under: CentOS Linux 7 (Core)

Matrix products: default
BLAS: /hpc/local/CentOS7/common/lang/R/3.5.1/lib64/R/lib/libRblas.so
LAPACK: /hpc/local/CentOS7/common/lang/R/3.5.1/lib64/R/lib/libRlapack.so

locale:
 [1] LC_CTYPE=en_US.UTF-8       LC_NUMERIC=C              
 [3] LC_TIME=en_US.UTF-8        LC_COLLATE=en_US.UTF-8    
 [5] LC_MONETARY=en_US.UTF-8    LC_MESSAGES=en_US.UTF-8   
 [7] LC_PAPER=en_US.UTF-8       LC_NAME=C                 
 [9] LC_ADDRESS=C               LC_TELEPHONE=C            
[11] LC_MEASUREMENT=en_US.UTF-8 LC_IDENTIFICATION=C       

attached base packages:
[1] stats     graphics  grDevices utils     datasets  methods   base     

other attached packages:
[1] edgeR_3.24.3    limma_3.38.3    Rsubread_1.32.0

loaded via a namespace (and not attached):
[1] compiler_3.5.1  tools_3.5.1     Rcpp_1.0.1      grid_3.5.1     
[5] locfit_1.5-9.1  lattice_0.20-38

R error software error Rsubread featurecounts • 1.9k views

ADD COMMENT • link updated 5.1 years ago by Wei Shi ★ 3.6k • written 5.1 years ago by edejong • 0

0

Entering edit mode

It also works in an earlier version of Rsubread:

sessionInfo()
R version 3.4.1 (2017-06-30)
Platform: x86_64-pc-linux-gnu (64-bit)
Running under: CentOS Linux 7 (Core)

Matrix products: default
BLAS: /hpc/local/CentOS7/common/lang/R/3.4.1/lib64/R/lib/libRblas.so
LAPACK: /hpc/local/CentOS7/common/lang/R/3.4.1/lib64/R/lib/libRlapack.so

locale:
 [1] LC_CTYPE=en_US.UTF-8       LC_NUMERIC=C               LC_TIME=en_US.UTF-8       
 [4] LC_COLLATE=en_US.UTF-8     LC_MONETARY=en_US.UTF-8    LC_MESSAGES=en_US.UTF-8   
 [7] LC_PAPER=en_US.UTF-8       LC_NAME=C                  LC_ADDRESS=C              
[10] LC_TELEPHONE=C             LC_MEASUREMENT=en_US.UTF-8 LC_IDENTIFICATION=C       

attached base packages:
[1] stats     graphics  grDevices utils     datasets  methods   base     

other attached packages:
[1] edgeR_3.20.9    limma_3.34.9    Rsubread_1.28.1

loaded via a namespace (and not attached):
[1] compiler_3.4.1  tools_3.4.1     Rcpp_0.12.14    grid_3.4.1      locfit_1.5-9.1 
[6] lattice_0.20-35

ADD REPLY • link 5.1 years ago edejong • 0

score 1 · Answer 1 · 2019-04-04

1

Entering edit mode

Wei Shi ★ 3.6k

@wei-shi-2183

Last seen 22 days ago

Australia/Melbourne/Olivia Newton-John …

The error message tells you that 'exonid' is not a valid attribute and this resulted in the termination of featureCounts. You should use geneid instead for your exon-level counting. Also you should specify useMetaFeatures=FALSE so that the counting is performed at the feature level (ie exon level). Below is the command that should work for your counting:

featureCounts( files=inputfile, annot.ext=annotationfile, isGTFAnnotationFile = TRUE, GTF.featureType = "exon", GTF.attrType = "geneid", isPairedEnd = TRUE, useMetaFeatures=FALSE)

ADD COMMENT • link 5.1 years ago Wei Shi ★ 3.6k

0

Entering edit mode

Thanks. Probably need to use GTF.attrType.extra to retrieve the exon_ids as part of the annotation.

However, this does not result in counts table with exon_ids. I find it quite dangerous to assume that the order of the count table is in the same order as the annotation table.

ADD REPLY • link 5.1 years ago edejong • 0

0

Entering edit mode

As I mentioned in my reply exon_ids is not an attribute included in your annotation, so you should not use it in any parameters you provided to featureCounts.

ADD REPLY • link 5.1 years ago Wei Shi ★ 3.6k

0

Entering edit mode

Sorry to bother you again, but I am not sure if I understand it completely. When I open my annotation file, I do see 'exon_id' as one of the fields, when 'exon' is stated in column 3. So than I would assume I can use this. And as specified before, it is working that way in version 1.28.1

I would like to receive counts in following format (random example):

ENSE00000000001.1    154314
ENSE00000000002.1     235452
ENSE00000000003.1     35542

description: evidence-based annotation of the human genome (GRCh38), version 27 (Ensembl 90) provider: GENCODE contact: gencode-help@sanger.ac.uk format: gtf date: 2017-08-01

chr1 HAVANA gene 11869 14409 . + . geneid "ENSG00000223972.5"; genetype "transcribedunprocessedpseudogene"; genename "DDX11L1"; level 2; havanagene "OTTHUMG00000000961.2"; chr1 HAVANA transcript 11869 14409 . + . geneid "ENSG00000223972.5"; transcriptid "ENST00000456328.2"; genetype "transcribedunprocessedpseudogene"; genename "DDX11L1"; transcripttype "processedtranscript"; transcriptname "DDX11L1-202"; level 2; transcriptsupportlevel "1"; tag "basic"; havanagene "OTTHUMG00000000961.2"; havanatranscript "OTTHUMT00000362751.1"; chr1 HAVANA exon 11869 12227 . + . geneid "ENSG00000223972.5"; transcriptid "ENST00000456328.2"; genetype "transcribedunprocessedpseudogene"; genename "DDX11L1"; transcripttype "processedtranscript"; transcriptname "DDX11L1-202"; exonnumber 1; *exonid "ENSE00002234944.1"*; level 2; transcriptsupportlevel "1"; tag "basic"; havanagene "OTTHUMG00000000961.2"; havanatranscript "OTTHUMT00000362751.1";

ADD REPLY • link 5.1 years ago edejong • 0

1

Entering edit mode

There is an attribute type exonid in the few annotation lines you provided but not 'exon_id'. featureCounts from any previous versions of Rsubread releases won't work on 'exon_id'.

featureCounts also reported in its error message that the 9th column of the first CDS feature in your annotation contains geneid "IGH.g@"; transcriptid "IGH.t@"; genename "IGH@";, which doesn't include either exonid or 'exon_id'.

ADD REPLY • link 5.1 years ago Wei Shi ★ 3.6k