Question

Error when loading annotation featureCounts

Jason • 0

I am trying to load the annotated genome of Arabidopsis thaliana, but I get this weird error that I cannot understand. Where could the problem be?

========== SUBREAD v2.0.1 ==========

//========================== featureCounts setting ===========================\\
||
||             Input files : 18 BAM files
||                           o bulk_trimmedAligned.sortedByCoord.out.bam
||                           o G2_trimmedAligned.sortedByCoord.out.bam
||                           o lepto_1_trimmedAligned.sortedByCoord.out.bam
||                           o lepto_2_trimmedAligned.sortedByCoord.out.bam
||                           o lepto_3_trimmedAligned.sortedByCoord.out.bam
||                           o lepto_4_trimmedAligned.sortedByCoord.out.bam
||                           o lepto_5_trimmedAligned.sortedByCoord.out.bam
||                           o pachy_1_trimmedAligned.sortedByCoord.out.bam
||                           o pachy_2_trimmedAligned.sortedByCoord.out.bam
||                           o pachy_3_trimmedAligned.sortedByCoord.out.bam
||                           o pachy_4_trimmedAligned.sortedByCoord.out.bam
||                           o pachy_5_trimmedAligned.sortedByCoord.out.bam
||                           o somatic_trimmedAligned.sortedByCoord.out.bam
||                           o zygo_1_trimmedAligned.sortedByCoord.out.bam
||                           o zygo_2_trimmedAligned.sortedByCoord.out.bam
||                           o zygo_3_trimmedAligned.sortedByCoord.out.bam
||                           o zygo_4_trimmedAligned.sortedByCoord.out.bam
||                           o zygo_5_trimmedAligned.sortedByCoord.out.bam
||
||             Output file : count_matrix.txt
||                 Summary : count_matrix.txt.summary
||              Annotation : GCF_000001735.4_TAIR10.1_genomic.gtf (GTF)
||      Dir for temp files : /home/chromosome/Desktop/test/feature_counts
||      Assignment details : <input_file>.featureCounts.bam
||                           (Note that files are saved to the output directory)
||
||                 Threads : 4
||                   Level : meta-feature level
||              Paired-end : no
||      Multimapping reads : not counted
|| Multi-overlapping reads : not counted
||   Min overlapping bases : 1
||
\\============================================================================//

//================================= Running ==================================\\
||
|| Load annotation file GCF_000001735.4_TAIR10.1_genomic.gtf ...

ERROR: the 84702-th line in your GTF file is extremely long (longer than 199999 bytes). The program cannot parse this line.

featureCounts Rsubread subread CellBiology • 2.0k views

updated 3.4 years ago by Yang Liao ▴ 440 • written 3.4 years ago by Jason • 0

score 2 · Answer 1 · 2021-03-04

I think it says right there:

ERROR: the 84702-th line in your GTF file is extremely long (longer than 199999 bytes). The program cannot parse this line.

This says that line 84702 is too long for the program to read. I have no idea why a GTF entry would need to be that long, and it probably indicates that there is something wrong with the GTF file you are using. I don't see a GTF at NCBI and Google can't find it for me, so you will probably have to figure it out on your own, unless you can point to where you got it.

score 0 · Answer 2 · 2021-03-04

There is a GCF_000001735.4_TAIR10.1_genomic.gtf.gz from NCBI and, indeed, some of its lines are really long.

$ zcat GCF_000001735.4_TAIR10.1_genomic.gtf.gz|awk '{print length($0)}'|sort -n|tail
370728
370729
456810
456810
456810
457612
457612
457612
457619
457620

So the longest line has about 458k characters.
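To list every line that exceeds the limit, rather than only the ten longest, something like the following sketch may help. The function name long_lines is hypothetical, and the default of 199999 is taken from the featureCounts error message above; the function prints the line number and the length of each offending line.

```shell
# Print "line_number<TAB>length" for every line longer than a limit.
# Usage: long_lines FILE [LIMIT]   (FILE may be "-" to read standard input)
# The default limit of 199999 comes from the featureCounts error message.
long_lines() {
    awk -v limit="${2:-199999}" \
        'length($0) > limit { print NR "\t" length($0) }' "$1"
}

# Example (not run here; assumes the compressed GTF is in the current directory):
#   zcat GCF_000001735.4_TAIR10.1_genomic.gtf.gz | long_lines -
```

If line 84702 appears in the output, that confirms the line featureCounts complained about is one of these oversized annotation records.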

This is because the sources used for inferring the annotations are listed in the GTF file, and sometimes tens of thousands of sources are reported in a single annotation line. This sed command can remove the lists of sources from the GTF file:

$ cat GCF_000001735.4_TAIR10.1_genomic.gtf | sed 's/ inference ".*";//g' > GCF_000001735-shorter.GTF

Then you can use GCF_000001735-shorter.GTF in featureCounts. Meanwhile, the maximum line length will be increased to 1 million bytes in the next release.
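One thing to be aware of: the pattern ".*" in the sed command above is greedy, so it also drops any attributes that come after the inference entry on the same line. That should not matter if featureCounts only needs gene_id, assuming gene_id precedes inference in this file. If you would rather keep the other attributes, a more conservative variant is sketched below. It assumes that inference values never contain an embedded double quote, and the function name strip_inference is hypothetical.

```shell
# Remove only the inference attribute(s) from each GTF line, keeping
# every other attribute. Assumes inference values contain no embedded
# double quotes. Reads the named file(s), or standard input if none given.
strip_inference() {
    sed 's/ inference "[^"]*";//g' "$@"
}

# Example (not run here; file names as in the answer above):
#   strip_inference GCF_000001735.4_TAIR10.1_genomic.gtf > GCF_000001735-shorter.GTF
```

After stripping, it is worth re-checking the maximum line length with the awk pipeline shown earlier to confirm that every line is now under the limit.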