Hello all. I am trying to assess differential exon usage using DEXSeq on two datasets - one published in human cells and my own RNA-seq dataset from a mouse cell line. The human cell dataset serves as a positive control to make sure the pipeline is working as expected, with changes in exon usage that I should be able to recapitulate.
My strategy was to use featureCounts (Rsubread) in order to get count data for each exon. I generated the flattened gtf using the python script provided at https://github.com/vivekbhr/SubreadtoDEXSeq for both mouse and human .gtf formatted gene annotations downloaded from ensembl (most compatible with DEXSeq).
I used the following loop script:
module load Python/3.6.4
for X in Homo_sapiens.GRCh38.98.chr Mus_musculus.GRCm38.98.chr
do
python /hpchome/meyerlab/sc486/applications/Subread_to_DEXSeq-master/dexseq_prepare_annotation2.py -f $X.DEXSeq.gtf $X.gtf $X.DEXSeq.gff
done
and got these two "flattened" gtf files as output - one for mouse and one for human.
Mouse - GRCm38.98
sc486@dcc-core-51 /work/sc486/transcriptomes $ head Mus_musculus.GRCm38.98.chr.DEXSeq.gff
1 dexseq_prepare_annotation2.py aggregate_gene 3073253 3074322 . + . gene_id "ENSMUSG00000102693"
1 dexseq_prepare_annotation2.py exonic_part 3073253 3074322 . + . gene_id "ENSMUSG00000102693"; transcripts "ENSMUST00000193812"; exonic_part_number "001"
1 dexseq_prepare_annotation2.py aggregate_gene 3102016 3102125 . + . gene_id "ENSMUSG00000064842"
1 dexseq_prepare_annotation2.py exonic_part 3102016 3102125 . + . gene_id "ENSMUSG00000064842"; transcripts "ENSMUST00000082908"; exonic_part_number "001"
1 dexseq_prepare_annotation2.py aggregate_gene 3205901 3671498 . - . gene_id "ENSMUSG00000051951"
1 dexseq_prepare_annotation2.py exonic_part 3205901 3206522 . - . gene_id "ENSMUSG00000051951"; transcripts "ENSMUST00000162897"; exonic_part_number "001"
1 dexseq_prepare_annotation2.py exonic_part 3206523 3207317 . - . gene_id "ENSMUSG00000051951"; transcripts "ENSMUST00000159265+ENSMUST00000162897"; exonic_part_number "002"
1 dexseq_prepare_annotation2.py exonic_part 3213439 3213608 . - . gene_id "ENSMUSG00000051951"; transcripts "ENSMUST00000159265"; exonic_part_number "003"
1 dexseq_prepare_annotation2.py exonic_part 3213609 3214481 . - . gene_id "ENSMUSG00000051951"; transcripts "ENSMUST00000159265+ENSMUST00000162897"; exonic_part_number "004"
1 dexseq_prepare_annotation2.py exonic_part 3214482 3215632 . - . gene_id "ENSMUSG00000051951"; transcripts "ENSMUST00000159265+ENSMUST00000070533+ENSMUST00000162897"; exonic_part_number "005"
Human - GRCh38.98
sc486@dcc-core-51 /work/sc486/transcriptomes $ head Homo_sapiens.GRCh38.98.chr.DEXSeq.gff
1 dexseq_prepare_annotation2.py aggregate_gene 11869 14409 . + . gene_id "ENSG00000223972"
1 dexseq_prepare_annotation2.py exonic_part 11869 12009 . + . gene_id "ENSG00000223972"; transcripts "ENST00000456328"; exonic_part_number "001"
1 dexseq_prepare_annotation2.py exonic_part 12010 12057 . + . gene_id "ENSG00000223972"; transcripts "ENST00000456328+ENST00000450305"; exonic_part_number "002"
1 dexseq_prepare_annotation2.py exonic_part 12058 12178 . + . gene_id "ENSG00000223972"; transcripts "ENST00000456328"; exonic_part_number "003"
1 dexseq_prepare_annotation2.py exonic_part 12179 12227 . + . gene_id "ENSG00000223972"; transcripts "ENST00000456328+ENST00000450305"; exonic_part_number "004"
1 dexseq_prepare_annotation2.py exonic_part 12613 12697 . + . gene_id "ENSG00000223972"; transcripts "ENST00000456328+ENST00000450305"; exonic_part_number "005"
1 dexseq_prepare_annotation2.py exonic_part 12698 12721 . + . gene_id "ENSG00000223972"; transcripts "ENST00000456328"; exonic_part_number "006"
1 dexseq_prepare_annotation2.py exonic_part 12975 13052 . + . gene_id "ENSG00000223972"; transcripts "ENST00000450305"; exonic_part_number "007"
1 dexseq_prepare_annotation2.py exonic_part 13221 13374 . + . gene_id "ENSG00000223972"; transcripts "ENST00000456328+ENST00000450305"; exonic_part_number "008"
1 dexseq_prepare_annotation2.py exonic_part 13375 13452 . + . gene_id "ENSG00000223972"; transcripts "ENST00000456328"; exonic_part_number "009"
The issue arises when I try to use the featureCount function with these gtfs on their respective datasets. Both datasets were processed from fastq to sam (HISAT2) and then to bam (samtools sort function) in the exact same way (same jobs/scripts just with different indices for mouse/human genomes).
However, only the mouse dataset successfully gives back counts. The error that it spits out for human dataset is as follows:
command for featureCounts on the human dataset:
featureCounts -f -O --fracOverlap 0.1 -p -s 1 -T 2 -F GTF -t 'exonic_part' -a Homo_sapiens.GRCh38.98.chr.DEXSeq.gtf -o /work/sc486/Liu2017-hnRNPG_RNAseq/featurecount.output SRR2661793.sorted.bam SRR2661794.sorted.bam SRR2661795.sorted.bam SRR2661796.sorted.bam SRR2661797.sorted.bam SRR2661798.sorted.bam SRR2661799.sorted.bam SRR2661800.sorted.bam SRR2661801.sorted.bam SRR2661802.sorted.bam SRR2661803.sorted.bam SRR2661804.sorted.bam
outputted error:
========== _____ _ _ ____ _____ ______ _____
===== / ____| | | | _ \| __ \| ____| /\ | __ \
===== | (___ | | | | |_) | |__) | |__ / \ | | | |
==== \___ \| | | | _ <| _ /| __| / /\ \ | | | |
==== ____) | |__| | |_) | | \ \| |____ / ____ \| |__| |
========== |_____/ \____/|____/|_| \_\______/_/ \_\_____/
v1.6.5
//========================== featureCounts setting ===========================\\
|| ||
|| Input files : 12 BAM files ||
|| o SRR2661793.sorted.bam ||
|| o SRR2661794.sorted.bam ||
|| o SRR2661795.sorted.bam ||
|| o SRR2661796.sorted.bam ||
|| o SRR2661797.sorted.bam ||
|| o SRR2661798.sorted.bam ||
|| o SRR2661799.sorted.bam ||
|| o SRR2661800.sorted.bam ||
|| o SRR2661801.sorted.bam ||
|| o SRR2661802.sorted.bam ||
|| o SRR2661803.sorted.bam ||
|| o SRR2661804.sorted.bam ||
|| ||
|| Output file : featurecount.output ||
|| Summary : featurecount.output.summary ||
|| Annotation : Homo_sapiens.GRCh38.98.chr.DEXSeq.gtf (GTF) ||
|| Dir for temp files : /work/sc486/Liu2017-hnRNPG_RNAseq ||
|| ||
|| Threads : 2 ||
|| Level : feature level ||
|| Paired-end : yes ||
|| Multimapping reads : not counted ||
|| Multi-overlapping reads : counted ||
|| Min overlapping bases : 1 ||
|| Min overlapping frac. : 10.0% to reads ||
|| ||
|| Chimeric reads : counted ||
|| Both ends mapped : not required ||
|| ||
\\============================================================================//
//================================= Running ==================================\\
|| ||
|| Load annotation file Homo_sapiens.GRCh38.98.chr.DEXSeq.gtf ... ||
ERROR: failed to find the gene identifier attribute in the 9th column of the provided GTF file.
The specified gene identifier attribute is 'gene_id'
An example of attributes included in your GTF annotation is 'gene_id "ENSG00000179818+ENSG00000244617"; transcripts "ENST00000659305+ENST00000413069+ENST00000664754+ENST00000625788+ENST00000628659+ENST00000630685+ENST00000664014+ENST00000449178+ENST00000628147+ENST00000653289+ENST00000628322+ENST00000596028+ENST00000626515+ENST00000627050+ENST00000626509+ENST00000653606+ENST00000630522+ENST00000419963+ENST00000629485+ENST00000625518+ENST00000629184+ENST00000419542+ENST00000416395+ENST00000596573+ENST00000626842+ENST00000670313+ENST00000664533+ENST00000629605+ENST00000657291+ENST00000630520+ENST00000661703+ENST00000628374+ENST00000626558+ENST00000628667+ENST00000670785+ENST00000658454+ENST00000662425+ENST00000630280+ENST00000653064+ENST00000627327+ENST00000654958+ENST00000625490+ENST00000421255+ENST00000612202+ENST00000458698+ENST00000422515+ENST00000425601+ENST00000614826+ENST00000609543+ENST00000655420+ENST00000435880+ENST00000630377+ENST00000627325+ENST00000625864+ENST00000627189+ENST00000439670+ENST00000628305+ENST00000665838+ENST00000661863+ENST00000420309+ENST00000599673+ENST00000654878+ENST00000667307+ENST00000423402+ENST00000668191+ENST00000594376+ENST00000659586+ENST00000628837+ENST00000669570+ENST00000425333+ENST00000666167+ENST00000600002+ENST00000626609+ENST00000653811+ENST00000458686+ENST00000631107+ENST00000629943+ENST00000429599+ENST00000653596+ENST00000669896+ENST00000452431+ENST00000626162+ENST00000599427+ENST00000629415+ENST00000660404+ENST00000659941+ENST00000657575+ENST00000656326+ENST00000626898+ENST00000630975+ENST00000630746+ENST00000670036+ENST00000629825+ENST00000630214+ENST00000661989+ENST00000626370+ENST00000668711+ENST00000662103+ENST00000629603+ENST00000629836+ENST00000627918+ENST00000628263+ENST00000457076+ENST00000655481+ENST00000654601+ENST00000627155+ENST00000660903+ENST00000667952+ENST00000601396+ENST00000434781+ENST00000629178+ENST00000366234+ENST00000629339+ENST00000601431+ENST00000664419+ENST00000602091+ENST00000669812+ENST00000625625+ENST00000442040+ENST00000665708+ENST00000415222+ENST00000418564+ENST00000444320+ENST00000655096+ENST00000655140+ENST00000437019+ENST00000630759+ENST00000627487+ENST00000629374+ENST00000416506+ENST00000663033+ENST00000597318+ENST00000603347+ENST00000631110+ENST00000655058+ENST00000413791+ENST00000656308+ENST00000665172+ENST00000664958+ENST00000625783+ENST00000664049+ENST00000457770+ENST00000629506+ENST00000657164+ENST00000629084+ENST00000664483+ENST00000626495+ENST00000670786+ENST00000625447+ENST00000629407+ENST00000421843+ENST00000667874+ENST00000654716+ENST00000666074+ENST00000411429+ENST00000654336+ENST00000669583+ENST00000625955+ENST00000659531+ENST00000413436+ENST00000665264+ENST00000630985+ENST00000625215+ENST00000628551+ENST00000664773+ENST00000456161+ENST00000432604+ENST00000662224+ENST00000594548+ENST00000626632+ENST00000629533+ENST00000653114+ENST00000625227+ENST00000670845+ENST00000627074+ENST00000629188+ENST00000656852+ENST00000627398+ENST00000418308+ENST00000626440+EN'
The program has to terminate.
Both gtf's (mouse/human) that were outputted above and used for featureCount definitely appear to have the string 'gene_id' in the 9th column. So now, I am at a loss as to why one works (mouse) and the other does not (human) on two datasets processed the same exact way with two gtf's generated the same way. Anything I might have missed?
Thank you all in advance for your input.
Yang Liao: Has the latest version of
Rsubread
implemented longer lines? I thought the NEWS mentioned this, but I just tried running the exact same workflow as the OP (with the exception that I used GENCODE GTF which is an acceptable gtf file as well forDEXSeq
) with the latest version of Rsubread, and I am still getting the same error stated by the OP