Problem running featurecounts in Rsubread on GFF file
0
0
Entering edit mode
raf4 ▴ 20
@raf4-8249
Last seen 3 months ago
United States

Dear Bioconductor list:

I have aligned some paired end reads to the human genome assembly
CRCh38.p7 with R subread. I am now trying to count the reads with feature
counts. It is taking over an hour on my mac, and as I recall it should
run faster. This is teh first time that I have used featurecounts with a
GFF file as distinct from the built in annotation, and I am wondering if
that is the problem.

Here is my R script:

b1.res<-featureCounts(files="b1.res.bam",annot.ext="GCA_000001405.22_GRCh38.p7_genomic.gff",isGTFAnnotationFile=TRUE,GTF.featureType="gene",isPairedEnd=TRUE,requireBothEndsMapped=FALSE)
write.table(b1.res$counts,"b1.rescounts.GRCh38.p7.txt",quote=F,sep="\t") My bam file is 25 Gb in size. Here are the first few lines of my GFF file (obtained from the NCBI web site, after the annotation: J01415.2 Genbank region 1 16569 . + . ID=id549;Dbxref=taxon:9606;Is_circular=true;Name=MT;country=United Kingdom: Great Britain;gbkey=Src;genome=mitochondrion;isolation-source=caucasian;mol_type=genomic DNA;note=this is the rCRS;tissue-type=placenta J01415.2 Genbank D_loop 16024 17145 . - . ID=id550;gbkey=D-loop J01415.2 Genbank gene 577 647 . + . ID=gene0;Dbxref=GeneID:4558;Name=TRNF;gbkey=Gene;gene=TRNF;gene_biotype=tRNA;gene_synonym=MTTF Here is are excerpts from my output file: > library(Rsubread) Warning message: package ‘Rsubread’ was built under R version 3.3.3 > b1.res<-featureCounts(files="b1.res.bam",annot.ext="GCA_000001405.22_GRCh38.p7_genomic.gff",isGTFAnnotationFile=TRUE,GTF.featureType="gene",isPairedEnd=TRUE,requireBothEndsMapped=FALSE) ================================\\ || || || Load annotation file GCA_000001405.22_GRCh38.p7_genomic.gff ... || Warning: failed to find the gene identifier attribute in the 9th column of the provided GTF file. The specified gene identifier attribute is 'gene_id' The attributes included in your GTF annotation are 'ID=gene0;Dbxref=GeneID:4558;Name=TRNF;gbkey=Gene;gene=TRNF;gene_biotype=tRNA;gene_synonym=MTTF' || Features : 37 || || Meta-features : 1 || || Chromosomes/contigs : 1 || || || || Process BAM file b1.res.bam... || || Paired-end reads are included. || || Assign fragments (read pairs) to features... || || || || WARNING: reads from the same pair were found not adjacent to each || || other in the input (due to read sorting by location or || || reporting of multi-mapping read pairs). || || || || Read re-ordering is performed. || || || c2b2afmd2:GRCh38 friedman$

My questions:

1. Is it just talking long or is it unable to read the gff file?
2. I originally put in a GTF.attrType="gene_id" option but I got pretty much teh same warning message, and also ran for a long time. Since I figured that I did not have to sum over exons Icould omit it.
3. I used the GFF filecorrsponding to the genome patch I used. If the GFF file is
not in the wirte format is is ok to use teh bulit in hg38 assembly?
the ncbi site says that the patches do not chnage the numbering,

Thanks and best wishes,
Rich

Richard A. Friedman, PhD
Columbia University Medical Center

rsubread featurecounts gff Grch38.p7 vs hg38 • 3.2k views
0
Entering edit mode

an addendum, It ran to completion with teh following output file"

b1.res.bam
NA    22698646

I will now try it with the internal annotation, but please let me know if

the hg38 annotation can be applied to the patch.

THANKS!

Rich

0
Entering edit mode

I believe I found part of the problem. I should  not have used the patched sequence.

ftp://ftp.ncbi.nlm.nih.gov/genomes/genbank/vertebrate_mammalian/Homo_sapiens/all_assembly_versions/GCA_000001405.15_GRCh38/seqs_for_alignment_pipelines.ucsc_ids/ GCA_000001405.15_GRCh38_no_alt_analysis_set.fna.gz

If this is not the right version of the sequence, please let me know.

The other part of the problem is that I should not have used the gff file but

rather should have used the built-in annotation.

Thanks and best wishes,