Error in makeTranscriptDbFromGFF
komusica • 0
@komusica-7491
Last seen 9.1 years ago
Norway

Hi 

I am working on zebrafish small RNA data, and to run DESeq2 I need the rowData, which I build with library("GenomicFeatures").

I have tried makeTranscriptDbFromGFF on several files from different sources: a GFF from NCBI, a GFF3 from miRBase, and a GTF that I made by converting the GFF files.

dre<-makeTranscriptDbFromGFF(file="/home/chawla/zebrafish/genome/dre_mirbase.gtf",format="gtf")

dre<-makeTranscriptDbFromGFF(file="/home/chawla/zebrafish/genome/dre.gff3",format="gff3")

dre<-makeTranscriptDbFromGFF(file="/home/chawla/zebrafish/genome/GCF_000002035.4_Zv9_genomic.gff",format="gff")

but each time I get this error:

Error in .parse_attrCol(attrCol, file, colnames) : 
  Some attributes do not conform to 'tag=value' format

 

I have checked for and removed trailing spaces, but that did not help. I went through several posts but didn't find any other solution. I am hoping for more ideas on how to fix it.

 

Thanks

 

 

deseq2 genomicfeatures • 1.4k views

Hi,

This error comes from rtracklayer::import(), which makeTranscriptDbFromGFF() calls internally to load the GFF or GTF data into a GRanges object. I typically get this error on compressed GTF files when I forget to specify format="gtf" because in that case rtracklayer::import() fails to automatically detect the format and tries to load the data as GFF. You can try to run that step manually first to confirm.

Otherwise, it would help if we could access these files so that we can reproduce the problem. Also, please provide the output of sessionInfo().
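To see which records trigger the 'tag=value' complaint, the attribute column (column 9) can also be inspected directly. The following is a minimal base-R sketch, not rtracklayer's actual parser: find_bad_gff3_attrs() is a hypothetical helper, and the two-record demo file is made up for illustration, with the second record deliberately malformed.

```r
# Hypothetical helper (not part of rtracklayer or GenomicFeatures):
# return the line numbers of GFF3 records whose attribute column
# contains pieces that are not in 'tag=value' form.
find_bad_gff3_attrs <- function(path) {
    lines <- readLines(path)
    keep <- !grepl("^#", lines) & nzchar(lines)    # skip comments and blank lines
    fields <- strsplit(lines[keep], "\t", fixed = TRUE)
    bad <- vapply(fields, function(f) {
        if (length(f) < 9L) return(TRUE)           # fewer than 9 columns
        pieces <- trimws(strsplit(f[9], ";", fixed = TRUE)[[1]])
        pieces <- pieces[nzchar(pieces) & pieces != "."]
        !all(grepl("^[^=]+=", pieces))             # every piece must look like tag=value
    }, logical(1))
    which(keep)[bad]                               # line numbers in the original file
}

# Toy file for illustration only; the second record uses 'ID value' on purpose.
demo <- tempfile(fileext = ".gff3")
writeLines(c(
    "##gff-version 3",
    "1\t.\tmiRNA_primary_transcript\t451141\t451218\t.\t+\t.\tID=MI0002023",
    "1\t.\tmiRNA\t451151\t451172\t.\t+\t.\tID MIMAT0001851"
), demo)
bad_lines <- find_bad_gff3_attrs(demo)
print(bad_lines)
```

If this returns no line numbers for a real file, the attribute syntax itself is probably not the issue and the cause lies elsewhere, for example in the format passed to the import step.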

Thanks,

H.

komusica • 0
@komusica-7491
Last seen 9.1 years ago
Norway

Hi Hervé,

Thanks for your response.

I tried rtracklayer, which worked, but makeTranscriptDbFromGFF didn't. Please correct me if I used it wrong.

library(rtracklayer)
file ="/home/chawla/zebrafish/genome/dre.gff3"
test<-import(file)
export(test,"/home/chawla/zebrafish/genome/dre.gtf","gtf")

dre<-makeTranscriptDbFromGFF(file="/home/chawla/zebrafish/genome/dre.gtf",format="gtf")
Error in .parse_attrCol(attrCol, file, colnames) : 
  Some attributes do not conform to 'tag value' format

dre<-makeTranscriptDbFromGFF(file="/home/chawla/zebrafish/genome/dre.gff3",format="gff3")

Error in .parse_attrCol(attrCol, file, colnames) : 
  Some attributes do not conform to 'tag=value' format

> sessionInfo()
R version 3.1.3 (2015-03-09)
Platform: x86_64-pc-linux-gnu (64-bit)
Running under: Ubuntu precise (12.04.5 LTS)

locale:
 [1] LC_CTYPE=en_US.UTF-8       LC_NUMERIC=C              
 [3] LC_TIME=en_US.UTF-8        LC_COLLATE=en_US.UTF-8    
 [5] LC_MONETARY=en_US.UTF-8    LC_MESSAGES=en_US.UTF-8   
 [7] LC_PAPER=en_US.UTF-8       LC_NAME=C                 
 [9] LC_ADDRESS=C               LC_TELEPHONE=C            
[11] LC_MEASUREMENT=en_US.UTF-8 LC_IDENTIFICATION=C       

attached base packages:
[1] parallel  stats     graphics  grDevices utils     datasets  methods  
[8] base     

other attached packages:
 [1] rtracklayer_1.22.7        GenomicFeatures_1.14.5   
 [3] AnnotationDbi_1.24.0      Biobase_2.22.0           
 [5] DESeq2_1.2.10             RcppArmadillo_0.4.650.1.1
 [7] Rcpp_0.11.5               GenomicRanges_1.14.4     
 [9] XVector_0.2.0             IRanges_1.20.7           
[11] BiocGenerics_0.8.0       

loaded via a namespace (and not attached):
 [1] annotate_1.40.1    biomaRt_2.18.0     Biostrings_2.30.1  bitops_1.0-6      
 [5] BSgenome_1.30.0    DBI_0.3.1          genefilter_1.44.0  grid_3.1.3        
 [9] lattice_0.20-30    locfit_1.5-9.1     RColorBrewer_1.1-2 RCurl_1.95-4.5    
[13] Rsamtools_1.14.3   RSQLite_0.11.4     splines_3.1.3      stats4_3.1.3      
[17] survival_2.38-1    tools_3.1.3        XML_3.98-1.1       xtable_1.7-4      
[21] zlibbioc_1.8.0 

 

Here is the link to the file I am using, so you can reproduce the error: ftp://mirbase.org/pub/mirbase/CURRENT/genomes/dre.gff3

Thanks

@herve-pages-1542
Last seen 8 hours ago
Seattle, WA, United States

Hi,

This file contains only lines of type miRNA_primary_transcript or miRNA, and it has no Parent attribute so it's not really suited for being stored in a TxDb object. Note that one of the primary purposes of a TxDb object is to describe the hierarchical relationship between genes, transcripts, exons, and CDS. AFAIK microRNAs don't have exons and are not linked to genes so they don't really fit the TxDb model. Maybe they don't really fit the DESeq2 (and summarizeOverlaps()) models either because these tools typically count reads using a GRangesList object that represents the exons grouped by transcript or by gene. Of course you can always force your microRNAs into that model by considering that each miRNA is made of 1 exon:

library(rtracklayer)
gr <- import("dre.gff3")

library(GenomicFeatures)
transcripts <- data.frame(
    tx_id=seq_along(gr),
    tx_name=mcols(gr)$ID,
    tx_type=mcols(gr)$type,
    tx_chrom=as.factor(seqnames(gr)),
    tx_start=start(gr),
    tx_end=end(gr),
    tx_strand=as.factor(strand(gr))
)
splicings <- data.frame(
    tx_id=transcripts$tx_id,
    exon_rank=1,
    exon_start=transcripts$tx_start,
    exon_end=transcripts$tx_end
)
txdb <- makeTxDb(transcripts, splicings, reassign.ids=TRUE)

Note that the above code is for the current devel version of BioC (3.1). If you use the current release (BioC 3.0), you need to use makeTranscriptDb() instead of makeTxDb(), and the tx_type column will be ignored.

Then you can extract the exons grouped by transcript in the usual way:

exonsBy(txdb, by="tx", use.names=TRUE)
# GRangesList object of length 888:
# $MI0002023 
# GRanges object with 1 range and 3 metadata columns:
#       seqnames           ranges strand |   exon_id   exon_name exon_rank
#          <Rle>        <IRanges>  <Rle> | <integer> <character> <integer>
#   [1]        1 [451141, 451218]      + |         1        <NA>         1
#
# $MIMAT0001851 
# GRanges object with 1 range and 3 metadata columns:
#       seqnames           ranges strand | exon_id exon_name exon_rank
#   [1]        1 [451151, 451172]      + |       2      <NA>         1
#
# $MI0002180 
# GRanges object with 1 range and 3 metadata columns:
#       seqnames             ranges strand | exon_id exon_name exon_rank
#   [1]        1 [1275348, 1275428]      + |       3      <NA>         1
#
# ...
# <885 more elements>
# -------
# seqinfo: 28 sequences from an unspecified genome; no seqlengths

But note that this GRangesList object is not much different from the original GRanges object gr.

I would suggest that you consult with the DESeq2 authors first to check that this is a valid approach. If so, and if having the microRNAs in a TxDb makes it easier to do DE analysis, then maybe we should modify makeTxDbFromGFF() to support this kind of GFF3 file with only microRNAs.

Anyway for now I applied a patch to rtracklayer and GenomicFeatures in BioC release to make makeTranscriptDbFromGFF() fail more gracefully (and with a more informative error message) on this kind of GFF3 file.
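A quick way to tell in advance whether a GFF3 file is of this kind is to tabulate its feature types and look for a Parent attribute. The following is a rough base-R sketch under the assumption of a plain tab-delimited GFF3 file; summarize_gff3() is a hypothetical helper and not part of rtracklayer or GenomicFeatures, and the demo file is a made-up two-record example.

```r
# Hypothetical helper: count feature types (column 3) and report whether
# any record carries a Parent attribute (column 9).
summarize_gff3 <- function(path) {
    lines <- readLines(path)
    lines <- lines[!grepl("^#", lines) & nzchar(lines)]   # drop comments and blank lines
    fields <- strsplit(lines, "\t", fixed = TRUE)
    fields <- fields[lengths(fields) >= 9L]               # keep complete records only
    types <- vapply(fields, `[`, character(1), 3L)
    attrs <- vapply(fields, `[`, character(1), 9L)
    list(
        type_counts = table(types),
        has_parent  = any(grepl("(^|;)[[:space:]]*Parent=", attrs))
    )
}

# Toy file for illustration only: two miRNA records without Parent.
demo <- tempfile(fileext = ".gff3")
writeLines(c(
    "##gff-version 3",
    "1\t.\tmiRNA_primary_transcript\t451141\t451218\t.\t+\t.\tID=MI0002023",
    "1\t.\tmiRNA\t451151\t451172\t.\t+\t.\tID=MIMAT0001851"
), demo)
info <- summarize_gff3(demo)
print(info$type_counts)
print(info$has_parent)
```

A file where has_parent is FALSE and the only types are miRNA_primary_transcript and miRNA has no gene/transcript/exon hierarchy to load, which is the situation described above.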

Cheers,

H.


An update on this: starting with GenomicFeatures 1.22.2, makeTxDbFromGFF() supports the GFF files from miRBase. GenomicFeatures 1.22.2 will become available in BioC 3.2, the current release, over the weekend.

H.
