Existence of a makeTxDbFromGFF function in GenomicFeatures package?
5
0
Entering edit mode
pterry • 0
@pterry-6902
Last seen 2.4 years ago
United States

Objective: I am following 'C. Differential Expression' file from bioconductor workshop, Apr, 2015 (section 5. Reduction, Setup, part c). Would like to create a TxDb object from the GFF3 file (ftp://ftp.ensemblgenomes.org/pub/plants/release-26/gff3/setaria_italica/) for the Setaria italica genome used to align reads and using the TxDb to get all exons, grouped by gene.

I ran the following code getting the following error:

> path <- "/Users/bterry/macbookpro2015/keenanres15/Sitalica/genome/gff3_ver26"
> gff3file <- file.path(path, "Setaria_italica.JGIv2.0.26.gff3")
> gff3file
[1] "/Users/bterry/macbookpro2015/keenanres15/Sitalica/genome/gff3_ver26/Setaria_italica.JGIv2.0.26.gff3"
> library(GenomicFeatures)

> txdb <- makeTxDbFromGFF(gff3file, format="gtf", circ_seqs=character())
Error: could not find function "makeTxDbFromGFF"
> sessionInfo()
R version 3.1.3 (2015-03-09)
Platform: x86_64-apple-darwin13.4.0 (64-bit)
Running under: OS X 10.9.5 (Mavericks)

locale:
[1] en_US.UTF-8/en_US.UTF-8/en_US.UTF-8/C/en_US.UTF-8/en_US.UTF-8

attached base packages:
[1] stats4    parallel  stats     graphics  grDevices utils     datasets
[8] methods   base     

other attached packages:
[1] GenomicFeatures_1.18.7 AnnotationDbi_1.28.1   Biobase_2.26.0        
[4] GenomicRanges_1.18.3   GenomeInfoDb_1.2.3     IRanges_2.0.0         
[7] S4Vectors_0.4.0        BiocGenerics_0.12.1   

loaded via a namespace (and not attached):
 [1] base64enc_0.1-2         BatchJobs_1.5           BBmisc_1.8             
 [4] BiocParallel_1.0.0      biomaRt_2.22.0          Biostrings_2.34.0      
 [7] bitops_1.0-6            brew_1.0-6              checkmate_1.5.0        
[10] codetools_0.2-10        DBI_0.3.1               digest_0.6.6           
[13] fail_1.2                foreach_1.4.2           GenomicAlignments_1.2.1
[16] iterators_1.0.7         RCurl_1.95-4.5          Rsamtools_1.18.2       
[19] RSQLite_1.0.0           rtracklayer_1.26.3      sendmailR_1.2-1        
[22] stringr_0.6.2           tools_3.1.3             XML_3.98-1.1           
[25] XVector_0.6.0           zlibbioc_1.12.0        
>

Looking through the GenomicFeatures vignette, I find mention of makeTxDbFromUCSC and makeTxDbFromBiomart, but no mention of makeTxDbFromGFF?

Thanks for comments,

 

genomicfeatures • 6.5k views
ADD COMMENT
1
Entering edit mode
@herve-pages-1542
Last seen 57 minutes ago
Seattle, WA, United States

Hi Philip,

The landing page for the SeattleApr2015 workshop (http://bioconductor.org/help/course-materials/2015/SeattleApr2015/) doesn't say it but this workshop was based on BioC 3.1 (upcoming release, scheduled for April 17, based on R 3.2). You can actually see this here:

    http://bioconductor.org/help/course-materials/

In BioC 3.1, all the makeTranscriptDbFrom*() functions were renamed makeTxDbFrom*(). The old names are still working but are now deprecated.

H.

ADD COMMENT
0
Entering edit mode
pterry • 0
@pterry-6902
Last seen 2.4 years ago
United States

Hi Herve,

Followup to my submission to bioc. support: Question: Existence of a makeTxDbFromGFF function in GenomicFeatures package?

1) I updated to R 3.2 after 04/17/15
2) I followed directions on http://www.bioconductor.org/install/ page to install latest ver. of bioconductor.
3) Then ran previous code again with following error:
library(GenomicFeatures)
path <- "/Users/bterry/macbookpro2015/keenanres15/Sitalica26/genome/gff3_ver26"
path
gff3file <- file.path(path, "Setaria_italica.JGIv2.0.26.gff3")
gff3file
> txdb <- makeTxDbFromGFF(gff3file, format="gtf", circ_seqs=character())
Error in .parse_attrCol(attrCol, file, colnames) :
  Some attributes do not conform to 'tag value' format
In addition: Warning message:
In .local(con, format, text, ...) :
  gff-version directive indicates version is 3, not 2
>
> sessionInfo()
R version 3.2.0 (2015-04-16)
Platform: x86_64-apple-darwin13.4.0 (64-bit)
Running under: OS X 10.10.3 (Yosemite)

locale:
[1] en_US.UTF-8/en_US.UTF-8/en_US.UTF-8/C/en_US.UTF-8/en_US.UTF-8

attached base packages:
[1] stats4    parallel  stats     graphics  grDevices utils     datasets
[8] methods   base     

other attached packages:
[1] GenomicFeatures_1.20.0 AnnotationDbi_1.30.0   Biobase_2.28.0        
[4] GenomicRanges_1.20.1   GenomeInfoDb_1.4.0     IRanges_2.2.0         
[7] S4Vectors_0.6.0        BiocGenerics_0.14.0   

loaded via a namespace (and not attached):
 [1] XML_3.98-1.1            Rsamtools_1.20.0        Biostrings_2.36.0      
 [4] GenomicAlignments_1.4.0 bitops_1.0-6            futile.options_1.0.0   
 [7] DBI_0.3.1               RSQLite_1.0.0           zlibbioc_1.14.0        
[10] XVector_0.8.0           futile.logger_1.4       lambda.r_1.1.7         
[13] BiocParallel_1.2.0      tools_3.2.0             biomaRt_2.24.0         
[16] RCurl_1.95-4.5          rtracklayer_1.28.0     
>

4) I looked into perhaps trying to convert my ggf3 file to a gtf file as the example in the Apr., 2015 bioc. wkshop, but will first request your advice as to how to proceed.

Thanks,

 

ADD COMMENT
0
Entering edit mode

Hi Philip,

Looks like your file is GFF, not GTF. Try again with format="gff3" (which is actually the default).

Cheers,

H.

ADD REPLY
0
Entering edit mode
pterry • 0
@pterry-6902
Last seen 2.4 years ago
United States

Hi Herve,

Looks like I succeeded in creating a TxDb file from my gff3 file. My question here involves some additional checking I have been doing on that TxDb file before preceeding to the 'Counting' section following part c referred to in my initial question at the beginning of this thread.

> gff3file <- file.path(path, "Setaria_italica.JGIv2.0.26.gff3")
> gff3file
[1] "/Users/bterry/macbookpro2015/keenanres15/Sitalica26/genome/gff3_ver26/Setaria_italica.JGIv2.0.26.gff3"
> txdb <- makeTxDbFromGFF(gff3file, format="gff3", circ_seqs=character())
Prepare the 'metadata' data frame ... metadata: OK
> genes <- exonsBy(txdb, by="gene")
> genes
GRangesList object of length 0:
<0 elements>

-------
seqinfo: 76 sequences from an unspecified genome; no seqlengths

So there are **no** genes listed in this GRangesList? I note Setaria_italica.JGIv2.0.26.gff3 has 35471 records or rows with 'gene' in column 3 of its gff3 format.

In comparison, for the TxDb package: TxDb.Hsapiens.UCSC.hg19.knownGene

> txdbhg19 <- TxDb.Hsapiens.UCSC.hg19.knownGene
> geneshg19 <- exonsBy(txdbhg19, by="gene")
> geneshg19
GRangesList object of length 23459:
$1
GRanges object with 15 ranges and 2 metadata columns:
       seqnames               ranges strand   |   exon_id   exon_name
          <Rle>            <IRanges>  <Rle>   | <integer> <character>
   [1]    chr19 [58858172, 58858395]      -   |    250809        <NA>
   [2]    chr19 [58858719, 58859006]      -   |    250810        <NA>
   [3]    chr19 [58859832, 58860494]      -   |    250811        <NA>
   [4]    chr19 [58860934, 58862017]      -   |    250812        <NA>
   [5]    chr19 [58861736, 58862017]      -   |    250813        <NA>
   ...      ...                  ...    ... ...       ...         ...
  [11]    chr19 [58868951, 58869015]      -   |    250821        <NA>
  [12]    chr19 [58869318, 58869652]      -   |    250822        <NA>
  [13]    chr19 [58869855, 58869951]      -   |    250823        <NA>
  [14]    chr19 [58870563, 58870689]      -   |    250824        <NA>
  [15]    chr19 [58874043, 58874214]      -   |    250825        <NA>

$10
GRanges object with 2 ranges and 2 metadata columns:
      seqnames               ranges strand | exon_id exon_name
  [1]     chr8 [18248755, 18248855]      + |  113603      <NA>
  [2]     chr8 [18257508, 18258723]      + |  113604      <NA>

...
<23457 more elements>
-------
seqinfo: 93 sequences (1 circular) from hg19 genome
> sessionInfo()
R version 3.2.0 (2015-04-16)
Platform: x86_64-apple-darwin13.4.0 (64-bit)
Running under: OS X 10.10.3 (Yosemite)

locale:
[1] en_US.UTF-8/en_US.UTF-8/en_US.UTF-8/C/en_US.UTF-8/en_US.UTF-8

attached base packages:
[1] stats4    parallel  stats     graphics  grDevices utils     datasets
[8] methods   base     

other attached packages:
 [1] TxDb.Hsapiens.UCSC.hg19.knownGene_3.1.2
 [2] BiocInstaller_1.18.1                   
 [3] GenomicFeatures_1.20.0                 
 [4] AnnotationDbi_1.30.0                   
 [5] Biobase_2.28.0                         
 [6] GenomicRanges_1.20.2                   
 [7] GenomeInfoDb_1.4.0                     
 [8] IRanges_2.2.1                          
 [9] S4Vectors_0.6.0                        
[10] BiocGenerics_0.14.0                    

loaded via a namespace (and not attached):
 [1] XVector_0.8.0           zlibbioc_1.14.0         GenomicAlignments_1.4.0
 [4] BiocParallel_1.2.0      tools_3.2.0             DBI_0.3.1              
 [7] lambda.r_1.1.7          futile.logger_1.4.1     rtracklayer_1.28.2     
[10] futile.options_1.0.0    bitops_1.0-6            RCurl_1.95-4.5         
[13] biomaRt_2.24.0          RSQLite_1.0.0           Biostrings_2.36.0      
[16] Rsamtools_1.20.1        XML_3.98-1.1           
>

So geneshg19 has GRangesList object of length 23459

What to think about this difference, 0 compared to 23459?


Thanks for comments,

 

ADD COMMENT
0
Entering edit mode
@herve-pages-1542
Last seen 57 minutes ago
Seattle, WA, United States

Hi Philip,

Hard to know for sure without having access to your GFF3 file but the reason why exonsBy(txdb, by="gene") returns an empty GRangesList object is probably because your TxDb object doesn't contain any gene. TxDb objects are transcript-centric and transcripts are not always linked to a gene id. For example, this is the case for more than 12% of the transcripts in TxDb.Hsapiens.UCSC.hg19.knownGene:

library(TxDb.Hsapiens.UCSC.hg19.knownGene)
txlens <- transcriptLengths(TxDb.Hsapiens.UCSC.hg19.knownGene)
head(txlens)
##   tx_id    tx_name   gene_id nexon tx_len
## 1     1 uc001aaa.3 100287102     3   1652
## 2     2 uc010nxq.1 100287102     3   1488
## 3     3 uc010nxr.1 100287102     3   1595
## 4     4 uc001aal.1     79501     1    918
## 5     5 uc001aaq.2      <NA>     1     32
## 6     6 uc001aar.2      <NA>     1     62
table(is.na(txlens$gene_id))
## FALSE  TRUE 
## 73432  9528 

In the particular case of your GFF3 file (Setaria_italica.JGIv2.0.26.gff3), it seems that no transcript (or exon) at all is linked to a gene id (exons can only be linked to a gene id via a transcript). The fact that you see tens of thousands of features of type "gene" in the file suggests that for some reason the features representing transcripts could not be linked to these genes. This could be due to (a) a bug in makeTxDbFromGRanges() (the workhorse behind makeTxDbFromGFF()), (b) some kind of non-conformance of your file to the GFF3 specs which makeTxDbFromGRanges() doesn't know how to handle yet, or (c) it could be "real" i.e. the file actually doesn't contain any information linking the transcript to these genes.

The only way for me to tell for sure would be to have access to this file.

Thanks,

H.

ADD COMMENT
0
Entering edit mode

1) The **goal** for my submitted questions is to follow the procedure in the 'C. Differential Expression' presentation, April 06-07, 2015, bioconductor workshop, to prepare a SummarizedExperiment for subsequent analysis using DESeq2.

Can I accomplish this goal starting with Setaria_italica.JGIv2.0.26.gff3?

2) url for page with Setaria italica gff3 is
ftp://ftp.ensemblgenomes.org/pub/plants/release-26/gff3/setaria_italica/

3) I note a TxDb object (for Setaria_italica.JGIv2.0.26.gff3) selected by "tx" rather than 'gene' gives 40599 GRangesList objects rather than 0 (reference my previous submitted question on this thread).

> grlsitalica <- exonsBy(txdb, "tx", use.names=TRUE)
Warning message:
In .set.group.names(ans, use.names, txdb, by) :
  some group names are NAs or duplicated
> grlsitalica
GRangesList object of length 40599:
$NA
GRanges object with 2 ranges and 3 metadata columns:
      seqnames         ranges strand |   exon_id       exon_name exon_rank
         <Rle>      <IRanges>  <Rle> | <integer>     <character> <integer>
  [1]        1 [10197, 13134]      + |         1 Si016158m.exon1         1
  [2]        1 [13242, 13618]      + |         2 Si016158m.exon2         2

$NA
GRanges object with 5 ranges and 3 metadata columns:
      seqnames         ranges strand | exon_id       exon_name exon_rank
  [1]        1 [55108, 55372]      + |       3 Si018185m.exon1         1
  [2]        1 [56363, 56499]      + |       4 Si018185m.exon2         2
  [3]        1 [56589, 56675]      + |       5 Si018185m.exon3         3
  [4]        1 [56777, 56935]      + |       6 Si018185m.exon4         4
  [5]        1 [57067, 57492]      + |       7 Si018185m.exon5         5

$NA
GRanges object with 2 ranges and 3 metadata columns:
      seqnames         ranges strand | exon_id       exon_name exon_rank
  [1]        1 [74818, 76150]      + |       8 Si017407m.exon1         1
  [2]        1 [76236, 77420]      + |      10 Si017407m.exon2         2

...
<40596 more elements>
-------
seqinfo: 76 sequences from an unspecified genome; no seqlengths
>
sessionInfo()  # same as in my previous question in this thread.


Note: I find 40599 records with 'transcript' in column 3 of the gff3 file, **matching** the number of GRangesList objects in grlsitalica.


Thanks for comments,

ADD REPLY
0
Entering edit mode
pterry • 0
@pterry-6902
Last seen 2.4 years ago
United States

1) The **goal** for my submitted questions is to follow the procedure in the 'C. Differential Expression' presentation, April 06-07, 2015, bioconductor workshop, to prepare a SummarizedExperiment for subsequent analysis using DESeq2.

Can I accomplish this goal starting with Setaria_italica.JGIv2.0.26.gff3?

2) url for page with Setaria italica gff3 is
ftp://ftp.ensemblgenomes.org/pub/plants/release-26/gff3/setaria_italica/

3) I note a TxDb object (for Setaria_italica.JGIv2.0.26.gff3) selected by "tx" rather than 'gene' gives 40599 GRangesList objects rather than 0 (reference my previous submitted question on this thread).

> grlsitalica <- exonsBy(txdb, "tx", use.names=TRUE)
Warning message:
In .set.group.names(ans, use.names, txdb, by) :
  some group names are NAs or duplicated
> grlsitalica
GRangesList object of length 40599:
$NA
GRanges object with 2 ranges and 3 metadata columns:
      seqnames         ranges strand |   exon_id       exon_name exon_rank
         <Rle>      <IRanges>  <Rle> | <integer>     <character> <integer>
  [1]        1 [10197, 13134]      + |         1 Si016158m.exon1         1
  [2]        1 [13242, 13618]      + |         2 Si016158m.exon2         2

$NA
GRanges object with 5 ranges and 3 metadata columns:
      seqnames         ranges strand | exon_id       exon_name exon_rank
  [1]        1 [55108, 55372]      + |       3 Si018185m.exon1         1
  [2]        1 [56363, 56499]      + |       4 Si018185m.exon2         2
  [3]        1 [56589, 56675]      + |       5 Si018185m.exon3         3
  [4]        1 [56777, 56935]      + |       6 Si018185m.exon4         4
  [5]        1 [57067, 57492]      + |       7 Si018185m.exon5         5

$NA
GRanges object with 2 ranges and 3 metadata columns:
      seqnames         ranges strand | exon_id       exon_name exon_rank
  [1]        1 [74818, 76150]      + |       8 Si017407m.exon1         1
  [2]        1 [76236, 77420]      + |      10 Si017407m.exon2         2

...
<40596 more elements>
-------
seqinfo: 76 sequences from an unspecified genome; no seqlengths
>
sessionInfo()  # same as in my previous question in this thread.


Note: I find 40599 records with 'transcript' in column 3 of the gff3 file, **matching** the number of GRangesList objects in grlsitalica.


Thanks for comments,

ADD COMMENT

Login before adding your answer.

Traffic: 1629 users visited in the last hour
Help About
FAQ
Access RSS
API
Stats

Use of this site constitutes acceptance of our User Agreement and Privacy Policy.

Powered by the version 2.3.6