Building sqlite file from gtf
3
0
Entering edit mode
@patrick-schorderet-6081
Last seen 8.9 years ago
United States

Hi all,

 

I am trying to download/create a sqlite file for mm9 for some RNAseq analysis. 

So far, I have tried various things, none of which have worked. The latest strategy I have too is to download a .gtf file from NCBI and create a sqlite file as follows:

 

Step1: downlaod gtf file from here:  http://genome.ucsc.edu/cgi-bin/hgTables

Screen Shot 2014-12-16 at 16.51.12.png

 

 

Step1: Build .sqlite file as follows:

 

library(GenomicFeatures)

txdb <- makeTranscriptDbFromGFF(file="~/ReferenceFiles/mm9.gtf",

                                  format="gff3",

                                  dataSource="NCBI",

                                  species="Mus musculus")

 

saveDb(txdb, file=“~/ReferenceFiles/mm.sqlite”)

 

I get the follow gin error:

Error in .parse_attrCol(attrCol, file, colnames) : 

Some attributes do not conform to 'tag=value’ format

 

Any help would be greatly appreciated!

Thanks a lot in advance,

 

 

> sessionInfo()

R version 3.1.2 (2014-10-31)

Platform: x86_64-apple-darwin10.8.0 (64-bit)

 

locale:

[1] en_US.UTF-8/en_US.UTF-8/en_US.UTF-8/C/en_US.UTF-8/en_US.UTF-8

 

attached base packages:

 [1] grid      parallel  stats4    stats     graphics  grDevices utils    

 [8] datasets  methods   base     

 

other attached packages:

 [1] GenomicFeatures_1.18.2  VennDiagram_1.6.9       gtools_3.4.1           

 [4] edgeR_3.8.5             limma_3.22.1            GenomicAlignments_1.2.1

 [7] Rsamtools_1.18.2        Biostrings_2.34.0       XVector_0.6.0          

[10] GenomicRanges_1.18.3    AnnotationDbi_1.28.1    GenomeInfoDb_1.2.3     

[13] IRanges_2.0.0           S4Vectors_0.4.0         Biobase_2.26.0         

[16] BiocGenerics_0.12.1    

 

loaded via a namespace (and not attached):

 [1] base64enc_0.1-2    BatchJobs_1.5      BBmisc_1.8         BiocParallel_1.0.0

 [5] biomaRt_2.22.0     bitops_1.0-6       brew_1.0-6         checkmate_1.5.0   

 [9] codetools_0.2-9    DBI_0.3.1          digest_0.6.4       fail_1.2          

[13] foreach_1.4.2      iterators_1.0.7    RCurl_1.95-4.4     RSQLite_1.0.0     

[17] rtracklayer_1.26.2 sendmailR_1.2-1    stringr_0.6.2      tools_3.1.2       

[21] XML_3.98-1.1       zlibbioc_1.12.0   

gtf sqlite mm9 • 3.9k views
ADD COMMENT
0
Entering edit mode
Julian Gehring ★ 1.3k
@julian-gehring-5818
Last seen 4.9 years ago

Your input seems to be a GTF file, while you specify the GFF format for importing. Try setting the format argument to gtf instead of gff3.

ADD COMMENT
0
Entering edit mode
@herve-pages-1542
Last seen 16 hours ago
Seattle, WA, United States

Hi Patrick,

Any reason you don't use TxDb.Mmusculus.UCSC.mm9.knownGene?

library(TxDb.Mmusculus.UCSC.mm9.knownGene)
txdb <- TxDb.Mmusculus.UCSC.mm9.knownGene
txdb
> txdb
# TxDb object:
# | Db type: TxDb
# | Supporting package: GenomicFeatures
# | Data source: UCSC
# | Genome: mm9
# | Organism: Mus musculus
# | UCSC Table: knownGene
# | Resource URL: http://genome.ucsc.edu/
# | Type of Gene ID: Entrez Gene ID
# | Full dataset: yes
# | miRBase build ID: NA
# | transcript_nrow: 55419
# | exon_nrow: 246570
# | cds_nrow: 213117
# | Db created by: GenomicFeatures package from Bioconductor
# | Creation time: 2014-09-26 11:19:12 -0700 (Fri, 26 Sep 2014)
# | GenomicFeatures version at creation time: 1.17.17
# | RSQLite version at creation time: 0.11.4
# | DBSCHEMAVERSION: 1.0

 

We have a bunch of pre-made TxDb packages. I would recommend that you always check here:

  http://bioconductor.org/packages/release/BiocViews.html#___TxDb

to see what we have in store before you try to make your own. If we don't have what you are looking for but your organism is supported by UCSC, try makeTxDbFromUCSC() to make a TxDb for a UCSC genome/track. For example the object in TxDb.Mmusculus.UCSC.mm9.knownGene was made with:

  txdb <- makeTxDbFromUCSC("mm9", "knownGene") 

Cheers,

H.

 

ADD COMMENT
0
Entering edit mode

Hi Hervé,

Thanks for the answer. This is indeed what I need, it is awesome!

Just wondering whether there is an easy way to get the gene names (for example have actin, gaped, etc) instead of numbers. 

 

library(TxDb.Mmusculus.UCSC.mm9.knownGene)
refGene <- TxDb.Mmusculus.UCSC.mm9.knownGene
refGene
eByg <- exonsBy(refGene, by=c("gene"))
length(eByg)
names(eByg)

I get the 'correct' ballpark number of genes (21761), but when I try to get the names, I get numbers instead of comprehensible genes names.

 

GRangesList object of length 21761:
$100009600 
GRanges object with 5 ranges and 2 metadata columns:
      seqnames               ranges strand |   exon_id   exon_name
         <Rle>            <IRanges>  <Rle> | <integer> <character>
  [1]     chr9 [20866837, 20867161]      - |    129351        <NA>
  [2]     chr9 [20867338, 20867431]      - |    129352        <NA>
  [3]     chr9 [20867758, 20867840]      - |    129353        <NA>
  [4]     chr9 [20870468, 20870821]      - |    129354        <NA>
  [5]     chr9 [20871384, 20872369]      - |    129355        <NA>

 

 

Thanks for any help. 

p.

ADD REPLY
0
Entering edit mode
@patrick-schorderet-6081
Last seen 8.9 years ago
United States

Hi Hervé,

Thanks for the answer. This is indeed what I need, it is awesome!

Just wondering whether there is an easy way to get the gene names (for example have actin, gaped, etc) instead of numbers. 

 

library(TxDb.Mmusculus.UCSC.mm9.knownGene)
refGene <- TxDb.Mmusculus.UCSC.mm9.knownGene
refGene
eByg <- exonsBy(refGene, by=c("gene"))
length(eByg)
names(eByg)

I get the 'correct' ballpark number of genes (21761), but when I try to get the names, I get numbers instead of comprehensible genes names.

 

GRangesList object of length 21761:
$100009600 
GRanges object with 5 ranges and 2 metadata columns:
      seqnames               ranges strand |   exon_id   exon_name
         <Rle>            <IRanges>  <Rle> | <integer> <character>
  [1]     chr9 [20866837, 20867161]      - |    129351        <NA>
  [2]     chr9 [20867338, 20867431]      - |    129352        <NA>
  [3]     chr9 [20867758, 20867840]      - |    129353        <NA>
  [4]     chr9 [20870468, 20870821]      - |    129354        <NA>
  [5]     chr9 [20871384, 20872369]      - |    129355        <NA>

 

 

Thanks for any help. 

p.

ADD COMMENT
1
Entering edit mode

Hi Patrick,

You have replied to Herve by typing in the box under 'Add your answer', rather than by clicking the 'Add Reply' link in his comment. You should probably do the latter, as you are adding a reply, rather than an answer.

To answer your question, the names of the GRangesList that you get are Entrez Gene IDs, which are by and large unique values for each gene. Gene symbols are not guaranteed to be unique, nor are you guaranteed to get one for each gene. Plus, you are using an old version of the mouse genome (circa 2007, which is ancient in genomics), and the annotation package that is supplied by Bioconductor is based on mm10.

So assuming you want to use a current genome build, you could do this:

> library(TxDb.Mmusculus.UCSC.mm10.knownGene)
> library(org.Mm.eg.db)
> eByg <- exonsBy(TxDb.Mmusculus.UCSC.mm10.knownGene, by = "gene")
> symb <- select(org.Mm.eg.db, names(eByg), "SYMBOL")
> head(symb)
   ENTREZID        SYMBOL
1 100009600         Zglp1
2 100009609       Vmn2r65
3 100009614       Gm10024
4 100009664 F630206G17Rik
5    100012          Oog3
6    100017       Ldlrap1
> sum(is.na(symb[,2]))
[1] 0
> any(duplicated(symb[,2]))
[1] FALSE

> all.equal(names(eByg), symb[,1])
[1] TRUE
> names(eByg) <- as.character(symb[,2])
> eByg
GRangesList object of length 23725:
$Zglp1
GRanges object with 7 ranges and 2 metadata columns:
      seqnames               ranges strand |   exon_id   exon_name
         <Rle>            <IRanges>  <Rle> | <integer> <character>
  [1]     chr9 [21062393, 21062717]      - |    133241        <NA>
  [2]     chr9 [21062894, 21062987]      - |    133242        <NA>
  [3]     chr9 [21063314, 21063396]      - |    133243        <NA>
  [4]     chr9 [21066024, 21066377]      - |    133244        <NA>
  [5]     chr9 [21066940, 21067925]      - |    133245        <NA>
  [6]     chr9 [21068030, 21068117]      - |    133246        <NA>
  [7]     chr9 [21073075, 21075496]      - |    133248        <NA>

$Vmn2r65
GRanges object with 6 ranges and 2 metadata columns:
      seqnames               ranges strand | exon_id exon_name
  [1]     chr7 [84940169, 84941088]      - |  108923      <NA>
  [2]     chr7 [84943141, 84943264]      - |  108924      <NA>
  [3]     chr7 [84943504, 84943722]      - |  108925      <NA>
  [4]     chr7 [84946200, 84947000]      - |  108926      <NA>
  [5]     chr7 [84947372, 84947651]      - |  108927      <NA>
  [6]     chr7 [84963816, 84964009]      - |  108928      <NA>

$Gm10024
GRanges object with 1 range and 2 metadata columns:
      seqnames               ranges strand | exon_id exon_name
  [1]    chr10 [77711446, 77712009]      + |  142611      <NA>

...
<23722 more elements>
-------
seqinfo: 66 sequences (1 circular) from mm10 genome

If we try to mix the mm9 TxDb with the current org.Mm.eg.db, we run into problems:

> eByg2 <- exonsBy(TxDb.Mmusculus.UCSC.mm9.knownGene, by="gene")
> symb2 <- select(org.Mm.eg.db, names(eByg2), "SYMBOL")
> sum(is.na(symb2[,2]))
[1] 90
> sum(duplicated(symb2[,2]))
[1] 89

I suppose you could do something to get around these problems, but using the most current genome build seems the most logical thing to do.

ADD REPLY
0
Entering edit mode

Hi James,

Thanks for the answer. I think you are right and I will use the newest version. The code seems to work fine - I just have to realign my data now... :-(

ADD REPLY

Login before adding your answer.

Traffic: 950 users visited in the last hour
Help About
FAQ
Access RSS
API
Stats

Use of this site constitutes acceptance of our User Agreement and Privacy Policy.

Powered by the version 2.3.6