Hi all,
I am trying to download/create a sqlite file for mm9 for some RNAseq analysis.
So far, I have tried various things, none of which have worked. My latest strategy is to download a .gtf file from NCBI and create a sqlite file as follows:
Step 1: Download the gtf file from here: http://genome.ucsc.edu/cgi-bin/hgTables
Step 2: Build the .sqlite file as follows:
library(GenomicFeatures)
txdb <- makeTranscriptDbFromGFF(file="~/ReferenceFiles/mm9.gtf",
format="gff3",
dataSource="NCBI",
species="Mus musculus")
saveDb(txdb, file="~/ReferenceFiles/mm.sqlite")
I get the following error:
Error in .parse_attrCol(attrCol, file, colnames) :
Some attributes do not conform to 'tag=value' format
Any help would be greatly appreciated!
Thanks a lot in advance,
> sessionInfo()
R version 3.1.2 (2014-10-31)
Platform: x86_64-apple-darwin10.8.0 (64-bit)
locale:
[1] en_US.UTF-8/en_US.UTF-8/en_US.UTF-8/C/en_US.UTF-8/en_US.UTF-8
attached base packages:
[1] grid parallel stats4 stats graphics grDevices utils
[8] datasets methods base
other attached packages:
[1] GenomicFeatures_1.18.2 VennDiagram_1.6.9 gtools_3.4.1
[4] edgeR_3.8.5 limma_3.22.1 GenomicAlignments_1.2.1
[7] Rsamtools_1.18.2 Biostrings_2.34.0 XVector_0.6.0
[10] GenomicRanges_1.18.3 AnnotationDbi_1.28.1 GenomeInfoDb_1.2.3
[13] IRanges_2.0.0 S4Vectors_0.4.0 Biobase_2.26.0
[16] BiocGenerics_0.12.1
loaded via a namespace (and not attached):
[1] base64enc_0.1-2 BatchJobs_1.5 BBmisc_1.8 BiocParallel_1.0.0
[5] biomaRt_2.22.0 bitops_1.0-6 brew_1.0-6 checkmate_1.5.0
[9] codetools_0.2-9 DBI_0.3.1 digest_0.6.4 fail_1.2
[13] foreach_1.4.2 iterators_1.0.7 RCurl_1.95-4.4 RSQLite_1.0.0
[17] rtracklayer_1.26.2 sendmailR_1.2-1 stringr_0.6.2 tools_3.1.2
[21] XML_3.98-1.1 zlibbioc_1.12.0
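A note on the error above, offered as a guess rather than a confirmed diagnosis: the message about 'tag=value' format is what one would expect if a GTF file is read with format="gff3". GFF3 writes column 9 attributes as tag=value pairs, while GTF writes them as tag "value"; pairs. If mm9.gtf really is GTF, passing format="gtf" to makeTranscriptDbFromGFF may be worth trying. The base-R sketch below shows the difference between the two attribute styles; the helper name looks_like_gff3 and both example strings are made up for illustration and are not taken from the actual file.

```r
# Report whether every ';'-separated field of a column-9 attribute string
# has the GFF3 'tag=value' shape. Hypothetical helper, base R only.
looks_like_gff3 <- function(attr) {
  fields <- strsplit(attr, ";", fixed = TRUE)[[1]]
  # Trim leading and trailing whitespace from each field.
  fields <- gsub("^[[:space:]]+|[[:space:]]+$", "", fields)
  # Drop empty fields left over from separators.
  fields <- fields[nzchar(fields)]
  length(fields) > 0 && all(grepl("^[^=[:space:]]+=", fields))
}

# Made-up attribute strings in each style (not from mm9.gtf).
gtf_attr  <- 'gene_id "geneA"; transcript_id "txA";'
gff3_attr <- 'ID=txA;Parent=geneA'

gtf_is_gff3  <- looks_like_gff3(gtf_attr)
gff3_is_gff3 <- looks_like_gff3(gff3_attr)

# If the real file looks like the GTF style, the call to try would be
# (untested here, same arguments as in the post apart from format):
# txdb <- makeTranscriptDbFromGFF(file="~/ReferenceFiles/mm9.gtf",
#                                 format="gtf",
#                                 dataSource="NCBI",
#                                 species="Mus musculus")
```

Running the helper on the ninth tab-separated column of the first few lines of the downloaded file would be a quick way to check which style it uses before rebuilding the database.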
Hi Hervé,
Thanks for the answer. This is indeed what I need, it is awesome!
Just wondering whether there is an easy way to get the gene names (for example actin, Gapdh, etc.) instead of numbers.
library(TxDb.Mmusculus.UCSC.mm9.knownGene)
refGene <- TxDb.Mmusculus.UCSC.mm9.knownGene
refGene
eByg <- exonsBy(refGene, by=c("gene"))
length(eByg)
names(eByg)
I get the 'correct' ballpark number of genes (21761), but when I try to get the names, I get numbers instead of comprehensible gene names.
GRangesList object of length 21761:
$100009600
GRanges object with 5 ranges and 2 metadata columns:
seqnames ranges strand | exon_id exon_name
<Rle> <IRanges> <Rle> | <integer> <character>
[1] chr9 [20866837, 20867161] - | 129351 <NA>
[2] chr9 [20867338, 20867431] - | 129352 <NA>
[3] chr9 [20867758, 20867840] - | 129353 <NA>
[4] chr9 [20870468, 20870821] - | 129354 <NA>
[5] chr9 [20871384, 20872369] - | 129355 <NA>
Thanks for any help.
p.
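One possible approach to the gene-name question, sketched under the assumption that the numbers returned by names(eByg) are Entrez Gene IDs (which is how the UCSC knownGene TxDb packages identify genes): look them up in an organism annotation package such as org.Mm.eg.db with select(). The helper entrez_to_symbol and the small lookup table below are illustrative stand-ins so the example runs without any Bioconductor packages; the real table would come from the commented select() call.

```r
# Translate Entrez Gene IDs to symbols using a lookup table with
# ENTREZID and SYMBOL columns (the shape returned by AnnotationDbi::select).
# Hypothetical helper; match() keeps the first symbol found for each ID
# and gives NA for IDs that are not in the table.
entrez_to_symbol <- function(ids, lookup) {
  lookup$SYMBOL[match(ids, lookup$ENTREZID)]
}

# Hand-typed stand-in for a select() result, for illustration only.
toy_lookup <- data.frame(ENTREZID = c("11461", "14433"),
                         SYMBOL   = c("Actb", "Gapdh"),
                         stringsAsFactors = FALSE)

# The third ID is deliberately absent from the toy table.
toy_symbols <- entrez_to_symbol(c("14433", "11461", "999999999"), toy_lookup)

# With org.Mm.eg.db installed, the real lookup might look like this
# (assumes eByg from the post above):
# library(org.Mm.eg.db)
# lookup  <- select(org.Mm.eg.db, keys = names(eByg),
#                   columns = "SYMBOL", keytype = "ENTREZID")
# symbols <- entrez_to_symbol(names(eByg), lookup)
```

Keeping the Entrez IDs as the names of eByg and storing the symbols in a separate vector avoids problems with IDs that have no symbol, since NA is not a usable list name.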