Entering edit mode
Guido Hooiveld
★
4.1k
@guido-hooiveld-2020
Last seen 9 days ago
Wageningen University, Wageningen, the …
Hello,
I would like to make a transcript database for Chinese hamster
(Cricetulus griseus) using the function makeTranscriptDbFromGFF. Since
it is the first time I use this function, I do this per the code in
the vignette. However, I face some problems and would appreciate any
pointer.
I started by downloading the GFF (ref_CriGri_1.0_scaffolds.gff3) from
NCBI.
[ftp://ftp.ncbi.nlm.nih.gov/genomes/Cricetulus_griseus/GFF/]
Creating the txdb seems to go fine:
> txdb <-
makeTranscriptDbFromGFF(file="ref_CriGri_1.0_scaffolds.gff3",
format="gff3", dataSource="ftp://ftp.ncbi.nlm.nih.gov/genomes/Cricetul
us_griseus/GFF/", species="Cricetulus griseus")
extracting transcript information
Extracting gene IDs
extracting transcript information
Processing splicing information for gff3 file.
Deducing exon rank from relative coordinates provided
Prepare the 'metadata' data frame ... metadata: OK
Now generating chrominfo from available sequence names. No chromosome
length information is available.
Warning messages:
1: In .deduceExonRankings(exs, format = "gff") :
Infering Exon Rankings. If this is not what you expected, then
please be sure that you have provided a valid attribute for
exonRankAttributeName
2: In matchCircularity(chroms, circ_seqs) :
None of the strings in your circ_seqs argument match your seqnames.
# Characterization txdb
> txdb
TranscriptDb object:
| Db type: TranscriptDb
| Supporting package: GenomicFeatures
| Data source:
ftp://ftp.ncbi.nlm.nih.gov/genomes/Cricetulus_griseus/GFF/
| Organism: Cricetulus griseus
| miRBase build ID: NA
| transcript_nrow: 21612
| exon_nrow: 212163
| cds_nrow: 183531
| Db created by: GenomicFeatures package from Bioconductor
| Creation time: 2013-08-15 18:26:33 +0200 (Thu, 15 Aug 2013)
| GenomicFeatures version at creation time: 1.13.29
| RSQLite version at creation time: 0.11.4
| DBSCHEMAVERSION: 1.0
>
> columns(txdb)
[1] "CDSID" "CDSNAME" "CDSCHROM" "CDSSTRAND" "CDSSTART"
"CDSEND" "EXONID"
[8] "EXONNAME" "EXONCHROM" "EXONSTRAND" "EXONSTART" "EXONEND"
"GENEID" "TXID"
[15] "EXONRANK" "TXNAME" "TXCHROM" "TXSTRAND" "TXSTART"
"TXEND"
> keytypes(txdb)
[1] "GENEID" "TXID" "TXNAME" "EXONID" "EXONNAME" "CDSID"
"CDSNAME"
>
As a test, I would like to retrieve the transcript names for 3 Entrez
Gene IDs. Note: I selected these 3 IDs randomly; they are in the top5
-listed IDs in the GFF.
Unfortunately, this doesn't seem to work:
> keys <- c("100754456", "100759368", "100760238")
> select(txdb, keys = keys, columns="TXNAME", keytype="GENEID")
[1] GENEID TXNAME
<0 rows> (or 0-length row.names)
>
Note: the human example from the vignette worked fine.
I think this is somehow due to the fact that during the parsing of the
GFF the keytypes (?) are not properly recognized/linked. This is also
reflected by the plain use of 'rna' and 'gene' as names/IDs:
> GR <- transcripts(txdb, columns=c("tx_id", "tx_name", "gene_id",
"cds_id","cds_name"))
> head(GR)
GRanges with 6 ranges and 5 metadata columns:
seqnames ranges strand | tx_id tx_name
gene_id cds_id
<rle> <iranges> <rle> | <integer> <character>
<characterlist> <integerlist>
[1] NW_003613580.1 [ 736932, 738552] + | 1 rna0
gene2 1,2
[2] NW_003613580.1 [2073869, 2085513] + | 2 rna4
gene6 3,4,5,...
[3] NW_003613580.1 [2143936, 2241368] + | 3 rna6
gene8 8,9,10,...
[4] NW_003613580.1 [2390123, 2449470] + | 4 rna8
gene10 22,23,24,...
[5] NW_003613580.1 [2390123, 2555467] + | 5 rna7
gene10 22,23,24,...
[6] NW_003613580.1 [3070833, 3147709] + | 6 rna9
gene12 43,44,45,...
cds_name
<characterlist>
[1] NA
[2] NA
[3] NA
[4] NA
[5] NA
[6] NA
---
seqlengths:
NW_003613580.1 NW_003613581.1 NW_003613582.1 ... NW_003698075.1
NW_003704321.1 NW_003717424.1
NA NA NA ... NA
NA NA
>
> tail(GR)
GRanges with 6 ranges and 5 metadata columns:
seqnames ranges strand | tx_id tx_name
gene_id cds_id
<rle> <iranges> <rle> | <integer> <character>
<characterlist> <integerlist>
[1] NW_003694533.1 [29, 205] - | 21607 rna22504
gene25148 183527
[2] NW_003697941.1 [ 4, 135] - | 21608 rna22505
gene25149 183528
[3] NW_003698068.1 [ 1, 267] + | 21609 rna22506
gene25150 183529
[4] NW_003698075.1 [ 1, 267] - | 21610 rna22507
gene25151 NA
[5] NW_003704321.1 [24, 203] - | 21611 rna22508
gene25152 183530
[6] NW_003717424.1 [10, 189] + | 21612 rna22509
gene25155 183531
cds_name
<characterlist>
[1] NA
[2] NA
[3] NA
[4] NA
[5] NA
[6] NA
---
seqlengths:
NW_003613580.1 NW_003613581.1 NW_003613582.1 ... NW_003698075.1
NW_003704321.1 NW_003717424.1
NA NA NA ... NA
NA NA
>
Any suggestions on how to properly generate a txdb from this GFF would
be appreciated.
Thanks,
Guido
> sessionInfo()
R Under development (unstable) (2013-08-10 r63532)
Platform: x86_64-w64-mingw32/x64 (64-bit)
locale:
[1] LC_COLLATE=English_United States.1252 LC_CTYPE=English_United
States.1252
[3] LC_MONETARY=English_United States.1252 LC_NUMERIC=C
[5] LC_TIME=English_United States.1252
attached base packages:
[1] parallel stats graphics grDevices utils datasets
methods base
other attached packages:
[1] GenomicFeatures_1.13.29 AnnotationDbi_1.23.18 Biobase_2.21.6
GenomicRanges_1.13.36
[5] XVector_0.1.0 IRanges_1.19.21 BiocGenerics_0.7.3
loaded via a namespace (and not attached):
[1] biomaRt_2.17.2 Biostrings_2.29.14 bitops_1.0-5
BSgenome_1.29.1 DBI_0.2-7
[6] RCurl_1.95-4.1 Rsamtools_1.13.28 RSQLite_0.11.4
rtracklayer_1.21.9 stats4_3.1.0
[11] tools_3.1.0 XML_3.98-1.1 zlibbioc_1.7.0
>
---------------------------------------------------------
Guido Hooiveld, PhD
Nutrition, Metabolism & Genomics Group
Division of Human Nutrition
Wageningen University
Biotechnion, Bomenweg 2
NL-6703 HD Wageningen
the Netherlands
tel: (+)31 317 485788
fax: (+)31 317 483342
email: guido.hooiveld@wur.nl
internet: http://nutrigene.4t.com
http://scholar.google.com/citations?user=qFHaMnoAAAAJ
http://www.researcherid.com/rid/F-4912-2010
[[alternative HTML version deleted]]