problem generating Chinese hamster txdb from GFF file

0

Entering edit mode

Guido Hooiveld ★ 4.1k

@guido-hooiveld-2020

Last seen 9 days ago

Wageningen University, Wageningen, the …

Hello, I would like to make a transcript database for Chinese hamster (Cricetulus griseus) using the function makeTranscriptDbFromGFF. Since it is the first time I use this function, I do this per the code in the vignette. However, I face some problems and would appreciate any pointer. I started by downloading the GFF (ref_CriGri_1.0_scaffolds.gff3) from NCBI. [ftp://ftp.ncbi.nlm.nih.gov/genomes/Cricetulus_griseus/GFF/] Creating the txdb seems to go fine: > txdb <- makeTranscriptDbFromGFF(file="ref_CriGri_1.0_scaffolds.gff3", format="gff3", dataSource="ftp://ftp.ncbi.nlm.nih.gov/genomes/Cricetul us_griseus/GFF/", species="Cricetulus griseus") extracting transcript information Extracting gene IDs extracting transcript information Processing splicing information for gff3 file. Deducing exon rank from relative coordinates provided Prepare the 'metadata' data frame ... metadata: OK Now generating chrominfo from available sequence names. No chromosome length information is available. Warning messages: 1: In .deduceExonRankings(exs, format = "gff") : Infering Exon Rankings. If this is not what you expected, then please be sure that you have provided a valid attribute for exonRankAttributeName 2: In matchCircularity(chroms, circ_seqs) : None of the strings in your circ_seqs argument match your seqnames. # Characterization txdb > txdb TranscriptDb object: | Db type: TranscriptDb | Supporting package: GenomicFeatures | Data source: ftp://ftp.ncbi.nlm.nih.gov/genomes/Cricetulus_griseus/GFF/ | Organism: Cricetulus griseus | miRBase build ID: NA | transcript_nrow: 21612 | exon_nrow: 212163 | cds_nrow: 183531 | Db created by: GenomicFeatures package from Bioconductor | Creation time: 2013-08-15 18:26:33 +0200 (Thu, 15 Aug 2013) | GenomicFeatures version at creation time: 1.13.29 | RSQLite version at creation time: 0.11.4 | DBSCHEMAVERSION: 1.0 > > columns(txdb) [1] "CDSID" "CDSNAME" "CDSCHROM" "CDSSTRAND" "CDSSTART" "CDSEND" "EXONID" [8] "EXONNAME" "EXONCHROM" "EXONSTRAND" "EXONSTART" "EXONEND" "GENEID" "TXID" [15] "EXONRANK" "TXNAME" "TXCHROM" "TXSTRAND" "TXSTART" "TXEND" > keytypes(txdb) [1] "GENEID" "TXID" "TXNAME" "EXONID" "EXONNAME" "CDSID" "CDSNAME" > As a test, I would like to retrieve the transcript names for 3 Entrez Gene IDs. Note: I selected these 3 IDs randomly; they are in the top5 -listed IDs in the GFF. Unfortunately, this doesn't seem to work: > keys <- c("100754456", "100759368", "100760238") > select(txdb, keys = keys, columns="TXNAME", keytype="GENEID") [1] GENEID TXNAME <0 rows> (or 0-length row.names) > Note: the human example from the vignette worked fine. I think this is somehow due to the fact that during the parsing of the GFF the keytypes (?) are not properly recognized/linked. This is also reflected by the plain use of 'rna' and 'gene' as names/IDs: > GR <- transcripts(txdb, columns=c("tx_id", "tx_name", "gene_id", "cds_id","cds_name")) > head(GR) GRanges with 6 ranges and 5 metadata columns: seqnames ranges strand | tx_id tx_name gene_id cds_id <rle> <iranges> <rle> | <integer> <character> <characterlist> <integerlist> [1] NW_003613580.1 [ 736932, 738552] + | 1 rna0 gene2 1,2 [2] NW_003613580.1 [2073869, 2085513] + | 2 rna4 gene6 3,4,5,... [3] NW_003613580.1 [2143936, 2241368] + | 3 rna6 gene8 8,9,10,... [4] NW_003613580.1 [2390123, 2449470] + | 4 rna8 gene10 22,23,24,... [5] NW_003613580.1 [2390123, 2555467] + | 5 rna7 gene10 22,23,24,... [6] NW_003613580.1 [3070833, 3147709] + | 6 rna9 gene12 43,44,45,... cds_name <characterlist> [1] NA [2] NA [3] NA [4] NA [5] NA [6] NA --- seqlengths: NW_003613580.1 NW_003613581.1 NW_003613582.1 ... NW_003698075.1 NW_003704321.1 NW_003717424.1 NA NA NA ... NA NA NA > > tail(GR) GRanges with 6 ranges and 5 metadata columns: seqnames ranges strand | tx_id tx_name gene_id cds_id <rle> <iranges> <rle> | <integer> <character> <characterlist> <integerlist> [1] NW_003694533.1 [29, 205] - | 21607 rna22504 gene25148 183527 [2] NW_003697941.1 [ 4, 135] - | 21608 rna22505 gene25149 183528 [3] NW_003698068.1 [ 1, 267] + | 21609 rna22506 gene25150 183529 [4] NW_003698075.1 [ 1, 267] - | 21610 rna22507 gene25151 NA [5] NW_003704321.1 [24, 203] - | 21611 rna22508 gene25152 183530 [6] NW_003717424.1 [10, 189] + | 21612 rna22509 gene25155 183531 cds_name <characterlist> [1] NA [2] NA [3] NA [4] NA [5] NA [6] NA --- seqlengths: NW_003613580.1 NW_003613581.1 NW_003613582.1 ... NW_003698075.1 NW_003704321.1 NW_003717424.1 NA NA NA ... NA NA NA > Any suggestions on how to properly generate a txdb from this GFF would be appreciated. Thanks, Guido > sessionInfo() R Under development (unstable) (2013-08-10 r63532) Platform: x86_64-w64-mingw32/x64 (64-bit) locale: [1] LC_COLLATE=English_United States.1252 LC_CTYPE=English_United States.1252 [3] LC_MONETARY=English_United States.1252 LC_NUMERIC=C [5] LC_TIME=English_United States.1252 attached base packages: [1] parallel stats graphics grDevices utils datasets methods base other attached packages: [1] GenomicFeatures_1.13.29 AnnotationDbi_1.23.18 Biobase_2.21.6 GenomicRanges_1.13.36 [5] XVector_0.1.0 IRanges_1.19.21 BiocGenerics_0.7.3 loaded via a namespace (and not attached): [1] biomaRt_2.17.2 Biostrings_2.29.14 bitops_1.0-5 BSgenome_1.29.1 DBI_0.2-7 [6] RCurl_1.95-4.1 Rsamtools_1.13.28 RSQLite_0.11.4 rtracklayer_1.21.9 stats4_3.1.0 [11] tools_3.1.0 XML_3.98-1.1 zlibbioc_1.7.0 > --------------------------------------------------------- Guido Hooiveld, PhD Nutrition, Metabolism & Genomics Group Division of Human Nutrition Wageningen University Biotechnion, Bomenweg 2 NL-6703 HD Wageningen the Netherlands tel: (+)31 317 485788 fax: (+)31 317 483342 email: guido.hooiveld@wur.nl internet: http://nutrigene.4t.com http://scholar.google.com/citations?user=qFHaMnoAAAAJ http://www.researcherid.com/rid/F-4912-2010 [[alternative HTML version deleted]]

GO GenomicFeatures GO GenomicFeatures • 2.4k views

ADD COMMENT • link updated 11.4 years ago by Marc Carlson ★ 7.2k • written 11.4 years ago by Guido Hooiveld ★ 4.1k

0

Entering edit mode

Marc Carlson ★ 7.2k

@marc-carlson-2264

Last seen 8.4 years ago

United States

Hi Guido, The problem is that your keys are not valid. The keys for GENEID can be retrieved with the keys method like this: k <- keys(txdb, keytype="GENEID") And if we look at them we can see that they don't look anything like the keys you are trying to extract data for: head(k) [1] "gene10" "gene1000" "gene10000" "gene10001" "gene10002" "gene10003" So if you use keys like that (valid keys), then everything works fine. For example: select(txdb, keys = head(k), columns="TXNAME", keytype="GENEID") GENEID TXNAME 1 gene10 rna8 2 gene10 rna7 3 gene1000 rna917 4 gene10000 rna9527 5 gene10001 rna9528 6 gene10002 rna9530 7 gene10002 rna9529 8 gene10003 rna9531 Also I noticed that you are getting the warning about deducing the exon ranks. Are you sure you want to do that with a mammal? This file contains a field called "exon_number" that looks like something you should probably try to use. And finally, since you seem to want to use a real gene ID, you will have to extract those from the file separately. This is because whoever created this file decided that they wanted to encode the actual gene symbols here in a peculiar way. So you can do that step separately by using the rtracklayer import() method like this: library(rtracklayer) result <- import("ref_CriGri_1.0_scaffolds.gff3", format="gff3") If you look at what is in result, you will see (for example) that gene0 is cross referenced to GeneID:100754456 (for example). Hope this clarifies things, Marc On 08/15/2013 10:20 AM, Hooiveld, Guido wrote: > Hello, > I would like to make a transcript database for Chinese hamster (Cricetulus griseus) using the function makeTranscriptDbFromGFF. Since it is the first time I use this function, I do this per the code in the vignette. However, I face some problems and would appreciate any pointer. > > I started by downloading the GFF (ref_CriGri_1.0_scaffolds.gff3) from NCBI. > [ftp://ftp.ncbi.nlm.nih.gov/genomes/Cricetulus_griseus/GFF/] > > Creating the txdb seems to go fine: > >> txdb <- makeTranscriptDbFromGFF(file="ref_CriGri_1.0_scaffolds.gff3", format="gff3", dataSource="ftp://ftp.ncbi.nlm.nih.gov/genomes/Cricetul us_griseus/GFF/", species="Cricetulus griseus") > extracting transcript information > Extracting gene IDs > extracting transcript information > Processing splicing information for gff3 file. > Deducing exon rank from relative coordinates provided > Prepare the 'metadata' data frame ... metadata: OK > Now generating chrominfo from available sequence names. No chromosome length information is available. > Warning messages: > 1: In .deduceExonRankings(exs, format = "gff") : > Infering Exon Rankings. If this is not what you expected, then please be sure that you have provided a valid attribute for exonRankAttributeName > 2: In matchCircularity(chroms, circ_seqs) : > None of the strings in your circ_seqs argument match your seqnames. > > # Characterization txdb >> txdb > TranscriptDb object: > | Db type: TranscriptDb > | Supporting package: GenomicFeatures > | Data source: ftp://ftp.ncbi.nlm.nih.gov/genomes/Cricetulus_griseus/GFF/ > | Organism: Cricetulus griseus > | miRBase build ID: NA > | transcript_nrow: 21612 > | exon_nrow: 212163 > | cds_nrow: 183531 > | Db created by: GenomicFeatures package from Bioconductor > | Creation time: 2013-08-15 18:26:33 +0200 (Thu, 15 Aug 2013) > | GenomicFeatures version at creation time: 1.13.29 > | RSQLite version at creation time: 0.11.4 > | DBSCHEMAVERSION: 1.0 >> columns(txdb) > [1] "CDSID" "CDSNAME" "CDSCHROM" "CDSSTRAND" "CDSSTART" "CDSEND" "EXONID" > [8] "EXONNAME" "EXONCHROM" "EXONSTRAND" "EXONSTART" "EXONEND" "GENEID" "TXID" > [15] "EXONRANK" "TXNAME" "TXCHROM" "TXSTRAND" "TXSTART" "TXEND" >> keytypes(txdb) > [1] "GENEID" "TXID" "TXNAME" "EXONID" "EXONNAME" "CDSID" "CDSNAME" > As a test, I would like to retrieve the transcript names for 3 Entrez Gene IDs. Note: I selected these 3 IDs randomly; they are in the top5 -listed IDs in the GFF. > Unfortunately, this doesn't seem to work: >> keys <- c("100754456", "100759368", "100760238") >> select(txdb, keys = keys, columns="TXNAME", keytype="GENEID") > [1] GENEID TXNAME > <0 rows> (or 0-length row.names) > Note: the human example from the vignette worked fine. > > I think this is somehow due to the fact that during the parsing of the GFF the keytypes (?) are not properly recognized/linked. This is also reflected by the plain use of 'rna' and 'gene' as names/IDs: >> GR <- transcripts(txdb, columns=c("tx_id", "tx_name", "gene_id", "cds_id","cds_name")) >> head(GR) > GRanges with 6 ranges and 5 metadata columns: > seqnames ranges strand | tx_id tx_name gene_id cds_id > <rle> <iranges> <rle> | <integer> <character> <characterlist> <integerlist> > [1] NW_003613580.1 [ 736932, 738552] + | 1 rna0 gene2 1,2 > [2] NW_003613580.1 [2073869, 2085513] + | 2 rna4 gene6 3,4,5,... > [3] NW_003613580.1 [2143936, 2241368] + | 3 rna6 gene8 8,9,10,... > [4] NW_003613580.1 [2390123, 2449470] + | 4 rna8 gene10 22,23,24,... > [5] NW_003613580.1 [2390123, 2555467] + | 5 rna7 gene10 22,23,24,... > [6] NW_003613580.1 [3070833, 3147709] + | 6 rna9 gene12 43,44,45,... > cds_name > <characterlist> > [1] NA > [2] NA > [3] NA > [4] NA > [5] NA > [6] NA > --- > seqlengths: > NW_003613580.1 NW_003613581.1 NW_003613582.1 ... NW_003698075.1 NW_003704321.1 NW_003717424.1 > NA NA NA ... NA NA NA >> tail(GR) > GRanges with 6 ranges and 5 metadata columns: > seqnames ranges strand | tx_id tx_name gene_id cds_id > <rle> <iranges> <rle> | <integer> <character> <characterlist> <integerlist> > [1] NW_003694533.1 [29, 205] - | 21607 rna22504 gene25148 183527 > [2] NW_003697941.1 [ 4, 135] - | 21608 rna22505 gene25149 183528 > [3] NW_003698068.1 [ 1, 267] + | 21609 rna22506 gene25150 183529 > [4] NW_003698075.1 [ 1, 267] - | 21610 rna22507 gene25151 NA > [5] NW_003704321.1 [24, 203] - | 21611 rna22508 gene25152 183530 > [6] NW_003717424.1 [10, 189] + | 21612 rna22509 gene25155 183531 > cds_name > <characterlist> > [1] NA > [2] NA > [3] NA > [4] NA > [5] NA > [6] NA > --- > seqlengths: > NW_003613580.1 NW_003613581.1 NW_003613582.1 ... NW_003698075.1 NW_003704321.1 NW_003717424.1 > NA NA NA ... NA NA NA > > > Any suggestions on how to properly generate a txdb from this GFF would be appreciated. > > Thanks, > Guido > >> sessionInfo() > R Under development (unstable) (2013-08-10 r63532) > Platform: x86_64-w64-mingw32/x64 (64-bit) > > locale: > [1] LC_COLLATE=English_United States.1252 LC_CTYPE=English_United States.1252 > [3] LC_MONETARY=English_United States.1252 LC_NUMERIC=C > [5] LC_TIME=English_United States.1252 > > attached base packages: > [1] parallel stats graphics grDevices utils datasets methods base > > other attached packages: > [1] GenomicFeatures_1.13.29 AnnotationDbi_1.23.18 Biobase_2.21.6 GenomicRanges_1.13.36 > [5] XVector_0.1.0 IRanges_1.19.21 BiocGenerics_0.7.3 > > loaded via a namespace (and not attached): > [1] biomaRt_2.17.2 Biostrings_2.29.14 bitops_1.0-5 BSgenome_1.29.1 DBI_0.2-7 > [6] RCurl_1.95-4.1 Rsamtools_1.13.28 RSQLite_0.11.4 rtracklayer_1.21.9 stats4_3.1.0 > [11] tools_3.1.0 XML_3.98-1.1 zlibbioc_1.7.0 > > --------------------------------------------------------- > Guido Hooiveld, PhD > Nutrition, Metabolism & Genomics Group > Division of Human Nutrition > Wageningen University > Biotechnion, Bomenweg 2 > NL-6703 HD Wageningen > the Netherlands > tel: (+)31 317 485788 > fax: (+)31 317 483342 > email: guido.hooiveld at wur.nl > internet: http://nutrigene.4t.com > http://scholar.google.com/citations?user=qFHaMnoAAAAJ > http://www.researcherid.com/rid/F-4912-2010 > > > [[alternative HTML version deleted]] > > _______________________________________________ > Bioconductor mailing list > Bioconductor at r-project.org > https://stat.ethz.ch/mailman/listinfo/bioconductor > Search the archives: http://news.gmane.org/gmane.science.biology.informatics.conductor

ADD COMMENT • link 11.4 years ago Marc Carlson ★ 7.2k

0

Entering edit mode

Hi Marc, Thanks for helpful comments! Keys, keytypes, columns, select,.... still difficult for me... Moreover, it is the first time I am working with these type of files, so I still have to get used to all concepts and do's & don'ts. Anyway, I tried to regenerate the txdb utilizing the info on exon ranks. Unfortunately, this doesn't work in my hands... Am I missing something obviously, or is something going seriously wrong? Note that the error thrown is slightly different between the production and dev versions of the packages (see below). I will also check the functions in rtracklayer. Regards, Guido Dev: > txdb <- makeTranscriptDbFromGFF(file="ref_CriGri_1.0_scaffolds.gff3", format="gff3", exonRankAttributeName="exon_number", + dataSource="ftp://ftp.ncbi.nlm.nih.gov/genomes/Cricetulus_griseus/GF F", species="Cricetulus griseus") extracting transcript information Extracting gene IDs extracting transcript information Processing splicing information for gff3 file. Error in if (any(splicings$exon_rank == 0) && !any(splicings$exon_rank < : missing value where TRUE/FALSE needed > > sessionInfo() R Under development (unstable) (2013-08-10 r63532) Platform: x86_64-w64-mingw32/x64 (64-bit) locale: [1] LC_COLLATE=English_United States.1252 LC_CTYPE=English_United States.1252 LC_MONETARY=English_United States.1252 LC_NUMERIC=C [5] LC_TIME=English_United States.1252 attached base packages: [1] parallel stats graphics grDevices utils datasets methods base other attached packages: [1] GenomicFeatures_1.13.29 AnnotationDbi_1.23.18 Biobase_2.21.6 rtracklayer_1.21.9 GenomicRanges_1.13.36 XVector_0.1.0 IRanges_1.19.24 [8] BiocGenerics_0.7.4 BiocInstaller_1.11.4 loaded via a namespace (and not attached): [1] biomaRt_2.17.2 Biostrings_2.29.15 bitops_1.0-5 BSgenome_1.29.1 DBI_0.2-7 RCurl_1.95-4.1 Rsamtools_1.13.29 RSQLite_0.11.4 stats4_3.1.0 [10] tools_3.1.0 XML_3.98-1.1 zlibbioc_1.7.0 > Production: > txdb <- makeTranscriptDbFromGFF(file="ref_CriGri_1.0_scaffolds.gff3", format="gff3", exonRankAttributeName="exon_number", + dataSource="ftp://ftp.ncbi.nlm.nih.gov/genomes/Cricetulus_griseus/GF F", species="Cricetulus griseus") extracting transcript information Extracting gene IDs extracting transcript information Processing splicing information for gff3 file. Prepare the 'metadata' data frame ... metadata: OK Now generating chrominfo from available sequence names. No chromosome length information is available. Error in .normargSplicings(splicings, transcripts_tx_id) : 'splicings$exon_rank' must be an integer vector with no NAs In addition: Warning message: In matchCircularity(chroms, circ_seqs) : None of the strings in your circ_seqs argument match your seqnames. > sessionInfo() R version 3.0.1 Patched (2013-06-05 r62877) Platform: x86_64-w64-mingw32/x64 (64-bit) locale: [1] LC_COLLATE=English_United States.1252 LC_CTYPE=English_United States.1252 LC_MONETARY=English_United States.1252 LC_NUMERIC=C [5] LC_TIME=English_United States.1252 attached base packages: [1] parallel stats graphics grDevices utils datasets methods base other attached packages: [1] TxDb.Hsapiens.UCSC.hg19.knownGene_2.9.2 GenomicFeatures_1.12.3 AnnotationDbi_1.22.6 Biobase_2.20.1 [5] GenomicRanges_1.12.4 IRanges_1.18.2 BiocGenerics_0.6.0 loaded via a namespace (and not attached): [1] biomaRt_2.16.0 Biostrings_2.28.0 bitops_1.0-5 BSgenome_1.28.0 DBI_0.2-7 RCurl_1.95-4.1 Rsamtools_1.12.3 RSQLite_0.11.4 [9] rtracklayer_1.20.4 stats4_3.0.1 tools_3.0.1 XML_3.98-1.1 zlibbioc_1.6.0 > -----Original Message----- From: bioconductor-bounces@r-project.org [mailto:bioconductor- bounces@r-project.org] On Behalf Of Marc Carlson Sent: Thursday, August 15, 2013 22:44 To: bioconductor at r-project.org Subject: Re: [BioC] problem generating Chinese hamster txdb from GFF file Hi Guido, The problem is that your keys are not valid. The keys for GENEID can be retrieved with the keys method like this: k <- keys(txdb, keytype="GENEID") And if we look at them we can see that they don't look anything like the keys you are trying to extract data for: head(k) [1] "gene10" "gene1000" "gene10000" "gene10001" "gene10002" "gene10003" So if you use keys like that (valid keys), then everything works fine. For example: select(txdb, keys = head(k), columns="TXNAME", keytype="GENEID") GENEID TXNAME 1 gene10 rna8 2 gene10 rna7 3 gene1000 rna917 4 gene10000 rna9527 5 gene10001 rna9528 6 gene10002 rna9530 7 gene10002 rna9529 8 gene10003 rna9531 Also I noticed that you are getting the warning about deducing the exon ranks. Are you sure you want to do that with a mammal? This file contains a field called "exon_number" that looks like something you should probably try to use. And finally, since you seem to want to use a real gene ID, you will have to extract those from the file separately. This is because whoever created this file decided that they wanted to encode the actual gene symbols here in a peculiar way. So you can do that step separately by using the rtracklayer import() method like this: library(rtracklayer) result <- import("ref_CriGri_1.0_scaffolds.gff3", format="gff3") If you look at what is in result, you will see (for example) that gene0 is cross referenced to GeneID:100754456 (for example). Hope this clarifies things, Marc On 08/15/2013 10:20 AM, Hooiveld, Guido wrote: > Hello, > I would like to make a transcript database for Chinese hamster (Cricetulus griseus) using the function makeTranscriptDbFromGFF. Since it is the first time I use this function, I do this per the code in the vignette. However, I face some problems and would appreciate any pointer. > > I started by downloading the GFF (ref_CriGri_1.0_scaffolds.gff3) from NCBI. > [ftp://ftp.ncbi.nlm.nih.gov/genomes/Cricetulus_griseus/GFF/] > > Creating the txdb seems to go fine: > >> txdb <- makeTranscriptDbFromGFF(file="ref_CriGri_1.0_scaffolds.gff3", format="gff3", dataSource="ftp://ftp.ncbi.nlm.nih.gov/genomes/Cricetul us_griseus/GFF/", species="Cricetulus griseus") > extracting transcript information > Extracting gene IDs > extracting transcript information > Processing splicing information for gff3 file. > Deducing exon rank from relative coordinates provided > Prepare the 'metadata' data frame ... metadata: OK > Now generating chrominfo from available sequence names. No chromosome length information is available. > Warning messages: > 1: In .deduceExonRankings(exs, format = "gff") : > Infering Exon Rankings. If this is not what you expected, then please be sure that you have provided a valid attribute for exonRankAttributeName > 2: In matchCircularity(chroms, circ_seqs) : > None of the strings in your circ_seqs argument match your seqnames. > > # Characterization txdb >> txdb > TranscriptDb object: > | Db type: TranscriptDb > | Supporting package: GenomicFeatures > | Data source: ftp://ftp.ncbi.nlm.nih.gov/genomes/Cricetulus_griseus/GFF/ > | Organism: Cricetulus griseus > | miRBase build ID: NA > | transcript_nrow: 21612 > | exon_nrow: 212163 > | cds_nrow: 183531 > | Db created by: GenomicFeatures package from Bioconductor > | Creation time: 2013-08-15 18:26:33 +0200 (Thu, 15 Aug 2013) > | GenomicFeatures version at creation time: 1.13.29 > | RSQLite version at creation time: 0.11.4 > | DBSCHEMAVERSION: 1.0 >> columns(txdb) > [1] "CDSID" "CDSNAME" "CDSCHROM" "CDSSTRAND" "CDSSTART" "CDSEND" "EXONID" > [8] "EXONNAME" "EXONCHROM" "EXONSTRAND" "EXONSTART" "EXONEND" "GENEID" "TXID" > [15] "EXONRANK" "TXNAME" "TXCHROM" "TXSTRAND" "TXSTART" "TXEND" >> keytypes(txdb) > [1] "GENEID" "TXID" "TXNAME" "EXONID" "EXONNAME" "CDSID" "CDSNAME" > As a test, I would like to retrieve the transcript names for 3 Entrez Gene IDs. Note: I selected these 3 IDs randomly; they are in the top5 -listed IDs in the GFF. > Unfortunately, this doesn't seem to work: >> keys <- c("100754456", "100759368", "100760238") >> select(txdb, keys = keys, columns="TXNAME", keytype="GENEID") > [1] GENEID TXNAME > <0 rows> (or 0-length row.names) > Note: the human example from the vignette worked fine. > > I think this is somehow due to the fact that during the parsing of the GFF the keytypes (?) are not properly recognized/linked. This is also reflected by the plain use of 'rna' and 'gene' as names/IDs: >> GR <- transcripts(txdb, columns=c("tx_id", "tx_name", "gene_id", "cds_id","cds_name")) >> head(GR) > GRanges with 6 ranges and 5 metadata columns: > seqnames ranges strand | tx_id tx_name gene_id cds_id > <rle> <iranges> <rle> | <integer> <character> <characterlist> <integerlist> > [1] NW_003613580.1 [ 736932, 738552] + | 1 rna0 gene2 1,2 > [2] NW_003613580.1 [2073869, 2085513] + | 2 rna4 gene6 3,4,5,... > [3] NW_003613580.1 [2143936, 2241368] + | 3 rna6 gene8 8,9,10,... > [4] NW_003613580.1 [2390123, 2449470] + | 4 rna8 gene10 22,23,24,... > [5] NW_003613580.1 [2390123, 2555467] + | 5 rna7 gene10 22,23,24,... > [6] NW_003613580.1 [3070833, 3147709] + | 6 rna9 gene12 43,44,45,... > cds_name > <characterlist> > [1] NA > [2] NA > [3] NA > [4] NA > [5] NA > [6] NA > --- > seqlengths: > NW_003613580.1 NW_003613581.1 NW_003613582.1 ... NW_003698075.1 NW_003704321.1 NW_003717424.1 > NA NA NA ... NA NA NA >> tail(GR) > GRanges with 6 ranges and 5 metadata columns: > seqnames ranges strand | tx_id tx_name gene_id cds_id > <rle> <iranges> <rle> | <integer> <character> <characterlist> <integerlist> > [1] NW_003694533.1 [29, 205] - | 21607 rna22504 gene25148 183527 > [2] NW_003697941.1 [ 4, 135] - | 21608 rna22505 gene25149 183528 > [3] NW_003698068.1 [ 1, 267] + | 21609 rna22506 gene25150 183529 > [4] NW_003698075.1 [ 1, 267] - | 21610 rna22507 gene25151 NA > [5] NW_003704321.1 [24, 203] - | 21611 rna22508 gene25152 183530 > [6] NW_003717424.1 [10, 189] + | 21612 rna22509 gene25155 183531 > cds_name > <characterlist> > [1] NA > [2] NA > [3] NA > [4] NA > [5] NA > [6] NA > --- > seqlengths: > NW_003613580.1 NW_003613581.1 NW_003613582.1 ... NW_003698075.1 NW_003704321.1 NW_003717424.1 > NA NA NA ... NA NA NA > > > Any suggestions on how to properly generate a txdb from this GFF would be appreciated. > > Thanks, > Guido > >> sessionInfo() > R Under development (unstable) (2013-08-10 r63532) > Platform: x86_64-w64-mingw32/x64 (64-bit) > > locale: > [1] LC_COLLATE=English_United States.1252 LC_CTYPE=English_United States.1252 > [3] LC_MONETARY=English_United States.1252 LC_NUMERIC=C > [5] LC_TIME=English_United States.1252 > > attached base packages: > [1] parallel stats graphics grDevices utils datasets methods base > > other attached packages: > [1] GenomicFeatures_1.13.29 AnnotationDbi_1.23.18 Biobase_2.21.6 GenomicRanges_1.13.36 > [5] XVector_0.1.0 IRanges_1.19.21 BiocGenerics_0.7.3 > > loaded via a namespace (and not attached): > [1] biomaRt_2.17.2 Biostrings_2.29.14 bitops_1.0-5 BSgenome_1.29.1 DBI_0.2-7 > [6] RCurl_1.95-4.1 Rsamtools_1.13.28 RSQLite_0.11.4 rtracklayer_1.21.9 stats4_3.1.0 > [11] tools_3.1.0 XML_3.98-1.1 zlibbioc_1.7.0 > > --------------------------------------------------------- > Guido Hooiveld, PhD > Nutrition, Metabolism & Genomics Group > Division of Human Nutrition > Wageningen University > Biotechnion, Bomenweg 2 > NL-6703 HD Wageningen > the Netherlands > tel: (+)31 317 485788 > fax: (+)31 317 483342 > email: guido.hooiveld at wur.nl > internet: http://nutrigene.4t.com > http://scholar.google.com/citations?user=qFHaMnoAAAAJ > http://www.researcherid.com/rid/F-4912-2010 > > > [[alternative HTML version deleted]] > > _______________________________________________ > Bioconductor mailing list > Bioconductor at r-project.org > https://stat.ethz.ch/mailman/listinfo/bioconductor > Search the archives: http://news.gmane.org/gmane.science.biology.informatics.conductor _______________________________________________ Bioconductor mailing list Bioconductor at r-project.org https://stat.ethz.ch/mailman/listinfo/bioconductor Search the archives: http://news.gmane.org/gmane.science.biology.informatics.conductor

ADD REPLY • link 11.4 years ago Guido Hooiveld ★ 4.1k

0

Entering edit mode

Hi Guido, Sorry for the delay. Your reply seems to have snuck past my radar. Below is the code I ended up using to process this file. I looked more closely at the file you are trying to parse and I noticed that there is something very strange about it. Many different features are labeled as being of type "exon", but only SOME of them actually have the exon ranking information. :( But what is really odd here is that the features which don't have an exon ranking don't seem to be the "same" as the ones that do. I would recommend that you get an explanation for that from the person who generated this file. Anyhow I also made another quick update to GenomicFeatures to allow you to better make use of the other IDs that are in there. So with version 1.13.33 you should be able to do this (for example), and get gene ids that may be closer to what you want. I aim to do a little more in this department, but I have some other things I have to do 1st. library(GenomicFeatures) txdb <- makeTranscriptDbFromGFF(file="ref_CriGri_1.0_scaffolds.gff3", format="gff3", gffGeneIdAttributeName="Name", dataSource="ftp://ftp.ncbi.nlm.nih.gov/genomes/Cricetulus_griseus/GFF" , species="Cricetulus griseus") Marc On 08/16/2013 01:59 AM, Hooiveld, Guido wrote: > Hi Marc, > Thanks for helpful comments! > Keys, keytypes, columns, select,.... still difficult for me... Moreover, it is the first time I am working with these type of files, so I still have to get used to all concepts and do's & don'ts. > Anyway, I tried to regenerate the txdb utilizing the info on exon ranks. Unfortunately, this doesn't work in my hands... Am I missing something obviously, or is something going seriously wrong? Note that the error thrown is slightly different between the production and dev versions of the packages (see below). > > I will also check the functions in rtracklayer. > > Regards, > Guido > > > Dev: >> txdb <- makeTranscriptDbFromGFF(file="ref_CriGri_1.0_scaffolds.gff3", format="gff3", exonRankAttributeName="exon_number", > + dataSource="ftp://ftp.ncbi.nlm.nih.gov/genomes/Cricetulus_griseus/ GFF", species="Cricetulus griseus") > extracting transcript information > Extracting gene IDs > extracting transcript information > Processing splicing information for gff3 file. > Error in if (any(splicings$exon_rank == 0) && !any(splicings$exon_rank < : > missing value where TRUE/FALSE needed >> sessionInfo() > R Under development (unstable) (2013-08-10 r63532) > Platform: x86_64-w64-mingw32/x64 (64-bit) > > locale: > [1] LC_COLLATE=English_United States.1252 LC_CTYPE=English_United States.1252 LC_MONETARY=English_United States.1252 LC_NUMERIC=C > [5] LC_TIME=English_United States.1252 > > attached base packages: > [1] parallel stats graphics grDevices utils datasets methods base > > other attached packages: > [1] GenomicFeatures_1.13.29 AnnotationDbi_1.23.18 Biobase_2.21.6 rtracklayer_1.21.9 GenomicRanges_1.13.36 XVector_0.1.0 IRanges_1.19.24 > [8] BiocGenerics_0.7.4 BiocInstaller_1.11.4 > > loaded via a namespace (and not attached): > [1] biomaRt_2.17.2 Biostrings_2.29.15 bitops_1.0-5 BSgenome_1.29.1 DBI_0.2-7 RCurl_1.95-4.1 Rsamtools_1.13.29 RSQLite_0.11.4 stats4_3.1.0 > [10] tools_3.1.0 XML_3.98-1.1 zlibbioc_1.7.0 > > Production: >> txdb <- makeTranscriptDbFromGFF(file="ref_CriGri_1.0_scaffolds.gff3", format="gff3", exonRankAttributeName="exon_number", > + dataSource="ftp://ftp.ncbi.nlm.nih.gov/genomes/Cricetulus_griseus/ GFF", species="Cricetulus griseus") > extracting transcript information > Extracting gene IDs > extracting transcript information > Processing splicing information for gff3 file. > Prepare the 'metadata' data frame ... metadata: OK > Now generating chrominfo from available sequence names. No chromosome length information is available. > Error in .normargSplicings(splicings, transcripts_tx_id) : > 'splicings$exon_rank' must be an integer vector with no NAs > In addition: Warning message: > In matchCircularity(chroms, circ_seqs) : > None of the strings in your circ_seqs argument match your seqnames. >> sessionInfo() > R version 3.0.1 Patched (2013-06-05 r62877) > Platform: x86_64-w64-mingw32/x64 (64-bit) > > locale: > [1] LC_COLLATE=English_United States.1252 LC_CTYPE=English_United States.1252 LC_MONETARY=English_United States.1252 LC_NUMERIC=C > [5] LC_TIME=English_United States.1252 > > attached base packages: > [1] parallel stats graphics grDevices utils datasets methods base > > other attached packages: > [1] TxDb.Hsapiens.UCSC.hg19.knownGene_2.9.2 GenomicFeatures_1.12.3 AnnotationDbi_1.22.6 Biobase_2.20.1 > [5] GenomicRanges_1.12.4 IRanges_1.18.2 BiocGenerics_0.6.0 > > loaded via a namespace (and not attached): > [1] biomaRt_2.16.0 Biostrings_2.28.0 bitops_1.0-5 BSgenome_1.28.0 DBI_0.2-7 RCurl_1.95-4.1 Rsamtools_1.12.3 RSQLite_0.11.4 > [9] rtracklayer_1.20.4 stats4_3.0.1 tools_3.0.1 XML_3.98-1.1 zlibbioc_1.6.0 > > -----Original Message----- > From: bioconductor-bounces at r-project.org [mailto:bioconductor- bounces at r-project.org] On Behalf Of Marc Carlson > Sent: Thursday, August 15, 2013 22:44 > To: bioconductor at r-project.org > Subject: Re: [BioC] problem generating Chinese hamster txdb from GFF file > > Hi Guido, > > The problem is that your keys are not valid. The keys for GENEID can be retrieved with the keys method like this: > > k <- keys(txdb, keytype="GENEID") > > And if we look at them we can see that they don't look anything like the keys you are trying to extract data for: > > head(k) > [1] "gene10" "gene1000" "gene10000" "gene10001" "gene10002" "gene10003" > > > So if you use keys like that (valid keys), then everything works fine. > For example: > > select(txdb, keys = head(k), columns="TXNAME", keytype="GENEID") > > GENEID TXNAME > 1 gene10 rna8 > 2 gene10 rna7 > 3 gene1000 rna917 > 4 gene10000 rna9527 > 5 gene10001 rna9528 > 6 gene10002 rna9530 > 7 gene10002 rna9529 > 8 gene10003 rna9531 > > > Also I noticed that you are getting the warning about deducing the exon ranks. Are you sure you want to do that with a mammal? This file contains a field called "exon_number" that looks like something you should probably try to use. > > And finally, since you seem to want to use a real gene ID, you will have to extract those from the file separately. This is because whoever created this file decided that they wanted to encode the actual gene symbols here in a peculiar way. So you can do that step separately by using the rtracklayer import() method like this: > > library(rtracklayer) > result <- import("ref_CriGri_1.0_scaffolds.gff3", format="gff3") > > If you look at what is in result, you will see (for example) that gene0 is cross referenced to GeneID:100754456 (for example). > > > Hope this clarifies things, > > > Marc > > > > > On 08/15/2013 10:20 AM, Hooiveld, Guido wrote: >> Hello, >> I would like to make a transcript database for Chinese hamster (Cricetulus griseus) using the function makeTranscriptDbFromGFF. Since it is the first time I use this function, I do this per the code in the vignette. However, I face some problems and would appreciate any pointer. >> >> I started by downloading the GFF (ref_CriGri_1.0_scaffolds.gff3) from NCBI. >> [ftp://ftp.ncbi.nlm.nih.gov/genomes/Cricetulus_griseus/GFF/] >> >> Creating the txdb seems to go fine: >> >>> txdb <- makeTranscriptDbFromGFF(file="ref_CriGri_1.0_scaffolds.gff3", format="gff3", dataSource="ftp://ftp.ncbi.nlm.nih.gov/genomes/Cricetul us_griseus/GFF/", species="Cricetulus griseus") >> extracting transcript information >> Extracting gene IDs >> extracting transcript information >> Processing splicing information for gff3 file. >> Deducing exon rank from relative coordinates provided >> Prepare the 'metadata' data frame ... metadata: OK >> Now generating chrominfo from available sequence names. No chromosome length information is available. >> Warning messages: >> 1: In .deduceExonRankings(exs, format = "gff") : >> Infering Exon Rankings. If this is not what you expected, then please be sure that you have provided a valid attribute for exonRankAttributeName >> 2: In matchCircularity(chroms, circ_seqs) : >> None of the strings in your circ_seqs argument match your seqnames. >> >> # Characterization txdb >>> txdb >> TranscriptDb object: >> | Db type: TranscriptDb >> | Supporting package: GenomicFeatures >> | Data source: ftp://ftp.ncbi.nlm.nih.gov/genomes/Cricetulus_griseus/GFF/ >> | Organism: Cricetulus griseus >> | miRBase build ID: NA >> | transcript_nrow: 21612 >> | exon_nrow: 212163 >> | cds_nrow: 183531 >> | Db created by: GenomicFeatures package from Bioconductor >> | Creation time: 2013-08-15 18:26:33 +0200 (Thu, 15 Aug 2013) >> | GenomicFeatures version at creation time: 1.13.29 >> | RSQLite version at creation time: 0.11.4 >> | DBSCHEMAVERSION: 1.0 >>> columns(txdb) >> [1] "CDSID" "CDSNAME" "CDSCHROM" "CDSSTRAND" "CDSSTART" "CDSEND" "EXONID" >> [8] "EXONNAME" "EXONCHROM" "EXONSTRAND" "EXONSTART" "EXONEND" "GENEID" "TXID" >> [15] "EXONRANK" "TXNAME" "TXCHROM" "TXSTRAND" "TXSTART" "TXEND" >>> keytypes(txdb) >> [1] "GENEID" "TXID" "TXNAME" "EXONID" "EXONNAME" "CDSID" "CDSNAME" >> As a test, I would like to retrieve the transcript names for 3 Entrez Gene IDs. Note: I selected these 3 IDs randomly; they are in the top5 -listed IDs in the GFF. >> Unfortunately, this doesn't seem to work: >>> keys <- c("100754456", "100759368", "100760238") >>> select(txdb, keys = keys, columns="TXNAME", keytype="GENEID") >> [1] GENEID TXNAME >> <0 rows> (or 0-length row.names) >> Note: the human example from the vignette worked fine. >> >> I think this is somehow due to the fact that during the parsing of the GFF the keytypes (?) are not properly recognized/linked. This is also reflected by the plain use of 'rna' and 'gene' as names/IDs: >>> GR <- transcripts(txdb, columns=c("tx_id", "tx_name", "gene_id", "cds_id","cds_name")) >>> head(GR) >> GRanges with 6 ranges and 5 metadata columns: >> seqnames ranges strand | tx_id tx_name gene_id cds_id >> <rle> <iranges> <rle> | <integer> <character> <characterlist> <integerlist> >> [1] NW_003613580.1 [ 736932, 738552] + | 1 rna0 gene2 1,2 >> [2] NW_003613580.1 [2073869, 2085513] + | 2 rna4 gene6 3,4,5,... >> [3] NW_003613580.1 [2143936, 2241368] + | 3 rna6 gene8 8,9,10,... >> [4] NW_003613580.1 [2390123, 2449470] + | 4 rna8 gene10 22,23,24,... >> [5] NW_003613580.1 [2390123, 2555467] + | 5 rna7 gene10 22,23,24,... >> [6] NW_003613580.1 [3070833, 3147709] + | 6 rna9 gene12 43,44,45,... >> cds_name >> <characterlist> >> [1] NA >> [2] NA >> [3] NA >> [4] NA >> [5] NA >> [6] NA >> --- >> seqlengths: >> NW_003613580.1 NW_003613581.1 NW_003613582.1 ... NW_003698075.1 NW_003704321.1 NW_003717424.1 >> NA NA NA ... NA NA NA >>> tail(GR) >> GRanges with 6 ranges and 5 metadata columns: >> seqnames ranges strand | tx_id tx_name gene_id cds_id >> <rle> <iranges> <rle> | <integer> <character> <characterlist> <integerlist> >> [1] NW_003694533.1 [29, 205] - | 21607 rna22504 gene25148 183527 >> [2] NW_003697941.1 [ 4, 135] - | 21608 rna22505 gene25149 183528 >> [3] NW_003698068.1 [ 1, 267] + | 21609 rna22506 gene25150 183529 >> [4] NW_003698075.1 [ 1, 267] - | 21610 rna22507 gene25151 NA >> [5] NW_003704321.1 [24, 203] - | 21611 rna22508 gene25152 183530 >> [6] NW_003717424.1 [10, 189] + | 21612 rna22509 gene25155 183531 >> cds_name >> <characterlist> >> [1] NA >> [2] NA >> [3] NA >> [4] NA >> [5] NA >> [6] NA >> --- >> seqlengths: >> NW_003613580.1 NW_003613581.1 NW_003613582.1 ... NW_003698075.1 NW_003704321.1 NW_003717424.1 >> NA NA NA ... NA NA NA >> >> >> Any suggestions on how to properly generate a txdb from this GFF would be appreciated. >> >> Thanks, >> Guido >> >>> sessionInfo() >> R Under development (unstable) (2013-08-10 r63532) >> Platform: x86_64-w64-mingw32/x64 (64-bit) >> >> locale: >> [1] LC_COLLATE=English_United States.1252 LC_CTYPE=English_United States.1252 >> [3] LC_MONETARY=English_United States.1252 LC_NUMERIC=C >> [5] LC_TIME=English_United States.1252 >> >> attached base packages: >> [1] parallel stats graphics grDevices utils datasets methods base >> >> other attached packages: >> [1] GenomicFeatures_1.13.29 AnnotationDbi_1.23.18 Biobase_2.21.6 GenomicRanges_1.13.36 >> [5] XVector_0.1.0 IRanges_1.19.21 BiocGenerics_0.7.3 >> >> loaded via a namespace (and not attached): >> [1] biomaRt_2.17.2 Biostrings_2.29.14 bitops_1.0-5 BSgenome_1.29.1 DBI_0.2-7 >> [6] RCurl_1.95-4.1 Rsamtools_1.13.28 RSQLite_0.11.4 rtracklayer_1.21.9 stats4_3.1.0 >> [11] tools_3.1.0 XML_3.98-1.1 zlibbioc_1.7.0 >> >> --------------------------------------------------------- >> Guido Hooiveld, PhD >> Nutrition, Metabolism & Genomics Group >> Division of Human Nutrition >> Wageningen University >> Biotechnion, Bomenweg 2 >> NL-6703 HD Wageningen >> the Netherlands >> tel: (+)31 317 485788 >> fax: (+)31 317 483342 >> email: guido.hooiveld at wur.nl >> internet: http://nutrigene.4t.com >> http://scholar.google.com/citations?user=qFHaMnoAAAAJ >> http://www.researcherid.com/rid/F-4912-2010 >> >> >> [[alternative HTML version deleted]] >> >> _______________________________________________ >> Bioconductor mailing list >> Bioconductor at r-project.org >> https://stat.ethz.ch/mailman/listinfo/bioconductor >> Search the archives: http://news.gmane.org/gmane.science.biology.informatics.conductor > _______________________________________________ > Bioconductor mailing list > Bioconductor at r-project.org > https://stat.ethz.ch/mailman/listinfo/bioconductor > Search the archives: http://news.gmane.org/gmane.science.biology.informatics.conductor > >

ADD REPLY • link 11.4 years ago Marc Carlson ★ 7.2k

0

Entering edit mode

Marc, thanks for coming back to this. As said before, I obtained the GFF from the NCBI, and assumed these would comply with all standards. Apparently this is not the case for the CHO GFF, but remarkably the exon ranking is also not present in e.g. the latest GFF for the human genome available at the NCBI... ?? (ftp://ftp.ncbi.nlm.nih.gov/genomes/Homo_sapiens/GFF/). Apparently, more people have noticed the lack of confirmation to the standards, see e.g.: http://blastedbio.blogspot.nl/2011/08/why-are- ncbi-gff3-files-still-broken.html Anyway, I will contact the NCBI regarding this issue, and report back if needed. Thanks again, Guido -----Original Message----- From: Marc Carlson [mailto:mcarlson@fhcrc.org] Sent: Wednesday, August 21, 2013 01:54 To: Hooiveld, Guido Cc: bioconductor at r-project.org Subject: Re: [BioC] problem generating Chinese hamster txdb from GFF file Hi Guido, Sorry for the delay. Your reply seems to have snuck past my radar. Below is the code I ended up using to process this file. I looked more closely at the file you are trying to parse and I noticed that there is something very strange about it. Many different features are labeled as being of type "exon", but only SOME of them actually have the exon ranking information. :( But what is really odd here is that the features which don't have an exon ranking don't seem to be the "same" as the ones that do. I would recommend that you get an explanation for that from the person who generated this file. Anyhow I also made another quick update to GenomicFeatures to allow you to better make use of the other IDs that are in there. So with version 1.13.33 you should be able to do this (for example), and get gene ids that may be closer to what you want. I aim to do a little more in this department, but I have some other things I have to do 1st. library(GenomicFeatures) txdb <- makeTranscriptDbFromGFF(file="ref_CriGri_1.0_scaffolds.gff3", format="gff3", gffGeneIdAttributeName="Name", dataSo urce="ftp://ftp.ncbi.nlm.nih.gov/genomes/Cricetulus_griseus/GFF", species="Cricetulus griseus") Marc On 08/16/2013 01:59 AM, Hooiveld, Guido wrote: > Hi Marc, > Thanks for helpful comments! > Keys, keytypes, columns, select,.... still difficult for me... Moreover, it is the first time I am working with these type of files, so I still have to get used to all concepts and do's & don'ts. > Anyway, I tried to regenerate the txdb utilizing the info on exon ranks. Unfortunately, this doesn't work in my hands... Am I missing something obviously, or is something going seriously wrong? Note that the error thrown is slightly different between the production and dev versions of the packages (see below). > > I will also check the functions in rtracklayer. > > Regards, > Guido > > > Dev: >> txdb <- makeTranscriptDbFromGFF(file="ref_CriGri_1.0_scaffolds.gff3", >> format="gff3", exonRankAttributeName="exon_number", > + dataSource="ftp://ftp.ncbi.nlm.nih.gov/genomes/Cricetulus_griseus/GF > + F", species="Cricetulus griseus") > extracting transcript information > Extracting gene IDs > extracting transcript information > Processing splicing information for gff3 file. > Error in if (any(splicings$exon_rank == 0) && !any(splicings$exon_rank < : > missing value where TRUE/FALSE needed >> sessionInfo() > R Under development (unstable) (2013-08-10 r63532) > Platform: x86_64-w64-mingw32/x64 (64-bit) > > locale: > [1] LC_COLLATE=English_United States.1252 LC_CTYPE=English_United States.1252 LC_MONETARY=English_United States.1252 LC_NUMERIC=C > [5] LC_TIME=English_United States.1252 > > attached base packages: > [1] parallel stats graphics grDevices utils datasets methods base > > other attached packages: > [1] GenomicFeatures_1.13.29 AnnotationDbi_1.23.18 Biobase_2.21.6 rtracklayer_1.21.9 GenomicRanges_1.13.36 XVector_0.1.0 IRanges_1.19.24 > [8] BiocGenerics_0.7.4 BiocInstaller_1.11.4 > > loaded via a namespace (and not attached): > [1] biomaRt_2.17.2 Biostrings_2.29.15 bitops_1.0-5 BSgenome_1.29.1 DBI_0.2-7 RCurl_1.95-4.1 Rsamtools_1.13.29 RSQLite_0.11.4 stats4_3.1.0 > [10] tools_3.1.0 XML_3.98-1.1 zlibbioc_1.7.0 > > Production: >> txdb <- makeTranscriptDbFromGFF(file="ref_CriGri_1.0_scaffolds.gff3", >> format="gff3", exonRankAttributeName="exon_number", > + dataSource="ftp://ftp.ncbi.nlm.nih.gov/genomes/Cricetulus_griseus/GF > + F", species="Cricetulus griseus") > extracting transcript information > Extracting gene IDs > extracting transcript information > Processing splicing information for gff3 file. > Prepare the 'metadata' data frame ... metadata: OK Now generating > chrominfo from available sequence names. No chromosome length information is available. > Error in .normargSplicings(splicings, transcripts_tx_id) : > 'splicings$exon_rank' must be an integer vector with no NAs In > addition: Warning message: > In matchCircularity(chroms, circ_seqs) : > None of the strings in your circ_seqs argument match your seqnames. >> sessionInfo() > R version 3.0.1 Patched (2013-06-05 r62877) > Platform: x86_64-w64-mingw32/x64 (64-bit) > > locale: > [1] LC_COLLATE=English_United States.1252 LC_CTYPE=English_United States.1252 LC_MONETARY=English_United States.1252 LC_NUMERIC=C > [5] LC_TIME=English_United States.1252 > > attached base packages: > [1] parallel stats graphics grDevices utils datasets methods base > > other attached packages: > [1] TxDb.Hsapiens.UCSC.hg19.knownGene_2.9.2 GenomicFeatures_1.12.3 AnnotationDbi_1.22.6 Biobase_2.20.1 > [5] GenomicRanges_1.12.4 IRanges_1.18.2 BiocGenerics_0.6.0 > > loaded via a namespace (and not attached): > [1] biomaRt_2.16.0 Biostrings_2.28.0 bitops_1.0-5 BSgenome_1.28.0 DBI_0.2-7 RCurl_1.95-4.1 Rsamtools_1.12.3 RSQLite_0.11.4 > [9] rtracklayer_1.20.4 stats4_3.0.1 tools_3.0.1 XML_3.98-1.1 zlibbioc_1.6.0 > > -----Original Message----- > From: bioconductor-bounces at r-project.org > [mailto:bioconductor-bounces at r-project.org] On Behalf Of Marc Carlson > Sent: Thursday, August 15, 2013 22:44 > To: bioconductor at r-project.org > Subject: Re: [BioC] problem generating Chinese hamster txdb from GFF > file > > Hi Guido, > > The problem is that your keys are not valid. The keys for GENEID can be retrieved with the keys method like this: > > k <- keys(txdb, keytype="GENEID") > > And if we look at them we can see that they don't look anything like the keys you are trying to extract data for: > > head(k) > [1] "gene10" "gene1000" "gene10000" "gene10001" "gene10002" "gene10003" > > > So if you use keys like that (valid keys), then everything works fine. > For example: > > select(txdb, keys = head(k), columns="TXNAME", keytype="GENEID") > > GENEID TXNAME > 1 gene10 rna8 > 2 gene10 rna7 > 3 gene1000 rna917 > 4 gene10000 rna9527 > 5 gene10001 rna9528 > 6 gene10002 rna9530 > 7 gene10002 rna9529 > 8 gene10003 rna9531 > > > Also I noticed that you are getting the warning about deducing the exon ranks. Are you sure you want to do that with a mammal? This file contains a field called "exon_number" that looks like something you should probably try to use. > > And finally, since you seem to want to use a real gene ID, you will have to extract those from the file separately. This is because whoever created this file decided that they wanted to encode the actual gene symbols here in a peculiar way. So you can do that step separately by using the rtracklayer import() method like this: > > library(rtracklayer) > result <- import("ref_CriGri_1.0_scaffolds.gff3", format="gff3") > > If you look at what is in result, you will see (for example) that gene0 is cross referenced to GeneID:100754456 (for example). > > > Hope this clarifies things, > > > Marc > > > > > On 08/15/2013 10:20 AM, Hooiveld, Guido wrote: >> Hello, >> I would like to make a transcript database for Chinese hamster (Cricetulus griseus) using the function makeTranscriptDbFromGFF. Since it is the first time I use this function, I do this per the code in the vignette. However, I face some problems and would appreciate any pointer. >> >> I started by downloading the GFF (ref_CriGri_1.0_scaffolds.gff3) from NCBI. >> [ftp://ftp.ncbi.nlm.nih.gov/genomes/Cricetulus_griseus/GFF/] >> >> Creating the txdb seems to go fine: >> >>> txdb <- >>> makeTranscriptDbFromGFF(file="ref_CriGri_1.0_scaffolds.gff3", >>> format="gff3", >>> dataSource="ftp://ftp.ncbi.nlm.nih.gov/genomes/Cricetulus_griseus/GF >>> F/", species="Cricetulus griseus") >> extracting transcript information >> Extracting gene IDs >> extracting transcript information >> Processing splicing information for gff3 file. >> Deducing exon rank from relative coordinates provided Prepare the >> 'metadata' data frame ... metadata: OK Now generating chrominfo from >> available sequence names. No chromosome length information is available. >> Warning messages: >> 1: In .deduceExonRankings(exs, format = "gff") : >> Infering Exon Rankings. If this is not what you expected, then >> please be sure that you have provided a valid attribute for >> exonRankAttributeName >> 2: In matchCircularity(chroms, circ_seqs) : >> None of the strings in your circ_seqs argument match your seqnames. >> >> # Characterization txdb >>> txdb >> TranscriptDb object: >> | Db type: TranscriptDb >> | Supporting package: GenomicFeatures Data source: >> | ftp://ftp.ncbi.nlm.nih.gov/genomes/Cricetulus_griseus/GFF/ >> | Organism: Cricetulus griseus >> | miRBase build ID: NA >> | transcript_nrow: 21612 >> | exon_nrow: 212163 >> | cds_nrow: 183531 >> | Db created by: GenomicFeatures package from Bioconductor Creation >> | time: 2013-08-15 18:26:33 +0200 (Thu, 15 Aug 2013) GenomicFeatures >> | version at creation time: 1.13.29 RSQLite version at creation time: >> | 0.11.4 >> | DBSCHEMAVERSION: 1.0 >>> columns(txdb) >> [1] "CDSID" "CDSNAME" "CDSCHROM" "CDSSTRAND" "CDSSTART" "CDSEND" "EXONID" >> [8] "EXONNAME" "EXONCHROM" "EXONSTRAND" "EXONSTART" "EXONEND" "GENEID" "TXID" >> [15] "EXONRANK" "TXNAME" "TXCHROM" "TXSTRAND" "TXSTART" "TXEND" >>> keytypes(txdb) >> [1] "GENEID" "TXID" "TXNAME" "EXONID" "EXONNAME" "CDSID" "CDSNAME" >> As a test, I would like to retrieve the transcript names for 3 Entrez Gene IDs. Note: I selected these 3 IDs randomly; they are in the top5 -listed IDs in the GFF. >> Unfortunately, this doesn't seem to work: >>> keys <- c("100754456", "100759368", "100760238") select(txdb, keys = >>> keys, columns="TXNAME", keytype="GENEID") >> [1] GENEID TXNAME >> <0 rows> (or 0-length row.names) >> Note: the human example from the vignette worked fine. >> >> I think this is somehow due to the fact that during the parsing of the GFF the keytypes (?) are not properly recognized/linked. This is also reflected by the plain use of 'rna' and 'gene' as names/IDs: >>> GR <- transcripts(txdb, columns=c("tx_id", "tx_name", "gene_id", >>> "cds_id","cds_name")) >>> head(GR) >> GRanges with 6 ranges and 5 metadata columns: >> seqnames ranges strand | tx_id tx_name gene_id cds_id >> <rle> <iranges> <rle> | <integer> <character> <characterlist> <integerlist> >> [1] NW_003613580.1 [ 736932, 738552] + | 1 rna0 gene2 1,2 >> [2] NW_003613580.1 [2073869, 2085513] + | 2 rna4 gene6 3,4,5,... >> [3] NW_003613580.1 [2143936, 2241368] + | 3 rna6 gene8 8,9,10,... >> [4] NW_003613580.1 [2390123, 2449470] + | 4 rna8 gene10 22,23,24,... >> [5] NW_003613580.1 [2390123, 2555467] + | 5 rna7 gene10 22,23,24,... >> [6] NW_003613580.1 [3070833, 3147709] + | 6 rna9 gene12 43,44,45,... >> cds_name >> <characterlist> >> [1] NA >> [2] NA >> [3] NA >> [4] NA >> [5] NA >> [6] NA >> --- >> seqlengths: >> NW_003613580.1 NW_003613581.1 NW_003613582.1 ... NW_003698075.1 NW_003704321.1 NW_003717424.1 >> NA NA NA ... NA NA NA >>> tail(GR) >> GRanges with 6 ranges and 5 metadata columns: >> seqnames ranges strand | tx_id tx_name gene_id cds_id >> <rle> <iranges> <rle> | <integer> <character> <characterlist> <integerlist> >> [1] NW_003694533.1 [29, 205] - | 21607 rna22504 gene25148 183527 >> [2] NW_003697941.1 [ 4, 135] - | 21608 rna22505 gene25149 183528 >> [3] NW_003698068.1 [ 1, 267] + | 21609 rna22506 gene25150 183529 >> [4] NW_003698075.1 [ 1, 267] - | 21610 rna22507 gene25151 NA >> [5] NW_003704321.1 [24, 203] - | 21611 rna22508 gene25152 183530 >> [6] NW_003717424.1 [10, 189] + | 21612 rna22509 gene25155 183531 >> cds_name >> <characterlist> >> [1] NA >> [2] NA >> [3] NA >> [4] NA >> [5] NA >> [6] NA >> --- >> seqlengths: >> NW_003613580.1 NW_003613581.1 NW_003613582.1 ... NW_003698075.1 NW_003704321.1 NW_003717424.1 >> NA NA NA ... NA NA NA >> >> >> Any suggestions on how to properly generate a txdb from this GFF would be appreciated. >> >> Thanks, >> Guido >> >>> sessionInfo() >> R Under development (unstable) (2013-08-10 r63532) >> Platform: x86_64-w64-mingw32/x64 (64-bit) >> >> locale: >> [1] LC_COLLATE=English_United States.1252 LC_CTYPE=English_United >> States.1252 [3] LC_MONETARY=English_United States.1252 LC_NUMERIC=C >> [5] LC_TIME=English_United States.1252 >> >> attached base packages: >> [1] parallel stats graphics grDevices utils datasets methods base >> >> other attached packages: >> [1] GenomicFeatures_1.13.29 AnnotationDbi_1.23.18 Biobase_2.21.6 GenomicRanges_1.13.36 >> [5] XVector_0.1.0 IRanges_1.19.21 BiocGenerics_0.7.3 >> >> loaded via a namespace (and not attached): >> [1] biomaRt_2.17.2 Biostrings_2.29.14 bitops_1.0-5 BSgenome_1.29.1 DBI_0.2-7 >> [6] RCurl_1.95-4.1 Rsamtools_1.13.28 RSQLite_0.11.4 rtracklayer_1.21.9 stats4_3.1.0 >> [11] tools_3.1.0 XML_3.98-1.1 zlibbioc_1.7.0 >> >> --------------------------------------------------------- >> Guido Hooiveld, PhD >> Nutrition, Metabolism & Genomics Group Division of Human Nutrition >> Wageningen University Biotechnion, Bomenweg 2 >> NL-6703 HD Wageningen >> the Netherlands >> tel: (+)31 317 485788 >> fax: (+)31 317 483342 >> email: guido.hooiveld at wur.nl >> internet: http://nutrigene.4t.com >> http://scholar.google.com/citations?user=qFHaMnoAAAAJ >> http://www.researcherid.com/rid/F-4912-2010 >> >> >> [[alternative HTML version deleted]] >> >> _______________________________________________ >> Bioconductor mailing list >> Bioconductor at r-project.org >> https://stat.ethz.ch/mailman/listinfo/bioconductor >> Search the archives: >> http://news.gmane.org/gmane.science.biology.informatics.conductor > _______________________________________________ > Bioconductor mailing list > Bioconductor at r-project.org > https://stat.ethz.ch/mailman/listinfo/bioconductor > Search the archives: > http://news.gmane.org/gmane.science.biology.informatics.conductor > >

ADD REPLY • link 11.4 years ago Guido Hooiveld ★ 4.1k

Login before adding your answer.