Genomic Features - makeTranscriptDbFromGFF()
2
0
Entering edit mode
Guest User ★ 13k
@guest-user-4897
Last seen 10.3 years ago
I am trying to use derfinder but it requires I have my genome features in the TranscriptDB from the GenomicFeatures R package. Normally one can use the inbuilt function to make a TranscriptDB from UNSC database however my organism, a plant, is not included in the UNSC database. Consequently I am trying to use the makeTranscriptDbFromGFF() function to make the TranscriptDB using a gtf file I created using tophat and cuffmerge on some RNA-Seq samples. However following the examples I can't get it to work. Here is what I've got: chrominfo <- data.frame(chrom = c("chr1", "chr2", "chr3", "chr4", "chr5", "chr6", "chr7", "chr8"), length= c(52991155, 45729672, 55515152, 56582383, 43630510, 35275713, 49172423,45569985), is_circular= rep(FALSE, 8)) exons <- makeTranscriptDbFromGFF(file = "~/merged.gtf", format = "gtf", exonRankAttributeName="exon_number", chrominfo=chrominfo) Unfortunately this is the output: extracting transcript information Estimating transcript ranges. Extracting gene IDs Processing splicing information for gtf file. Prepare the 'metadata' data frame ... metadata: OK Error in .normargTranscripts(transcripts) : values in 'transcripts$tx_strand' must be "+" or "-" In addition: Warning message: In if is.na(chrominfo)) { : the condition has length > 1 and only the first element will be used What am I missing?? I also posted this question on biostar. This is the top of my gtf file: chr1 Cufflinks exon 6524 6620 . + . gene_id "XLOC_000001"; transcript_id "TCONS_00000001"; exon_number "1"; gene_name "Medtr1g004940"; oId "Medtr1g004940.1"; nearest_ref "Medtr1g004940.1"; class_code "="; tss_id "TSS1"; p_id "P1"; chr1 Cufflinks exon 7098 7366 . + . gene_id "XLOC_000001"; transcript_id "TCONS_00000001"; exon_number "2"; gene_name "Medtr1g004940"; oId "Medtr1g004940.1"; nearest_ref "Medtr1g004940.1"; class_code "="; tss_id "TSS1"; p_id "P1"; chr1 Cufflinks exon 14514 14556 . + . gene_id "XLOC_000002"; transcript_id "TCONS_00000002"; exon_number "1"; gene_name "Medtr1g004950"; oId "Medtr1g004950.1"; nearest_ref "Medtr1g004950.1"; class_code "="; tss_id "TSS2"; p_id "P2"; chr1 Cufflinks exon 15503 15729 . + . gene_id "XLOC_000002"; transcript_id "TCONS_00000002"; exon_number "2"; gene_name "Medtr1g004950"; oId "Medtr1g004950.1"; nearest_ref "Medtr1g004950.1"; class_code "="; tss_id "TSS2"; p_id "P2"; chr1 Cufflinks exon 16283 16326 . + . gene_id "XLOC_000003"; transcript_id "TCONS_00000003"; exon_number "1"; gene_name "Medtr1g004960"; oId "Medtr1g004960.1"; nearest_ref "Medtr1g004960.1"; class_code "="; tss_id "TSS3"; p_id "P3"; chr1 Cufflinks exon 17061 17304 . + . gene_id "XLOC_000003"; transcript_id "TCONS_00000003"; exon_number "2"; gene_name "Medtr1g004960"; oId "Medtr1g004960.1"; nearest_ref "Medtr1g004960.1"; class_code "="; tss_id "TSS3"; p_id "P3"; chr1 Cufflinks exon 18242 18382 . + . gene_id "XLOC_000003"; transcript_id "TCONS_00000003"; exon_number "3"; gene_name "Medtr1g004960"; oId "Medtr1g004960.1"; nearest_ref "Medtr1g004960.1"; class_code "="; tss_id "TSS3"; p_id "P3"; -- output of sessionInfo(): R version 3.1.0 (2014-04-10) Platform: i686-pc-linux-gnu (32-bit) locale: [1] LC_CTYPE=en_NZ.UTF-8 LC_NUMERIC=C LC_TIME=en_NZ.UTF-8 [4] LC_COLLATE=en_NZ.UTF-8 LC_MONETARY=en_NZ.UTF-8 LC_MESSAGES=en_NZ.UTF-8 [7] LC_PAPER=en_NZ.UTF-8 LC_NAME=C LC_ADDRESS=C [10] LC_TELEPHONE=C LC_MEASUREMENT=en_NZ.UTF-8 LC_IDENTIFICATION=C attached base packages: [1] splines grid parallel stats graphics grDevices utils datasets [9] methods base other attached packages: [1] rtracklayer_1.22.7 GenomicFeatures_1.14.5 AnnotationDbi_1.24.0 [4] Biobase_2.22.0 GenomicRanges_1.14.4 XVector_0.2.0 [7] derfinder_1.0.2 locfdr_1.1-7 HiddenMarkov_1.7-0 [10] limma_3.18.13 Genominator_1.16.0 GenomeGraphs_1.22.0 [13] biomaRt_2.18.0 IRanges_1.20.7 BiocGenerics_0.8.0 [16] RSQLite_0.11.4 DBI_0.2-7 loaded via a namespace (and not attached): [1] Biostrings_2.30.1 bitops_1.0-6 BSgenome_1.30.0 RCurl_1.95-4.1 Rsamtools_1.14.3 [6] stats4_3.1.0 tools_3.1.0 XML_3.98-1.1 zlibbioc_1.8.0 -- Sent via the guest posting facility at bioconductor.org.
Organism TranscriptDb GenomicFeatures Organism TranscriptDb GenomicFeatures • 2.5k views
ADD COMMENT
0
Entering edit mode
@michael-lawrence-3846
Last seen 3.1 years ago
United States
Not an answer, but shouldn't the API expect a Seqinfo object instead of an arbitrary data.frame for chrominfo? Is this legacy? Michael On Mon, Jun 16, 2014 at 2:30 AM, Geoffrey Thomson [guest] < guest@bioconductor.org> wrote: > I am trying to use derfinder but it requires I have my genome features in > the TranscriptDB from the GenomicFeatures R package. > > Normally one can use the inbuilt function to make a TranscriptDB from UNSC > database however my organism, a plant, is not included in the UNSC database. > > Consequently I am trying to use the makeTranscriptDbFromGFF() function to > make the TranscriptDB using a gtf file I created using tophat and cuffmerge > on some RNA-Seq samples. However following the examples I can't get it to > work. > > Here is what I've got: > > chrominfo <- data.frame(chrom = c("chr1", "chr2", "chr3", "chr4", "chr5", > "chr6", "chr7", "chr8"), > length= c(52991155, 45729672, 55515152, 56582383, > 43630510, 35275713, 49172423,45569985), > is_circular= rep(FALSE, 8)) > > exons <- makeTranscriptDbFromGFF(file = "~/merged.gtf", > format = "gtf", > exonRankAttributeName="exon_number", > chrominfo=chrominfo) > > Unfortunately this is the output: > > extracting transcript information > Estimating transcript ranges. > Extracting gene IDs > Processing splicing information for gtf file. > Prepare the 'metadata' data frame ... metadata: OK > Error in .normargTranscripts(transcripts) : > values in 'transcripts$tx_strand' must be "+" or "-" > In addition: Warning message: > In if is.na(chrominfo)) { : > the condition has length > 1 and only the first element will be used > > What am I missing?? I also posted this question on biostar. > > This is the top of my gtf file: > > chr1 Cufflinks exon 6524 6620 . + . gene_id > "XLOC_000001"; transcript_id "TCONS_00000001"; exon_number "1"; gene_name > "Medtr1g004940"; oId "Medtr1g004940.1"; nearest_ref "Medtr1g004940.1"; > class_code "="; tss_id "TSS1"; p_id "P1"; > chr1 Cufflinks exon 7098 7366 . + . gene_id > "XLOC_000001"; transcript_id "TCONS_00000001"; exon_number "2"; gene_name > "Medtr1g004940"; oId "Medtr1g004940.1"; nearest_ref "Medtr1g004940.1"; > class_code "="; tss_id "TSS1"; p_id "P1"; > chr1 Cufflinks exon 14514 14556 . + . gene_id > "XLOC_000002"; transcript_id "TCONS_00000002"; exon_number "1"; gene_name > "Medtr1g004950"; oId "Medtr1g004950.1"; nearest_ref "Medtr1g004950.1"; > class_code "="; tss_id "TSS2"; p_id "P2"; > chr1 Cufflinks exon 15503 15729 . + . gene_id > "XLOC_000002"; transcript_id "TCONS_00000002"; exon_number "2"; gene_name > "Medtr1g004950"; oId "Medtr1g004950.1"; nearest_ref "Medtr1g004950.1"; > class_code "="; tss_id "TSS2"; p_id "P2"; > chr1 Cufflinks exon 16283 16326 . + . gene_id > "XLOC_000003"; transcript_id "TCONS_00000003"; exon_number "1"; gene_name > "Medtr1g004960"; oId "Medtr1g004960.1"; nearest_ref "Medtr1g004960.1"; > class_code "="; tss_id "TSS3"; p_id "P3"; > chr1 Cufflinks exon 17061 17304 . + . gene_id > "XLOC_000003"; transcript_id "TCONS_00000003"; exon_number "2"; gene_name > "Medtr1g004960"; oId "Medtr1g004960.1"; nearest_ref "Medtr1g004960.1"; > class_code "="; tss_id "TSS3"; p_id "P3"; > chr1 Cufflinks exon 18242 18382 . + . gene_id > "XLOC_000003"; transcript_id "TCONS_00000003"; exon_number "3"; gene_name > "Medtr1g004960"; oId "Medtr1g004960.1"; nearest_ref "Medtr1g004960.1"; > class_code "="; tss_id "TSS3"; p_id "P3"; > > > > -- output of sessionInfo(): > > R version 3.1.0 (2014-04-10) > Platform: i686-pc-linux-gnu (32-bit) > > locale: > [1] LC_CTYPE=en_NZ.UTF-8 LC_NUMERIC=C > LC_TIME=en_NZ.UTF-8 > [4] LC_COLLATE=en_NZ.UTF-8 LC_MONETARY=en_NZ.UTF-8 > LC_MESSAGES=en_NZ.UTF-8 > [7] LC_PAPER=en_NZ.UTF-8 LC_NAME=C LC_ADDRESS=C > [10] LC_TELEPHONE=C LC_MEASUREMENT=en_NZ.UTF-8 > LC_IDENTIFICATION=C > > attached base packages: > [1] splines grid parallel stats graphics grDevices utils > datasets > [9] methods base > > other attached packages: > [1] rtracklayer_1.22.7 GenomicFeatures_1.14.5 AnnotationDbi_1.24.0 > [4] Biobase_2.22.0 GenomicRanges_1.14.4 XVector_0.2.0 > [7] derfinder_1.0.2 locfdr_1.1-7 HiddenMarkov_1.7-0 > [10] limma_3.18.13 Genominator_1.16.0 GenomeGraphs_1.22.0 > [13] biomaRt_2.18.0 IRanges_1.20.7 BiocGenerics_0.8.0 > [16] RSQLite_0.11.4 DBI_0.2-7 > > loaded via a namespace (and not attached): > [1] Biostrings_2.30.1 bitops_1.0-6 BSgenome_1.30.0 RCurl_1.95-4.1 > Rsamtools_1.14.3 > [6] stats4_3.1.0 tools_3.1.0 XML_3.98-1.1 zlibbioc_1.8.0 > > -- > Sent via the guest posting facility at bioconductor.org. > > _______________________________________________ > Bioconductor mailing list > Bioconductor@r-project.org > https://stat.ethz.ch/mailman/listinfo/bioconductor > Search the archives: > http://news.gmane.org/gmane.science.biology.informatics.conductor > [[alternative HTML version deleted]]
ADD COMMENT
0
Entering edit mode
Marc Carlson ★ 7.2k
@marc-carlson-2264
Last seen 8.4 years ago
United States
Hi, I can't really reproduce this from the data you have given me, because it is being caused by something that must happen later in your file (missing or mis-labeled strand information). The function you are calling has the job of trying to turn your gtf file into a TranscriptDb object. And the error you report indicates that your source file must be missing the strand information for some of the transcripts (further down in the part of the file that you didn't share with us). You can see what is expected for GTF files here: http://uswest.ensembl.org/info/website/upload/gff.html I have modified the function in the devel branch to allow it to filter out bad rows like this and issue a warning in an effort to make this more convenient in the future. If you don't want to use the devel version of the GenomicFeatures package to do this for you, you could also just remove the offending lines from your file (anything where the 7th field is not a "+" or a "-"). I hope this helps, Marc On 06/16/2014 02:30 AM, Maintainer wrote: > I am trying to use derfinder but it requires I have my genome features in the TranscriptDB from the GenomicFeatures R package. > > Normally one can use the inbuilt function to make a TranscriptDB from UNSC database however my organism, a plant, is not included in the UNSC database. > > Consequently I am trying to use the makeTranscriptDbFromGFF() function to make the TranscriptDB using a gtf file I created using tophat and cuffmerge on some RNA-Seq samples. However following the examples I can't get it to work. > > Here is what I've got: > > chrominfo <- data.frame(chrom = c("chr1", "chr2", "chr3", "chr4", "chr5", "chr6", "chr7", "chr8"), > length= c(52991155, 45729672, 55515152, 56582383, 43630510, 35275713, 49172423,45569985), > is_circular= rep(FALSE, 8)) > > exons <- makeTranscriptDbFromGFF(file = "~/merged.gtf", > format = "gtf", > exonRankAttributeName="exon_number", > chrominfo=chrominfo) > > Unfortunately this is the output: > > extracting transcript information > Estimating transcript ranges. > Extracting gene IDs > Processing splicing information for gtf file. > Prepare the 'metadata' data frame ... metadata: OK > Error in .normargTranscripts(transcripts) : > values in 'transcripts$tx_strand' must be "+" or "-" > In addition: Warning message: > In if is.na(chrominfo)) { : > the condition has length > 1 and only the first element will be used > > What am I missing?? I also posted this question on biostar. > > This is the top of my gtf file: > > chr1 Cufflinks exon 6524 6620 . + . gene_id "XLOC_000001"; transcript_id "TCONS_00000001"; exon_number "1"; gene_name "Medtr1g004940"; oId "Medtr1g004940.1"; nearest_ref "Medtr1g004940.1"; class_code "="; tss_id "TSS1"; p_id "P1"; > chr1 Cufflinks exon 7098 7366 . + . gene_id "XLOC_000001"; transcript_id "TCONS_00000001"; exon_number "2"; gene_name "Medtr1g004940"; oId "Medtr1g004940.1"; nearest_ref "Medtr1g004940.1"; class_code "="; tss_id "TSS1"; p_id "P1"; > chr1 Cufflinks exon 14514 14556 . + . gene_id "XLOC_000002"; transcript_id "TCONS_00000002"; exon_number "1"; gene_name "Medtr1g004950"; oId "Medtr1g004950.1"; nearest_ref "Medtr1g004950.1"; class_code "="; tss_id "TSS2"; p_id "P2"; > chr1 Cufflinks exon 15503 15729 . + . gene_id "XLOC_000002"; transcript_id "TCONS_00000002"; exon_number "2"; gene_name "Medtr1g004950"; oId "Medtr1g004950.1"; nearest_ref "Medtr1g004950.1"; class_code "="; tss_id "TSS2"; p_id "P2"; > chr1 Cufflinks exon 16283 16326 . + . gene_id "XLOC_000003"; transcript_id "TCONS_00000003"; exon_number "1"; gene_name "Medtr1g004960"; oId "Medtr1g004960.1"; nearest_ref "Medtr1g004960.1"; class_code "="; tss_id "TSS3"; p_id "P3"; > chr1 Cufflinks exon 17061 17304 . + . gene_id "XLOC_000003"; transcript_id "TCONS_00000003"; exon_number "2"; gene_name "Medtr1g004960"; oId "Medtr1g004960.1"; nearest_ref "Medtr1g004960.1"; class_code "="; tss_id "TSS3"; p_id "P3"; > chr1 Cufflinks exon 18242 18382 . + . gene_id "XLOC_000003"; transcript_id "TCONS_00000003"; exon_number "3"; gene_name "Medtr1g004960"; oId "Medtr1g004960.1"; nearest_ref "Medtr1g004960.1"; class_code "="; tss_id "TSS3"; p_id "P3"; > > > > -- output of sessionInfo(): > > R version 3.1.0 (2014-04-10) > Platform: i686-pc-linux-gnu (32-bit) > > locale: > [1] LC_CTYPE=en_NZ.UTF-8 LC_NUMERIC=C LC_TIME=en_NZ.UTF-8 > [4] LC_COLLATE=en_NZ.UTF-8 LC_MONETARY=en_NZ.UTF-8 LC_MESSAGES=en_NZ.UTF-8 > [7] LC_PAPER=en_NZ.UTF-8 LC_NAME=C LC_ADDRESS=C > [10] LC_TELEPHONE=C LC_MEASUREMENT=en_NZ.UTF-8 LC_IDENTIFICATION=C > > attached base packages: > [1] splines grid parallel stats graphics grDevices utils datasets > [9] methods base > > other attached packages: > [1] rtracklayer_1.22.7 GenomicFeatures_1.14.5 AnnotationDbi_1.24.0 > [4] Biobase_2.22.0 GenomicRanges_1.14.4 XVector_0.2.0 > [7] derfinder_1.0.2 locfdr_1.1-7 HiddenMarkov_1.7-0 > [10] limma_3.18.13 Genominator_1.16.0 GenomeGraphs_1.22.0 > [13] biomaRt_2.18.0 IRanges_1.20.7 BiocGenerics_0.8.0 > [16] RSQLite_0.11.4 DBI_0.2-7 > > loaded via a namespace (and not attached): > [1] Biostrings_2.30.1 bitops_1.0-6 BSgenome_1.30.0 RCurl_1.95-4.1 Rsamtools_1.14.3 > [6] stats4_3.1.0 tools_3.1.0 XML_3.98-1.1 zlibbioc_1.8.0 > > -- > Sent via the guest posting facility at bioconductor.org. > > ____________________________________________________________________ ____ > devteam-bioc mailing list > To unsubscribe from this mailing list send a blank email to > devteam-bioc-leave at lists.fhcrc.org > You can also unsubscribe or change your personal options at > https://lists.fhcrc.org/mailman/listinfo/devteam-bioc
ADD COMMENT

Login before adding your answer.

Traffic: 509 users visited in the last hour
Help About
FAQ
Access RSS
API
Stats

Use of this site constitutes acceptance of our User Agreement and Privacy Policy.

Powered by the version 2.3.6