Search
Question: GenomicFeatures: Problem with makeTranscriptDbFromGFF
0
gravatar for Katja Hebestreit
4.2 years ago by
United States
Katja Hebestreit110 wrote:
Hello, I get an error when I try to import my gff file: txdb <- makeTranscriptDbFromGFF(file="file.gtf", format="gtf") Error in .parse_attrCol(attrCol, file, colnames) : Some attributes do not conform to 'tag value' format This is how my file looks like: chr1 mm9_refFlat stop_codon 3206103 3206105 0.000000 - . gene_id "Xkr4"; transcript_id "Xkr4"; chr1 mm9_refFlat CDS 3206106 3207049 0.000000 - 2 gene_id "Xkr4"; transcript_id "Xkr4"; chr1 mm9_refFlat exon 3204563 3207049 0.000000 - . gene_id "Xkr4"; transcript_id "Xkr4"; chr1 mm9_refFlat CDS 3411783 3411982 0.000000 - 1 gene_id "Xkr4"; transcript_id "Xkr4"; chr1 mm9_refFlat exon 3411783 3411982 0.000000 - . gene_id "Xkr4"; transcript_id "Xkr4"; chr1 mm9_refFlat CDS 3660633 3661429 0.000000 - 0 gene_id "Xkr4"; transcript_id "Xkr4"; I have the feeling that this has something to do with the missing exon rank information in my file. Is that true? Is there a way to import this file? All I want to do is to determine the gene lengths. Could anyone help? That would be awesome! Cheers, Katja sessionInfo() R version 3.1.0 (2014-04-10) Platform: x86_64-unknown-linux-gnu (64-bit) locale: [1] LC_CTYPE=de_DE.UTF-8 LC_NUMERIC=C [3] LC_TIME=de_DE.UTF-8 LC_COLLATE=de_DE.UTF-8 [5] LC_MONETARY=de_DE.UTF-8 LC_MESSAGES=de_DE.UTF-8 [7] LC_PAPER=de_DE.UTF-8 LC_NAME=C [9] LC_ADDRESS=C LC_TELEPHONE=C [11] LC_MEASUREMENT=de_DE.UTF-8 LC_IDENTIFICATION=C attached base packages: [1] parallel stats graphics grDevices utils datasets methods [8] base other attached packages: [1] GenomicFeatures_1.16.0 AnnotationDbi_1.25.19 Biobase_2.23.6 [4] GenomicRanges_1.16.0 GenomeInfoDb_0.99.32 IRanges_1.21.45 [7] BiocGenerics_0.9.3 BiocInstaller_1.14.1 loaded via a namespace (and not attached): [1] BatchJobs_1.2 BBmisc_1.5 BiocParallel_0.6.0 [4] biomaRt_2.20.0 Biostrings_2.32.0 bitops_1.0-6 [7] brew_1.0-6 BSgenome_1.32.0 codetools_0.2-8 [10] DBI_0.2-7 digest_0.6.4 fail_1.2 [13] foreach_1.4.2 GenomicAlignments_1.0.0 iterators_1.0.7 [16] plyr_1.8.1 Rcpp_0.11.1 RCurl_1.95-4.1 [19] Rsamtools_1.16.0 RSQLite_0.11.4 rtracklayer_1.24.0 [22] sendmailR_1.1-2 stats4_3.1.0 stringr_0.6.2 [25] tools_3.1.0 XML_3.98-1.1 XVector_0.4.0 [28] zlibbioc_1.10.0
ADD COMMENTlink modified 4.2 years ago by Martin Morgan ♦♦ 21k • written 4.2 years ago by Katja Hebestreit110
0
gravatar for Michael Lawrence
4.2 years ago by
Michael Lawrence10.0k
United States
Michael Lawrence10.0k wrote:
On Sun, Apr 13, 2014 at 7:18 PM, Katja Hebestreit <katjah@stanford.edu>wrote: > Hello, > > I get an error when I try to import my gff file: > > txdb <- makeTranscriptDbFromGFF(file="file.gtf", format="gtf") > > Error in .parse_attrCol(attrCol, file, colnames) : > Some attributes do not conform to 'tag value' format > > This is how my file looks like: > > chr1 mm9_refFlat stop_codon 3206103 3206105 0.000000 - > . gene_id "Xkr4"; transcript_id "Xkr4"; > chr1 mm9_refFlat CDS 3206106 3207049 0.000000 - 2 > gene_id "Xkr4"; transcript_id "Xkr4"; > chr1 mm9_refFlat exon 3204563 3207049 0.000000 - . > gene_id "Xkr4"; transcript_id "Xkr4"; > chr1 mm9_refFlat CDS 3411783 3411982 0.000000 - 1 > gene_id "Xkr4"; transcript_id "Xkr4"; > chr1 mm9_refFlat exon 3411783 3411982 0.000000 - . > gene_id "Xkr4"; transcript_id "Xkr4"; > chr1 mm9_refFlat CDS 3660633 3661429 0.000000 - 0 > gene_id "Xkr4"; transcript_id "Xkr4"; > > I have the feeling that this has something to do with the missing exon > rank information in my file. Is that true? Is there a way to import this > file? All I want to do is to determine the gene lengths. > It is most likely as the error says: some of your attributes are malformed. Is that the entire file listed above, or is there more? If you could get me the file somehow I could diagnose the issue. > > Could anyone help? That would be awesome! > Cheers, > Katja > > > sessionInfo() > R version 3.1.0 (2014-04-10) > Platform: x86_64-unknown-linux-gnu (64-bit) > > locale: > [1] LC_CTYPE=de_DE.UTF-8 LC_NUMERIC=C > [3] LC_TIME=de_DE.UTF-8 LC_COLLATE=de_DE.UTF-8 > [5] LC_MONETARY=de_DE.UTF-8 LC_MESSAGES=de_DE.UTF-8 > [7] LC_PAPER=de_DE.UTF-8 LC_NAME=C > [9] LC_ADDRESS=C LC_TELEPHONE=C > [11] LC_MEASUREMENT=de_DE.UTF-8 LC_IDENTIFICATION=C > > attached base packages: > [1] parallel stats graphics grDevices utils datasets methods > [8] base > > other attached packages: > [1] GenomicFeatures_1.16.0 AnnotationDbi_1.25.19 Biobase_2.23.6 > [4] GenomicRanges_1.16.0 GenomeInfoDb_0.99.32 IRanges_1.21.45 > [7] BiocGenerics_0.9.3 BiocInstaller_1.14.1 > > loaded via a namespace (and not attached): > [1] BatchJobs_1.2 BBmisc_1.5 BiocParallel_0.6.0 > [4] biomaRt_2.20.0 Biostrings_2.32.0 bitops_1.0-6 > [7] brew_1.0-6 BSgenome_1.32.0 codetools_0.2-8 > [10] DBI_0.2-7 digest_0.6.4 fail_1.2 > [13] foreach_1.4.2 GenomicAlignments_1.0.0 iterators_1.0.7 > [16] plyr_1.8.1 Rcpp_0.11.1 RCurl_1.95-4.1 > [19] Rsamtools_1.16.0 RSQLite_0.11.4 rtracklayer_1.24.0 > [22] sendmailR_1.1-2 stats4_3.1.0 stringr_0.6.2 > [25] tools_3.1.0 XML_3.98-1.1 XVector_0.4.0 > [28] zlibbioc_1.10.0 > > _______________________________________________ > Bioconductor mailing list > Bioconductor@r-project.org > https://stat.ethz.ch/mailman/listinfo/bioconductor > Search the archives: > http://news.gmane.org/gmane.science.biology.informatics.conductor > [[alternative HTML version deleted]]
ADD COMMENTlink written 4.2 years ago by Michael Lawrence10.0k
Actually, the error was not reproducible with the lines I attached. But it is reproducible with those lines (four additional lines): chr1 mm9_refFlat stop_codon 3206103 3206105 0.000000 - . gene_id "Xkr4"; transcript_id "Xkr4"; chr1 mm9_refFlat CDS 3206106 3207049 0.000000 - 2 gene_id "Xkr4"; transcript_id "Xkr4"; chr1 mm9_refFlat exon 3204563 3207049 0.000000 - . gene_id "Xkr4"; transcript_id "Xkr4"; chr1 mm9_refFlat CDS 3411783 3411982 0.000000 - 1 gene_id "Xkr4"; transcript_id "Xkr4"; chr1 mm9_refFlat exon 3411783 3411982 0.000000 - . gene_id "Xkr4"; transcript_id "Xkr4"; chr1 mm9_refFlat CDS 3660633 3661429 0.000000 - 0 gene_id "Xkr4"; transcript_id "Xkr4"; chr1 mm9_refFlat start_codon 3661427 3661429 0.000000 - . gene_id "Xkr4"; transcript_id "Xkr4"; chr1 mm9_refFlat exon 3660633 3661579 0.000000 - . gene_id "Xkr4"; transcript_id "Xkr4"; chr1 mm9_refFlat stop_codon 4283062 4283064 0.000000 - . gene_id "Rp1"; transcript_id "Rp1"; chr1 mm9_refFlat CDS 4283065 4283093 0.000000 - 2 gene_id "Rp1"; transcript_id "Rp1"; Let me know if you like to get the entire file. Thank you!! Katja ----- Original Message ----- From: "Michael Lawrence" <lawrence.michael@gene.com> To: "Katja Hebestreit" <katjah at="" stanford.edu=""> Cc: bioconductor at r-project.org, "Rsamtools Maintainer" <maintainer at="" bioconductor.org=""> Sent: Sunday, April 13, 2014 10:02:13 PM Subject: Re: [BioC] GenomicFeatures: Problem with makeTranscriptDbFromGFF On Sun, Apr 13, 2014 at 7:18 PM, Katja Hebestreit <katjah at="" stanford.edu="">wrote: > Hello, > > I get an error when I try to import my gff file: > > txdb <- makeTranscriptDbFromGFF(file="file.gtf", format="gtf") > > Error in .parse_attrCol(attrCol, file, colnames) : > Some attributes do not conform to 'tag value' format > > This is how my file looks like: > > chr1 mm9_refFlat stop_codon 3206103 3206105 0.000000 - > . gene_id "Xkr4"; transcript_id "Xkr4"; > chr1 mm9_refFlat CDS 3206106 3207049 0.000000 - 2 > gene_id "Xkr4"; transcript_id "Xkr4"; > chr1 mm9_refFlat exon 3204563 3207049 0.000000 - . > gene_id "Xkr4"; transcript_id "Xkr4"; > chr1 mm9_refFlat CDS 3411783 3411982 0.000000 - 1 > gene_id "Xkr4"; transcript_id "Xkr4"; > chr1 mm9_refFlat exon 3411783 3411982 0.000000 - . > gene_id "Xkr4"; transcript_id "Xkr4"; > chr1 mm9_refFlat CDS 3660633 3661429 0.000000 - 0 > gene_id "Xkr4"; transcript_id "Xkr4"; > > I have the feeling that this has something to do with the missing exon > rank information in my file. Is that true? Is there a way to import this > file? All I want to do is to determine the gene lengths. > It is most likely as the error says: some of your attributes are malformed. Is that the entire file listed above, or is there more? If you could get me the file somehow I could diagnose the issue. > > Could anyone help? That would be awesome! > Cheers, > Katja > > > sessionInfo() > R version 3.1.0 (2014-04-10) > Platform: x86_64-unknown-linux-gnu (64-bit) > > locale: > [1] LC_CTYPE=de_DE.UTF-8 LC_NUMERIC=C > [3] LC_TIME=de_DE.UTF-8 LC_COLLATE=de_DE.UTF-8 > [5] LC_MONETARY=de_DE.UTF-8 LC_MESSAGES=de_DE.UTF-8 > [7] LC_PAPER=de_DE.UTF-8 LC_NAME=C > [9] LC_ADDRESS=C LC_TELEPHONE=C > [11] LC_MEASUREMENT=de_DE.UTF-8 LC_IDENTIFICATION=C > > attached base packages: > [1] parallel stats graphics grDevices utils datasets methods > [8] base > > other attached packages: > [1] GenomicFeatures_1.16.0 AnnotationDbi_1.25.19 Biobase_2.23.6 > [4] GenomicRanges_1.16.0 GenomeInfoDb_0.99.32 IRanges_1.21.45 > [7] BiocGenerics_0.9.3 BiocInstaller_1.14.1 > > loaded via a namespace (and not attached): > [1] BatchJobs_1.2 BBmisc_1.5 BiocParallel_0.6.0 > [4] biomaRt_2.20.0 Biostrings_2.32.0 bitops_1.0-6 > [7] brew_1.0-6 BSgenome_1.32.0 codetools_0.2-8 > [10] DBI_0.2-7 digest_0.6.4 fail_1.2 > [13] foreach_1.4.2 GenomicAlignments_1.0.0 iterators_1.0.7 > [16] plyr_1.8.1 Rcpp_0.11.1 RCurl_1.95-4.1 > [19] Rsamtools_1.16.0 RSQLite_0.11.4 rtracklayer_1.24.0 > [22] sendmailR_1.1-2 stats4_3.1.0 stringr_0.6.2 > [25] tools_3.1.0 XML_3.98-1.1 XVector_0.4.0 > [28] zlibbioc_1.10.0 > > _______________________________________________ > Bioconductor mailing list > Bioconductor at r-project.org > https://stat.ethz.ch/mailman/listinfo/bioconductor > Search the archives: > http://news.gmane.org/gmane.science.biology.informatics.conductor >
ADD REPLYlink written 4.2 years ago by Katja Hebestreit110
Well, I copied the text and replaced the spaces with tabs as appropriate and everything seemed to work fine, so you might to attach that fragment of the file, just to be sure it isn't a formatting issue. Does import("file.gtf") work for you? If so, that should be good enough for your use case. Michael On Sun, Apr 13, 2014 at 10:14 PM, Katja Hebestreit <katjah@stanford.edu>wrote: > Actually, the error was not reproducible with the lines I attached. But it > is reproducible with those lines (four additional lines): > > chr1 mm9_refFlat stop_codon 3206103 3206105 0.000000 - > . gene_id "Xkr4"; transcript_id "Xkr4"; > chr1 mm9_refFlat CDS 3206106 3207049 0.000000 - 2 > gene_id "Xkr4"; transcript_id "Xkr4"; > chr1 mm9_refFlat exon 3204563 3207049 0.000000 - . > gene_id "Xkr4"; transcript_id "Xkr4"; > chr1 mm9_refFlat CDS 3411783 3411982 0.000000 - 1 > gene_id "Xkr4"; transcript_id "Xkr4"; > chr1 mm9_refFlat exon 3411783 3411982 0.000000 - . > gene_id "Xkr4"; transcript_id "Xkr4"; > chr1 mm9_refFlat CDS 3660633 3661429 0.000000 - 0 > gene_id "Xkr4"; transcript_id "Xkr4"; > chr1 mm9_refFlat start_codon 3661427 3661429 0.000000 - > . gene_id "Xkr4"; transcript_id "Xkr4"; > chr1 mm9_refFlat exon 3660633 3661579 0.000000 - . > gene_id "Xkr4"; transcript_id "Xkr4"; > chr1 mm9_refFlat stop_codon 4283062 4283064 0.000000 - > . gene_id "Rp1"; transcript_id "Rp1"; > chr1 mm9_refFlat CDS 4283065 4283093 0.000000 - 2 > gene_id "Rp1"; transcript_id "Rp1"; > > Let me know if you like to get the entire file. > > Thank you!! > Katja > > ----- Original Message ----- > From: "Michael Lawrence" <lawrence.michael@gene.com> > To: "Katja Hebestreit" <katjah@stanford.edu> > Cc: bioconductor@r-project.org, "Rsamtools Maintainer" < > maintainer@bioconductor.org> > Sent: Sunday, April 13, 2014 10:02:13 PM > Subject: Re: [BioC] GenomicFeatures: Problem with makeTranscriptDbFromGFF > > On Sun, Apr 13, 2014 at 7:18 PM, Katja Hebestreit <katjah@stanford.edu> >wrote: > > > Hello, > > > > I get an error when I try to import my gff file: > > > > txdb <- makeTranscriptDbFromGFF(file="file.gtf", format="gtf") > > > > Error in .parse_attrCol(attrCol, file, colnames) : > > Some attributes do not conform to 'tag value' format > > > > This is how my file looks like: > > > > chr1 mm9_refFlat stop_codon 3206103 3206105 0.000000 - > > . gene_id "Xkr4"; transcript_id "Xkr4"; > > chr1 mm9_refFlat CDS 3206106 3207049 0.000000 - 2 > > gene_id "Xkr4"; transcript_id "Xkr4"; > > chr1 mm9_refFlat exon 3204563 3207049 0.000000 - . > > gene_id "Xkr4"; transcript_id "Xkr4"; > > chr1 mm9_refFlat CDS 3411783 3411982 0.000000 - 1 > > gene_id "Xkr4"; transcript_id "Xkr4"; > > chr1 mm9_refFlat exon 3411783 3411982 0.000000 - . > > gene_id "Xkr4"; transcript_id "Xkr4"; > > chr1 mm9_refFlat CDS 3660633 3661429 0.000000 - 0 > > gene_id "Xkr4"; transcript_id "Xkr4"; > > > > I have the feeling that this has something to do with the missing exon > > rank information in my file. Is that true? Is there a way to import this > > file? All I want to do is to determine the gene lengths. > > > > It is most likely as the error says: some of your attributes are malformed. > Is that the entire file listed above, or is there more? If you could get me > the file somehow I could diagnose the issue. > > > > > > Could anyone help? That would be awesome! > > Cheers, > > Katja > > > > > > sessionInfo() > > R version 3.1.0 (2014-04-10) > > Platform: x86_64-unknown-linux-gnu (64-bit) > > > > locale: > > [1] LC_CTYPE=de_DE.UTF-8 LC_NUMERIC=C > > [3] LC_TIME=de_DE.UTF-8 LC_COLLATE=de_DE.UTF-8 > > [5] LC_MONETARY=de_DE.UTF-8 LC_MESSAGES=de_DE.UTF-8 > > [7] LC_PAPER=de_DE.UTF-8 LC_NAME=C > > [9] LC_ADDRESS=C LC_TELEPHONE=C > > [11] LC_MEASUREMENT=de_DE.UTF-8 LC_IDENTIFICATION=C > > > > attached base packages: > > [1] parallel stats graphics grDevices utils datasets methods > > [8] base > > > > other attached packages: > > [1] GenomicFeatures_1.16.0 AnnotationDbi_1.25.19 Biobase_2.23.6 > > [4] GenomicRanges_1.16.0 GenomeInfoDb_0.99.32 IRanges_1.21.45 > > [7] BiocGenerics_0.9.3 BiocInstaller_1.14.1 > > > > loaded via a namespace (and not attached): > > [1] BatchJobs_1.2 BBmisc_1.5 BiocParallel_0.6.0 > > [4] biomaRt_2.20.0 Biostrings_2.32.0 bitops_1.0-6 > > [7] brew_1.0-6 BSgenome_1.32.0 codetools_0.2-8 > > [10] DBI_0.2-7 digest_0.6.4 fail_1.2 > > [13] foreach_1.4.2 GenomicAlignments_1.0.0 iterators_1.0.7 > > [16] plyr_1.8.1 Rcpp_0.11.1 RCurl_1.95-4.1 > > [19] Rsamtools_1.16.0 RSQLite_0.11.4 rtracklayer_1.24.0 > > [22] sendmailR_1.1-2 stats4_3.1.0 stringr_0.6.2 > > [25] tools_3.1.0 XML_3.98-1.1 XVector_0.4.0 > > [28] zlibbioc_1.10.0 > > > > _______________________________________________ > > Bioconductor mailing list > > Bioconductor@r-project.org > > https://stat.ethz.ch/mailman/listinfo/bioconductor > > Search the archives: > > http://news.gmane.org/gmane.science.biology.informatics.conductor > > > [[alternative HTML version deleted]]
ADD REPLYlink written 4.2 years ago by Michael Lawrence10.0k
You can download the file here: https://www.dropbox.com/s/04nck83jq6r91bc/mm9_test.gtf Using file I get the error: txdb <- makeTranscriptDbFromGFF(file="Data/mm9_test.gtf", format="gtf") Error in .parse_attrCol(attrCol, file, colnames) : Some attributes do not conform to 'tag value' format Thank you so much for helping!! Katja ----- Original Message ----- From: "Michael Lawrence" <lawrence.michael@gene.com> To: "Katja Hebestreit" <katjah at="" stanford.edu=""> Cc: "Michael Lawrence" <lawrence.michael at="" gene.com="">, bioconductor at r-project.org, "Rsamtools Maintainer" <maintainer at="" bioconductor.org=""> Sent: Monday, April 14, 2014 7:27:26 AM Subject: Re: [BioC] GenomicFeatures: Problem with makeTranscriptDbFromGFF Well, I copied the text and replaced the spaces with tabs as appropriate and everything seemed to work fine, so you might to attach that fragment of the file, just to be sure it isn't a formatting issue. Does import("file.gtf") work for you? If so, that should be good enough for your use case. Michael On Sun, Apr 13, 2014 at 10:14 PM, Katja Hebestreit <katjah at="" stanford.edu="">wrote: > Actually, the error was not reproducible with the lines I attached. But it > is reproducible with those lines (four additional lines): > > chr1 mm9_refFlat stop_codon 3206103 3206105 0.000000 - > . gene_id "Xkr4"; transcript_id "Xkr4"; > chr1 mm9_refFlat CDS 3206106 3207049 0.000000 - 2 > gene_id "Xkr4"; transcript_id "Xkr4"; > chr1 mm9_refFlat exon 3204563 3207049 0.000000 - . > gene_id "Xkr4"; transcript_id "Xkr4"; > chr1 mm9_refFlat CDS 3411783 3411982 0.000000 - 1 > gene_id "Xkr4"; transcript_id "Xkr4"; > chr1 mm9_refFlat exon 3411783 3411982 0.000000 - . > gene_id "Xkr4"; transcript_id "Xkr4"; > chr1 mm9_refFlat CDS 3660633 3661429 0.000000 - 0 > gene_id "Xkr4"; transcript_id "Xkr4"; > chr1 mm9_refFlat start_codon 3661427 3661429 0.000000 - > . gene_id "Xkr4"; transcript_id "Xkr4"; > chr1 mm9_refFlat exon 3660633 3661579 0.000000 - . > gene_id "Xkr4"; transcript_id "Xkr4"; > chr1 mm9_refFlat stop_codon 4283062 4283064 0.000000 - > . gene_id "Rp1"; transcript_id "Rp1"; > chr1 mm9_refFlat CDS 4283065 4283093 0.000000 - 2 > gene_id "Rp1"; transcript_id "Rp1"; > > Let me know if you like to get the entire file. > > Thank you!! > Katja > > ----- Original Message ----- > From: "Michael Lawrence" <lawrence.michael at="" gene.com=""> > To: "Katja Hebestreit" <katjah at="" stanford.edu=""> > Cc: bioconductor at r-project.org, "Rsamtools Maintainer" < > maintainer at bioconductor.org> > Sent: Sunday, April 13, 2014 10:02:13 PM > Subject: Re: [BioC] GenomicFeatures: Problem with makeTranscriptDbFromGFF > > On Sun, Apr 13, 2014 at 7:18 PM, Katja Hebestreit <katjah at="" stanford.edu=""> >wrote: > > > Hello, > > > > I get an error when I try to import my gff file: > > > > txdb <- makeTranscriptDbFromGFF(file="file.gtf", format="gtf") > > > > Error in .parse_attrCol(attrCol, file, colnames) : > > Some attributes do not conform to 'tag value' format > > > > This is how my file looks like: > > > > chr1 mm9_refFlat stop_codon 3206103 3206105 0.000000 - > > . gene_id "Xkr4"; transcript_id "Xkr4"; > > chr1 mm9_refFlat CDS 3206106 3207049 0.000000 - 2 > > gene_id "Xkr4"; transcript_id "Xkr4"; > > chr1 mm9_refFlat exon 3204563 3207049 0.000000 - . > > gene_id "Xkr4"; transcript_id "Xkr4"; > > chr1 mm9_refFlat CDS 3411783 3411982 0.000000 - 1 > > gene_id "Xkr4"; transcript_id "Xkr4"; > > chr1 mm9_refFlat exon 3411783 3411982 0.000000 - . > > gene_id "Xkr4"; transcript_id "Xkr4"; > > chr1 mm9_refFlat CDS 3660633 3661429 0.000000 - 0 > > gene_id "Xkr4"; transcript_id "Xkr4"; > > > > I have the feeling that this has something to do with the missing exon > > rank information in my file. Is that true? Is there a way to import this > > file? All I want to do is to determine the gene lengths. > > > > It is most likely as the error says: some of your attributes are malformed. > Is that the entire file listed above, or is there more? If you could get me > the file somehow I could diagnose the issue. > > > > > > Could anyone help? That would be awesome! > > Cheers, > > Katja > > > > > > sessionInfo() > > R version 3.1.0 (2014-04-10) > > Platform: x86_64-unknown-linux-gnu (64-bit) > > > > locale: > > [1] LC_CTYPE=de_DE.UTF-8 LC_NUMERIC=C > > [3] LC_TIME=de_DE.UTF-8 LC_COLLATE=de_DE.UTF-8 > > [5] LC_MONETARY=de_DE.UTF-8 LC_MESSAGES=de_DE.UTF-8 > > [7] LC_PAPER=de_DE.UTF-8 LC_NAME=C > > [9] LC_ADDRESS=C LC_TELEPHONE=C > > [11] LC_MEASUREMENT=de_DE.UTF-8 LC_IDENTIFICATION=C > > > > attached base packages: > > [1] parallel stats graphics grDevices utils datasets methods > > [8] base > > > > other attached packages: > > [1] GenomicFeatures_1.16.0 AnnotationDbi_1.25.19 Biobase_2.23.6 > > [4] GenomicRanges_1.16.0 GenomeInfoDb_0.99.32 IRanges_1.21.45 > > [7] BiocGenerics_0.9.3 BiocInstaller_1.14.1 > > > > loaded via a namespace (and not attached): > > [1] BatchJobs_1.2 BBmisc_1.5 BiocParallel_0.6.0 > > [4] biomaRt_2.20.0 Biostrings_2.32.0 bitops_1.0-6 > > [7] brew_1.0-6 BSgenome_1.32.0 codetools_0.2-8 > > [10] DBI_0.2-7 digest_0.6.4 fail_1.2 > > [13] foreach_1.4.2 GenomicAlignments_1.0.0 iterators_1.0.7 > > [16] plyr_1.8.1 Rcpp_0.11.1 RCurl_1.95-4.1 > > [19] Rsamtools_1.16.0 RSQLite_0.11.4 rtracklayer_1.24.0 > > [22] sendmailR_1.1-2 stats4_3.1.0 stringr_0.6.2 > > [25] tools_3.1.0 XML_3.98-1.1 XVector_0.4.0 > > [28] zlibbioc_1.10.0 > > > > _______________________________________________ > > Bioconductor mailing list > > Bioconductor at r-project.org > > https://stat.ethz.ch/mailman/listinfo/bioconductor > > Search the archives: > > http://news.gmane.org/gmane.science.biology.informatics.conductor > > >
ADD REPLYlink written 4.2 years ago by Katja Hebestreit110
remove the trailing whitespace at the end of every line On Mon, Apr 14, 2014 at 2:24 PM, Katja Hebestreit <katjah@stanford.edu>wrote: > You can download the file here: > > https://www.dropbox.com/s/04nck83jq6r91bc/mm9_test.gtf > > Using file I get the error: > > txdb <- makeTranscriptDbFromGFF(file="Data/mm9_test.gtf", format="gtf") > Error in .parse_attrCol(attrCol, file, colnames) : > Some attributes do not conform to 'tag value' format > > Thank you so much for helping!! > Katja > > > ----- Original Message ----- > From: "Michael Lawrence" <lawrence.michael@gene.com> > To: "Katja Hebestreit" <katjah@stanford.edu> > Cc: "Michael Lawrence" <lawrence.michael@gene.com>, > bioconductor@r-project.org, "Rsamtools Maintainer" < > maintainer@bioconductor.org> > Sent: Monday, April 14, 2014 7:27:26 AM > Subject: Re: [BioC] GenomicFeatures: Problem with makeTranscriptDbFromGFF > > Well, I copied the text and replaced the spaces with tabs as appropriate > and everything seemed to work fine, so you might to attach that fragment of > the file, just to be sure it isn't a formatting issue. > > Does import("file.gtf") work for you? If so, that should be good enough for > your use case. > > Michael > > > On Sun, Apr 13, 2014 at 10:14 PM, Katja Hebestreit <katjah@stanford.edu> >wrote: > > > Actually, the error was not reproducible with the lines I attached. But > it > > is reproducible with those lines (four additional lines): > > > > chr1 mm9_refFlat stop_codon 3206103 3206105 0.000000 - > > . gene_id "Xkr4"; transcript_id "Xkr4"; > > chr1 mm9_refFlat CDS 3206106 3207049 0.000000 - 2 > > gene_id "Xkr4"; transcript_id "Xkr4"; > > chr1 mm9_refFlat exon 3204563 3207049 0.000000 - . > > gene_id "Xkr4"; transcript_id "Xkr4"; > > chr1 mm9_refFlat CDS 3411783 3411982 0.000000 - 1 > > gene_id "Xkr4"; transcript_id "Xkr4"; > > chr1 mm9_refFlat exon 3411783 3411982 0.000000 - . > > gene_id "Xkr4"; transcript_id "Xkr4"; > > chr1 mm9_refFlat CDS 3660633 3661429 0.000000 - 0 > > gene_id "Xkr4"; transcript_id "Xkr4"; > > chr1 mm9_refFlat start_codon 3661427 3661429 0.000000 - > > . gene_id "Xkr4"; transcript_id "Xkr4"; > > chr1 mm9_refFlat exon 3660633 3661579 0.000000 - . > > gene_id "Xkr4"; transcript_id "Xkr4"; > > chr1 mm9_refFlat stop_codon 4283062 4283064 0.000000 - > > . gene_id "Rp1"; transcript_id "Rp1"; > > chr1 mm9_refFlat CDS 4283065 4283093 0.000000 - 2 > > gene_id "Rp1"; transcript_id "Rp1"; > > > > Let me know if you like to get the entire file. > > > > Thank you!! > > Katja > > > > ----- Original Message ----- > > From: "Michael Lawrence" <lawrence.michael@gene.com> > > To: "Katja Hebestreit" <katjah@stanford.edu> > > Cc: bioconductor@r-project.org, "Rsamtools Maintainer" < > > maintainer@bioconductor.org> > > Sent: Sunday, April 13, 2014 10:02:13 PM > > Subject: Re: [BioC] GenomicFeatures: Problem with makeTranscriptDbFromGFF > > > > On Sun, Apr 13, 2014 at 7:18 PM, Katja Hebestreit <katjah@stanford.edu> > >wrote: > > > > > Hello, > > > > > > I get an error when I try to import my gff file: > > > > > > txdb <- makeTranscriptDbFromGFF(file="file.gtf", format="gtf") > > > > > > Error in .parse_attrCol(attrCol, file, colnames) : > > > Some attributes do not conform to 'tag value' format > > > > > > This is how my file looks like: > > > > > > chr1 mm9_refFlat stop_codon 3206103 3206105 0.000000 > - > > > . gene_id "Xkr4"; transcript_id "Xkr4"; > > > chr1 mm9_refFlat CDS 3206106 3207049 0.000000 - > 2 > > > gene_id "Xkr4"; transcript_id "Xkr4"; > > > chr1 mm9_refFlat exon 3204563 3207049 0.000000 - > . > > > gene_id "Xkr4"; transcript_id "Xkr4"; > > > chr1 mm9_refFlat CDS 3411783 3411982 0.000000 - > 1 > > > gene_id "Xkr4"; transcript_id "Xkr4"; > > > chr1 mm9_refFlat exon 3411783 3411982 0.000000 - > . > > > gene_id "Xkr4"; transcript_id "Xkr4"; > > > chr1 mm9_refFlat CDS 3660633 3661429 0.000000 - > 0 > > > gene_id "Xkr4"; transcript_id "Xkr4"; > > > > > > I have the feeling that this has something to do with the missing exon > > > rank information in my file. Is that true? Is there a way to import > this > > > file? All I want to do is to determine the gene lengths. > > > > > > > It is most likely as the error says: some of your attributes are > malformed. > > Is that the entire file listed above, or is there more? If you could get > me > > the file somehow I could diagnose the issue. > > > > > > > > > > Could anyone help? That would be awesome! > > > Cheers, > > > Katja > > > > > > > > > sessionInfo() > > > R version 3.1.0 (2014-04-10) > > > Platform: x86_64-unknown-linux-gnu (64-bit) > > > > > > locale: > > > [1] LC_CTYPE=de_DE.UTF-8 LC_NUMERIC=C > > > [3] LC_TIME=de_DE.UTF-8 LC_COLLATE=de_DE.UTF-8 > > > [5] LC_MONETARY=de_DE.UTF-8 LC_MESSAGES=de_DE.UTF-8 > > > [7] LC_PAPER=de_DE.UTF-8 LC_NAME=C > > > [9] LC_ADDRESS=C LC_TELEPHONE=C > > > [11] LC_MEASUREMENT=de_DE.UTF-8 LC_IDENTIFICATION=C > > > > > > attached base packages: > > > [1] parallel stats graphics grDevices utils datasets methods > > > [8] base > > > > > > other attached packages: > > > [1] GenomicFeatures_1.16.0 AnnotationDbi_1.25.19 Biobase_2.23.6 > > > [4] GenomicRanges_1.16.0 GenomeInfoDb_0.99.32 IRanges_1.21.45 > > > [7] BiocGenerics_0.9.3 BiocInstaller_1.14.1 > > > > > > loaded via a namespace (and not attached): > > > [1] BatchJobs_1.2 BBmisc_1.5 BiocParallel_0.6.0 > > > [4] biomaRt_2.20.0 Biostrings_2.32.0 bitops_1.0-6 > > > [7] brew_1.0-6 BSgenome_1.32.0 codetools_0.2-8 > > > [10] DBI_0.2-7 digest_0.6.4 fail_1.2 > > > [13] foreach_1.4.2 GenomicAlignments_1.0.0 iterators_1.0.7 > > > [16] plyr_1.8.1 Rcpp_0.11.1 RCurl_1.95-4.1 > > > [19] Rsamtools_1.16.0 RSQLite_0.11.4 rtracklayer_1.24.0 > > > [22] sendmailR_1.1-2 stats4_3.1.0 stringr_0.6.2 > > > [25] tools_3.1.0 XML_3.98-1.1 XVector_0.4.0 > > > [28] zlibbioc_1.10.0 > > > > > > _______________________________________________ > > > Bioconductor mailing list > > > Bioconductor@r-project.org > > > https://stat.ethz.ch/mailman/listinfo/bioconductor > > > Search the archives: > > > http://news.gmane.org/gmane.science.biology.informatics.conductor > > > > > > > _______________________________________________ > Bioconductor mailing list > Bioconductor@r-project.org > https://stat.ethz.ch/mailman/listinfo/bioconductor > Search the archives: > http://news.gmane.org/gmane.science.biology.informatics.conductor > [[alternative HTML version deleted]]
ADD REPLYlink written 4.2 years ago by Vincent J. Carey, Jr.6.2k
Okay, I removed the whitspaces with sed 's/.$//' mm9_test.gtf > mm9_test_noSpace.gtf Here is the file: https://www.dropbox.com/s/x5ne8qz8sqbxvrp/mm9_test_noSpace.gtf But still, I get an error: txdb <- makeTranscriptDbFromGFF(file="Data/mm9_test_noSpace.gtf", format="gtf") extracting transcript information Estimating transcript ranges. Error in res[i, ] <- .deduceTranscriptRangeData(subs[[i]]) : number of items to replace is not a multiple of replacement I tried to reduce the file further, in order to get a smaller example file and to check which lines cause the problem, but: mm9_test_noSpace.gtf is 200,000 lines long and it results in an error. When I am using the first 100,000 (or, last 100,000 lines, respectively) it works! head -n 100000 mm9_test_noSpace.gtf > mm9_test_test_noSpace.gtf --> works! tail -n 100000 mm9_test_noSpace.gtf > mm9_test_test_noSpace.gtf --> works! I have no idea what is going on. Also, the .gtf file is from UCSC. Thanks again for helping with that annoying problem! Katja ----- Original Message ----- From: "Vincent Carey" <stvjc@channing.harvard.ed to:="" "katja="" hebestreit"="" <katjah="" at="" stanford.edu=""> Cc: "Michael Lawrence" <lawrence.michael at="" gene.com="">, "Rsamtools Maintainer" <maintainer at="" bioconductor.org="">, bioconductor at r-project.org Sent: Monday, April 14, 2014 11:45:30 AM Subject: Re: [BioC] GenomicFeatures: Problem with makeTranscriptDbFromGFF remove the trailing whitespace at the end of every line On Mon, Apr 14, 2014 at 2:24 PM, Katja Hebestreit <katjah at="" stanford.edu="">wrote: > You can download the file here: > > https://www.dropbox.com/s/04nck83jq6r91bc/mm9_test.gtf > > Using file I get the error: > > txdb <- makeTranscriptDbFromGFF(file="Data/mm9_test.gtf", format="gtf") > Error in .parse_attrCol(attrCol, file, colnames) : > Some attributes do not conform to 'tag value' format > > Thank you so much for helping!! > Katja > > > ----- Original Message ----- > From: "Michael Lawrence" <lawrence.michael at="" gene.com=""> > To: "Katja Hebestreit" <katjah at="" stanford.edu=""> > Cc: "Michael Lawrence" <lawrence.michael at="" gene.com="">, > bioconductor at r-project.org, "Rsamtools Maintainer" < > maintainer at bioconductor.org> > Sent: Monday, April 14, 2014 7:27:26 AM > Subject: Re: [BioC] GenomicFeatures: Problem with makeTranscriptDbFromGFF > > Well, I copied the text and replaced the spaces with tabs as appropriate > and everything seemed to work fine, so you might to attach that fragment of > the file, just to be sure it isn't a formatting issue. > > Does import("file.gtf") work for you? If so, that should be good enough for > your use case. > > Michael > > > On Sun, Apr 13, 2014 at 10:14 PM, Katja Hebestreit <katjah at="" stanford.edu=""> >wrote: > > > Actually, the error was not reproducible with the lines I attached. But > it > > is reproducible with those lines (four additional lines): > > > > chr1 mm9_refFlat stop_codon 3206103 3206105 0.000000 - > > . gene_id "Xkr4"; transcript_id "Xkr4"; > > chr1 mm9_refFlat CDS 3206106 3207049 0.000000 - 2 > > gene_id "Xkr4"; transcript_id "Xkr4"; > > chr1 mm9_refFlat exon 3204563 3207049 0.000000 - . > > gene_id "Xkr4"; transcript_id "Xkr4"; > > chr1 mm9_refFlat CDS 3411783 3411982 0.000000 - 1 > > gene_id "Xkr4"; transcript_id "Xkr4"; > > chr1 mm9_refFlat exon 3411783 3411982 0.000000 - . > > gene_id "Xkr4"; transcript_id "Xkr4"; > > chr1 mm9_refFlat CDS 3660633 3661429 0.000000 - 0 > > gene_id "Xkr4"; transcript_id "Xkr4"; > > chr1 mm9_refFlat start_codon 3661427 3661429 0.000000 - > > . gene_id "Xkr4"; transcript_id "Xkr4"; > > chr1 mm9_refFlat exon 3660633 3661579 0.000000 - . > > gene_id "Xkr4"; transcript_id "Xkr4"; > > chr1 mm9_refFlat stop_codon 4283062 4283064 0.000000 - > > . gene_id "Rp1"; transcript_id "Rp1"; > > chr1 mm9_refFlat CDS 4283065 4283093 0.000000 - 2 > > gene_id "Rp1"; transcript_id "Rp1"; > > > > Let me know if you like to get the entire file. > > > > Thank you!! > > Katja > > > > ----- Original Message ----- > > From: "Michael Lawrence" <lawrence.michael at="" gene.com=""> > > To: "Katja Hebestreit" <katjah at="" stanford.edu=""> > > Cc: bioconductor at r-project.org, "Rsamtools Maintainer" < > > maintainer at bioconductor.org> > > Sent: Sunday, April 13, 2014 10:02:13 PM > > Subject: Re: [BioC] GenomicFeatures: Problem with makeTranscriptDbFromGFF > > > > On Sun, Apr 13, 2014 at 7:18 PM, Katja Hebestreit <katjah at="" stanford.edu=""> > >wrote: > > > > > Hello, > > > > > > I get an error when I try to import my gff file: > > > > > > txdb <- makeTranscriptDbFromGFF(file="file.gtf", format="gtf") > > > > > > Error in .parse_attrCol(attrCol, file, colnames) : > > > Some attributes do not conform to 'tag value' format > > > > > > This is how my file looks like: > > > > > > chr1 mm9_refFlat stop_codon 3206103 3206105 0.000000 > - > > > . gene_id "Xkr4"; transcript_id "Xkr4"; > > > chr1 mm9_refFlat CDS 3206106 3207049 0.000000 - > 2 > > > gene_id "Xkr4"; transcript_id "Xkr4"; > > > chr1 mm9_refFlat exon 3204563 3207049 0.000000 - > . > > > gene_id "Xkr4"; transcript_id "Xkr4"; > > > chr1 mm9_refFlat CDS 3411783 3411982 0.000000 - > 1 > > > gene_id "Xkr4"; transcript_id "Xkr4"; > > > chr1 mm9_refFlat exon 3411783 3411982 0.000000 - > . > > > gene_id "Xkr4"; transcript_id "Xkr4"; > > > chr1 mm9_refFlat CDS 3660633 3661429 0.000000 - > 0 > > > gene_id "Xkr4"; transcript_id "Xkr4"; > > > > > > I have the feeling that this has something to do with the missing exon > > > rank information in my file. Is that true? Is there a way to import > this > > > file? All I want to do is to determine the gene lengths. > > > > > > > It is most likely as the error says: some of your attributes are > malformed. > > Is that the entire file listed above, or is there more? If you could get > me > > the file somehow I could diagnose the issue. > > > > > > > > > > Could anyone help? That would be awesome! > > > Cheers, > > > Katja > > > > > > > > > sessionInfo() > > > R version 3.1.0 (2014-04-10) > > > Platform: x86_64-unknown-linux-gnu (64-bit) > > > > > > locale: > > > [1] LC_CTYPE=de_DE.UTF-8 LC_NUMERIC=C > > > [3] LC_TIME=de_DE.UTF-8 LC_COLLATE=de_DE.UTF-8 > > > [5] LC_MONETARY=de_DE.UTF-8 LC_MESSAGES=de_DE.UTF-8 > > > [7] LC_PAPER=de_DE.UTF-8 LC_NAME=C > > > [9] LC_ADDRESS=C LC_TELEPHONE=C > > > [11] LC_MEASUREMENT=de_DE.UTF-8 LC_IDENTIFICATION=C > > > > > > attached base packages: > > > [1] parallel stats graphics grDevices utils datasets methods > > > [8] base > > > > > > other attached packages: > > > [1] GenomicFeatures_1.16.0 AnnotationDbi_1.25.19 Biobase_2.23.6 > > > [4] GenomicRanges_1.16.0 GenomeInfoDb_0.99.32 IRanges_1.21.45 > > > [7] BiocGenerics_0.9.3 BiocInstaller_1.14.1 > > > > > > loaded via a namespace (and not attached): > > > [1] BatchJobs_1.2 BBmisc_1.5 BiocParallel_0.6.0 > > > [4] biomaRt_2.20.0 Biostrings_2.32.0 bitops_1.0-6 > > > [7] brew_1.0-6 BSgenome_1.32.0 codetools_0.2-8 > > > [10] DBI_0.2-7 digest_0.6.4 fail_1.2 > > > [13] foreach_1.4.2 GenomicAlignments_1.0.0 iterators_1.0.7 > > > [16] plyr_1.8.1 Rcpp_0.11.1 RCurl_1.95-4.1 > > > [19] Rsamtools_1.16.0 RSQLite_0.11.4 rtracklayer_1.24.0 > > > [22] sendmailR_1.1-2 stats4_3.1.0 stringr_0.6.2 > > > [25] tools_3.1.0 XML_3.98-1.1 XVector_0.4.0 > > > [28] zlibbioc_1.10.0 > > > > > > _______________________________________________ > > > Bioconductor mailing list > > > Bioconductor at r-project.org > > > https://stat.ethz.ch/mailman/listinfo/bioconductor > > > Search the archives: > > > http://news.gmane.org/gmane.science.biology.informatics.conductor > > > > > > > _______________________________________________ > Bioconductor mailing list > Bioconductor at r-project.org > https://stat.ethz.ch/mailman/listinfo/bioconductor > Search the archives: > http://news.gmane.org/gmane.science.biology.informatics.conductor >
ADD REPLYlink written 4.2 years ago by Katja Hebestreit110
I've made rtracklayer 1.25.1 ignore whitespace after the last attribute. After that change, both the truncated and full file parse successfully. Btw, which genes are these? The UCSC genes are already available as a TxDb package. On Mon, Apr 14, 2014 at 2:36 PM, Katja Hebestreit <katjah@stanford.edu>wrote: > Okay, I removed the whitspaces with > sed 's/.$//' mm9_test.gtf > mm9_test_noSpace.gtf > > Here is the file: > https://www.dropbox.com/s/x5ne8qz8sqbxvrp/mm9_test_noSpace.gtf > > But still, I get an error: > > txdb <- makeTranscriptDbFromGFF(file="Data/mm9_test_noSpace.gtf", > format="gtf") > extracting transcript information > Estimating transcript ranges. > Error in res[i, ] <- .deduceTranscriptRangeData(subs[[i]]) : > number of items to replace is not a multiple of replacement > > > I tried to reduce the file further, in order to get a smaller example file > and to check which lines cause the problem, but: > > mm9_test_noSpace.gtf is 200,000 lines long and it results in an error. > When I am using the first 100,000 (or, last 100,000 lines, respectively) it > works! > > head -n 100000 mm9_test_noSpace.gtf > mm9_test_test_noSpace.gtf > --> works! > tail -n 100000 mm9_test_noSpace.gtf > mm9_test_test_noSpace.gtf > --> works! > > I have no idea what is going on. Also, the .gtf file is from UCSC. > > Thanks again for helping with that annoying problem! > Katja > > > ----- Original Message ----- > From: "Vincent Carey" <stvjc@channing.harvard.ed> To: "Katja Hebestreit" <katjah@stanford.edu> > Cc: "Michael Lawrence" <lawrence.michael@gene.com>, "Rsamtools > Maintainer" <maintainer@bioconductor.org>, bioconductor@r-project.org > Sent: Monday, April 14, 2014 11:45:30 AM > Subject: Re: [BioC] GenomicFeatures: Problem with makeTranscriptDbFromGFF > > remove the trailing whitespace at the end of every line > > > On Mon, Apr 14, 2014 at 2:24 PM, Katja Hebestreit <katjah@stanford.edu> >wrote: > > > You can download the file here: > > > > https://www.dropbox.com/s/04nck83jq6r91bc/mm9_test.gtf > > > > Using file I get the error: > > > > txdb <- makeTranscriptDbFromGFF(file="Data/mm9_test.gtf", format="gtf") > > Error in .parse_attrCol(attrCol, file, colnames) : > > Some attributes do not conform to 'tag value' format > > > > Thank you so much for helping!! > > Katja > > > > > > ----- Original Message ----- > > From: "Michael Lawrence" <lawrence.michael@gene.com> > > To: "Katja Hebestreit" <katjah@stanford.edu> > > Cc: "Michael Lawrence" <lawrence.michael@gene.com>, > > bioconductor@r-project.org, "Rsamtools Maintainer" < > > maintainer@bioconductor.org> > > Sent: Monday, April 14, 2014 7:27:26 AM > > Subject: Re: [BioC] GenomicFeatures: Problem with makeTranscriptDbFromGFF > > > > Well, I copied the text and replaced the spaces with tabs as appropriate > > and everything seemed to work fine, so you might to attach that fragment > of > > the file, just to be sure it isn't a formatting issue. > > > > Does import("file.gtf") work for you? If so, that should be good enough > for > > your use case. > > > > Michael > > > > > > On Sun, Apr 13, 2014 at 10:14 PM, Katja Hebestreit <katjah@stanford.edu> > >wrote: > > > > > Actually, the error was not reproducible with the lines I attached. But > > it > > > is reproducible with those lines (four additional lines): > > > > > > chr1 mm9_refFlat stop_codon 3206103 3206105 0.000000 > - > > > . gene_id "Xkr4"; transcript_id "Xkr4"; > > > chr1 mm9_refFlat CDS 3206106 3207049 0.000000 - > 2 > > > gene_id "Xkr4"; transcript_id "Xkr4"; > > > chr1 mm9_refFlat exon 3204563 3207049 0.000000 - > . > > > gene_id "Xkr4"; transcript_id "Xkr4"; > > > chr1 mm9_refFlat CDS 3411783 3411982 0.000000 - > 1 > > > gene_id "Xkr4"; transcript_id "Xkr4"; > > > chr1 mm9_refFlat exon 3411783 3411982 0.000000 - > . > > > gene_id "Xkr4"; transcript_id "Xkr4"; > > > chr1 mm9_refFlat CDS 3660633 3661429 0.000000 - > 0 > > > gene_id "Xkr4"; transcript_id "Xkr4"; > > > chr1 mm9_refFlat start_codon 3661427 3661429 0.000000 > - > > > . gene_id "Xkr4"; transcript_id "Xkr4"; > > > chr1 mm9_refFlat exon 3660633 3661579 0.000000 - > . > > > gene_id "Xkr4"; transcript_id "Xkr4"; > > > chr1 mm9_refFlat stop_codon 4283062 4283064 0.000000 > - > > > . gene_id "Rp1"; transcript_id "Rp1"; > > > chr1 mm9_refFlat CDS 4283065 4283093 0.000000 - > 2 > > > gene_id "Rp1"; transcript_id "Rp1"; > > > > > > Let me know if you like to get the entire file. > > > > > > Thank you!! > > > Katja > > > > > > ----- Original Message ----- > > > From: "Michael Lawrence" <lawrence.michael@gene.com> > > > To: "Katja Hebestreit" <katjah@stanford.edu> > > > Cc: bioconductor@r-project.org, "Rsamtools Maintainer" < > > > maintainer@bioconductor.org> > > > Sent: Sunday, April 13, 2014 10:02:13 PM > > > Subject: Re: [BioC] GenomicFeatures: Problem with > makeTranscriptDbFromGFF > > > > > > On Sun, Apr 13, 2014 at 7:18 PM, Katja Hebestreit <katjah@stanford.edu> > > >wrote: > > > > > > > Hello, > > > > > > > > I get an error when I try to import my gff file: > > > > > > > > txdb <- makeTranscriptDbFromGFF(file="file.gtf", format="gtf") > > > > > > > > Error in .parse_attrCol(attrCol, file, colnames) : > > > > Some attributes do not conform to 'tag value' format > > > > > > > > This is how my file looks like: > > > > > > > > chr1 mm9_refFlat stop_codon 3206103 3206105 0.000000 > > - > > > > . gene_id "Xkr4"; transcript_id "Xkr4"; > > > > chr1 mm9_refFlat CDS 3206106 3207049 0.000000 - > > 2 > > > > gene_id "Xkr4"; transcript_id "Xkr4"; > > > > chr1 mm9_refFlat exon 3204563 3207049 0.000000 - > > . > > > > gene_id "Xkr4"; transcript_id "Xkr4"; > > > > chr1 mm9_refFlat CDS 3411783 3411982 0.000000 - > > 1 > > > > gene_id "Xkr4"; transcript_id "Xkr4"; > > > > chr1 mm9_refFlat exon 3411783 3411982 0.000000 - > > . > > > > gene_id "Xkr4"; transcript_id "Xkr4"; > > > > chr1 mm9_refFlat CDS 3660633 3661429 0.000000 - > > 0 > > > > gene_id "Xkr4"; transcript_id "Xkr4"; > > > > > > > > I have the feeling that this has something to do with the missing > exon > > > > rank information in my file. Is that true? Is there a way to import > > this > > > > file? All I want to do is to determine the gene lengths. > > > > > > > > > > It is most likely as the error says: some of your attributes are > > malformed. > > > Is that the entire file listed above, or is there more? If you could > get > > me > > > the file somehow I could diagnose the issue. > > > > > > > > > > > > > > Could anyone help? That would be awesome! > > > > Cheers, > > > > Katja > > > > > > > > > > > > sessionInfo() > > > > R version 3.1.0 (2014-04-10) > > > > Platform: x86_64-unknown-linux-gnu (64-bit) > > > > > > > > locale: > > > > [1] LC_CTYPE=de_DE.UTF-8 LC_NUMERIC=C > > > > [3] LC_TIME=de_DE.UTF-8 LC_COLLATE=de_DE.UTF-8 > > > > [5] LC_MONETARY=de_DE.UTF-8 LC_MESSAGES=de_DE.UTF-8 > > > > [7] LC_PAPER=de_DE.UTF-8 LC_NAME=C > > > > [9] LC_ADDRESS=C LC_TELEPHONE=C > > > > [11] LC_MEASUREMENT=de_DE.UTF-8 LC_IDENTIFICATION=C > > > > > > > > attached base packages: > > > > [1] parallel stats graphics grDevices utils datasets > methods > > > > [8] base > > > > > > > > other attached packages: > > > > [1] GenomicFeatures_1.16.0 AnnotationDbi_1.25.19 Biobase_2.23.6 > > > > [4] GenomicRanges_1.16.0 GenomeInfoDb_0.99.32 IRanges_1.21.45 > > > > [7] BiocGenerics_0.9.3 BiocInstaller_1.14.1 > > > > > > > > loaded via a namespace (and not attached): > > > > [1] BatchJobs_1.2 BBmisc_1.5 > BiocParallel_0.6.0 > > > > [4] biomaRt_2.20.0 Biostrings_2.32.0 bitops_1.0-6 > > > > [7] brew_1.0-6 BSgenome_1.32.0 codetools_0.2-8 > > > > [10] DBI_0.2-7 digest_0.6.4 fail_1.2 > > > > [13] foreach_1.4.2 GenomicAlignments_1.0.0 iterators_1.0.7 > > > > [16] plyr_1.8.1 Rcpp_0.11.1 RCurl_1.95-4.1 > > > > [19] Rsamtools_1.16.0 RSQLite_0.11.4 > rtracklayer_1.24.0 > > > > [22] sendmailR_1.1-2 stats4_3.1.0 stringr_0.6.2 > > > > [25] tools_3.1.0 XML_3.98-1.1 XVector_0.4.0 > > > > [28] zlibbioc_1.10.0 > > > > > > > > _______________________________________________ > > > > Bioconductor mailing list > > > > Bioconductor@r-project.org > > > > https://stat.ethz.ch/mailman/listinfo/bioconductor > > > > Search the archives: > > > > http://news.gmane.org/gmane.science.biology.informatics.conductor > > > > > > > > > > > _______________________________________________ > > Bioconductor mailing list > > Bioconductor@r-project.org > > https://stat.ethz.ch/mailman/listinfo/bioconductor > > Search the archives: > > http://news.gmane.org/gmane.science.biology.informatics.conductor > > > [[alternative HTML version deleted]]
ADD REPLYlink written 4.2 years ago by Michael Lawrence10.0k
0
gravatar for Martin Morgan
4.2 years ago by
Martin Morgan ♦♦ 21k
United States
Martin Morgan ♦♦ 21k wrote:
On 04/15/2014 10:53 AM, Maintainer wrote: > Dear Bioconductor maintainers, > > I still have a problem using makeTranscriptDbFromGFF: > > txdb <- makeTranscriptDbFromGFF(file="Data/mm9_test_noSpace.gtf", > format="gtf") > extracting transcript information > Estimating transcript ranges. > Error in res[i, ] <- .deduceTranscriptRangeData(subs[[i]]) : > number of items to replace is not a multiple of replacement > If I options(error=recover) then > xx = makeTranscriptDbFromGFF(file="mm9_test_noSpace.gtf", format="gtf") extracting transcript information Estimating transcript ranges. Error in res[i, ] <- .deduceTranscriptRangeData(subs[[i]]) : number of items to replace is not a multiple of replacement length Enter a frame number, or 0 to exit 1: makeTranscriptDbFromGFF(file = "mm9_test_noSpace.gtf", format = "gtf") 2: .prepareGTFTables(gff, exonRankAttributeName) 3: .prepareGTFtranscripts(data) 4: .deduceTranscriptsFromGTF(transcripts) Selection: 4 Called from: .deduceTranscriptsFromGTF(transcripts) Browse[1]> then look around a little, e.g., Browse[1]> .deduceTranscriptsFromGTF ## ... I spot the offending line for (i in seq_len(length(trns))) { res[i, ] <- .deduceTranscriptRangeData(subs[[i]]) and look at subs[[i]] and .deduceTranscriptRangeData(subs[[i]]) Browse[1]> subs[[i]] seqnames start end strand type gene_id transcript_id exon_rank 47369 chr2 36245431 36245515 + exon Mir684-1 Mir684-1 NA 147043 chr5 23763346 23763430 + exon Mir684-1 Mir684-1 NA Browse[1]> .deduceTranscriptRangeData(subs[[i]]) [1] "Mir684-1" "chr2" "chr5" "+" "23763346" "36245515" "Mir684-1" I guess the problem is that the exons are coming from two separate chromosomes, and .deduceTranscriptRangeData does not handle that case successfully. Hopefully the code will be made more robust / efficient, but in the mean time you could edit your file to remove the troublesome Mir684-1. Martin > I uploaded the file here: > https://www.dropbox.com/s/x5ne8qz8sqbxvrp/mm9_test_noSpace.gtf > > I tried to reduce the file further, in order to get a smaller example file and to check which lines cause the problem, but: > mm9_test_noSpace.gtf is 200,000 lines long and it results in an error. > When I am using the first 100,000 (or, last 100,000 lines, respectively) it works! > > head -n 100000 mm9_test_noSpace.gtf > mm9_test_test_noSpace.gtf > --> works! > tail -n 100000 mm9_test_noSpace.gtf > mm9_test_test_noSpace.gtf > --> works! > > I have no idea what is going on. Also, the .gtf file is from UCSC. > > Do you have any idea what the problem is? > Thank you so much! > Katja > > > ----- Forwarded Message ----- > From: "Michael Lawrence" <lawrence.michael at="" gene.com=""> > To: "Katja Hebestreit" <katjah at="" stanford.edu=""> > Cc: "Michael Lawrence" <lawrence.michael at="" gene.com=""> > Sent: Monday, April 14, 2014 7:37:23 PM > Subject: Re: [BioC] GenomicFeatures: Problem with makeTranscriptDbFromGFF > > Sorry, I hadn't tested the makeTranscriptDbFromGFF function. I didn't write > that one, so the maintainers of GenomicFeatures will need to help now. > > > On Mon, Apr 14, 2014 at 5:31 PM, Katja Hebestreit <katjah at="" stanford.edu="">wrote: > >> Great! Thank you so much, Micheal! >> >> Unfortunately, I still have problems with this file: >> >> txdb <- makeTranscriptDbFromGFF(file="Data/mm9_test_noSpace.gtf", >> format="gtf") >> extracting transcript information >> Estimating transcript ranges. >> Error in res[i, ] <- .deduceTranscriptRangeData(subs[[i]]) : >> number of items to replace is not a multiple of replacement >> >> To determine the gene lengths I wanted to use the exact .gtf file we used >> for counting the reads per gene some time ago. >> >> >> ----- Original Message ----- >> From: "Michael Lawrence" <lawrence.michael at="" gene.com=""> >> To: "Katja Hebestreit" <katjah at="" stanford.edu=""> >> Cc: "Vincent Carey" <stvjc at="" channing.harvard.edu="">, "Michael Lawrence" < >> lawrence.michael at gene.com>, "Rsamtools Maintainer" < >> maintainer at bioconductor.org>, bioconductor at r-project.org >> Sent: Monday, April 14, 2014 2:55:06 PM >> Subject: Re: [BioC] GenomicFeatures: Problem with makeTranscriptDbFromGFF >> >> I've made rtracklayer 1.25.1 ignore whitespace after the last attribute. >> After that change, both the truncated and full file parse successfully. >> Btw, which genes are these? The UCSC genes are already available as a TxDb >> package. >> >> >> On Mon, Apr 14, 2014 at 2:36 PM, Katja Hebestreit <katjah at="" stanford.edu="">>> wrote: >> >>> Okay, I removed the whitspaces with >>> sed 's/.$//' mm9_test.gtf > mm9_test_noSpace.gtf >>> >>> Here is the file: >>> https://www.dropbox.com/s/x5ne8qz8sqbxvrp/mm9_test_noSpace.gtf >>> >>> But still, I get an error: >>> >>> txdb <- makeTranscriptDbFromGFF(file="Data/mm9_test_noSpace.gtf", >>> format="gtf") >>> extracting transcript information >>> Estimating transcript ranges. >>> Error in res[i, ] <- .deduceTranscriptRangeData(subs[[i]]) : >>> number of items to replace is not a multiple of replacement >>> >>> >>> I tried to reduce the file further, in order to get a smaller example >> file >>> and to check which lines cause the problem, but: >>> >>> mm9_test_noSpace.gtf is 200,000 lines long and it results in an error. >>> When I am using the first 100,000 (or, last 100,000 lines, respectively) >> it >>> works! >>> >>> head -n 100000 mm9_test_noSpace.gtf > mm9_test_test_noSpace.gtf >>> --> works! >>> tail -n 100000 mm9_test_noSpace.gtf > mm9_test_test_noSpace.gtf >>> --> works! >>> >>> I have no idea what is going on. Also, the .gtf file is from UCSC. >>> >>> Thanks again for helping with that annoying problem! >>> Katja >>> >>> >>> ----- Original Message ----- >>> From: "Vincent Carey" <stvjc at="" channing.harvard.ed="">>> To: "Katja Hebestreit" <katjah at="" stanford.edu=""> >>> Cc: "Michael Lawrence" <lawrence.michael at="" gene.com="">, "Rsamtools >>> Maintainer" <maintainer at="" bioconductor.org="">, bioconductor at r-project.org >>> Sent: Monday, April 14, 2014 11:45:30 AM >>> Subject: Re: [BioC] GenomicFeatures: Problem with makeTranscriptDbFromGFF >>> >>> remove the trailing whitespace at the end of every line >>> >>> >>> On Mon, Apr 14, 2014 at 2:24 PM, Katja Hebestreit <katjah at="" stanford.edu="">>>> wrote: >>> >>>> You can download the file here: >>>> >>>> https://www.dropbox.com/s/04nck83jq6r91bc/mm9_test.gtf >>>> >>>> Using file I get the error: >>>> >>>> txdb <- makeTranscriptDbFromGFF(file="Data/mm9_test.gtf", format="gtf") >>>> Error in .parse_attrCol(attrCol, file, colnames) : >>>> Some attributes do not conform to 'tag value' format >>>> >>>> Thank you so much for helping!! >>>> Katja >>>> >>>> >>>> ----- Original Message ----- >>>> From: "Michael Lawrence" <lawrence.michael at="" gene.com=""> >>>> To: "Katja Hebestreit" <katjah at="" stanford.edu=""> >>>> Cc: "Michael Lawrence" <lawrence.michael at="" gene.com="">, >>>> bioconductor at r-project.org, "Rsamtools Maintainer" < >>>> maintainer at bioconductor.org> >>>> Sent: Monday, April 14, 2014 7:27:26 AM >>>> Subject: Re: [BioC] GenomicFeatures: Problem with >> makeTranscriptDbFromGFF >>>> >>>> Well, I copied the text and replaced the spaces with tabs as >> appropriate >>>> and everything seemed to work fine, so you might to attach that >> fragment >>> of >>>> the file, just to be sure it isn't a formatting issue. >>>> >>>> Does import("file.gtf") work for you? If so, that should be good enough >>> for >>>> your use case. >>>> >>>> Michael >>>> >>>> >>>> On Sun, Apr 13, 2014 at 10:14 PM, Katja Hebestreit < >> katjah at stanford.edu >>>>> wrote: >>>> >>>>> Actually, the error was not reproducible with the lines I attached. >> But >>>> it >>>>> is reproducible with those lines (four additional lines): >>>>> >>>>> chr1 mm9_refFlat stop_codon 3206103 3206105 0.000000 >>> - >>>>> . gene_id "Xkr4"; transcript_id "Xkr4"; >>>>> chr1 mm9_refFlat CDS 3206106 3207049 0.000000 - >>> 2 >>>>> gene_id "Xkr4"; transcript_id "Xkr4"; >>>>> chr1 mm9_refFlat exon 3204563 3207049 0.000000 - >>> . >>>>> gene_id "Xkr4"; transcript_id "Xkr4"; >>>>> chr1 mm9_refFlat CDS 3411783 3411982 0.000000 - >>> 1 >>>>> gene_id "Xkr4"; transcript_id "Xkr4"; >>>>> chr1 mm9_refFlat exon 3411783 3411982 0.000000 - >>> . >>>>> gene_id "Xkr4"; transcript_id "Xkr4"; >>>>> chr1 mm9_refFlat CDS 3660633 3661429 0.000000 - >>> 0 >>>>> gene_id "Xkr4"; transcript_id "Xkr4"; >>>>> chr1 mm9_refFlat start_codon 3661427 3661429 0.000000 >>> - >>>>> . gene_id "Xkr4"; transcript_id "Xkr4"; >>>>> chr1 mm9_refFlat exon 3660633 3661579 0.000000 - >>> . >>>>> gene_id "Xkr4"; transcript_id "Xkr4"; >>>>> chr1 mm9_refFlat stop_codon 4283062 4283064 0.000000 >>> - >>>>> . gene_id "Rp1"; transcript_id "Rp1"; >>>>> chr1 mm9_refFlat CDS 4283065 4283093 0.000000 - >>> 2 >>>>> gene_id "Rp1"; transcript_id "Rp1"; >>>>> >>>>> Let me know if you like to get the entire file. >>>>> >>>>> Thank you!! >>>>> Katja >>>>> >>>>> ----- Original Message ----- >>>>> From: "Michael Lawrence" <lawrence.michael at="" gene.com=""> >>>>> To: "Katja Hebestreit" <katjah at="" stanford.edu=""> >>>>> Cc: bioconductor at r-project.org, "Rsamtools Maintainer" < >>>>> maintainer at bioconductor.org> >>>>> Sent: Sunday, April 13, 2014 10:02:13 PM >>>>> Subject: Re: [BioC] GenomicFeatures: Problem with >>> makeTranscriptDbFromGFF >>>>> >>>>> On Sun, Apr 13, 2014 at 7:18 PM, Katja Hebestreit < >> katjah at stanford.edu >>>>>> wrote: >>>>> >>>>>> Hello, >>>>>> >>>>>> I get an error when I try to import my gff file: >>>>>> >>>>>> txdb <- makeTranscriptDbFromGFF(file="file.gtf", format="gtf") >>>>>> >>>>>> Error in .parse_attrCol(attrCol, file, colnames) : >>>>>> Some attributes do not conform to 'tag value' format >>>>>> >>>>>> This is how my file looks like: >>>>>> >>>>>> chr1 mm9_refFlat stop_codon 3206103 3206105 0.000000 >>>> - >>>>>> . gene_id "Xkr4"; transcript_id "Xkr4"; >>>>>> chr1 mm9_refFlat CDS 3206106 3207049 0.000000 - >>>> 2 >>>>>> gene_id "Xkr4"; transcript_id "Xkr4"; >>>>>> chr1 mm9_refFlat exon 3204563 3207049 0.000000 - >>>> . >>>>>> gene_id "Xkr4"; transcript_id "Xkr4"; >>>>>> chr1 mm9_refFlat CDS 3411783 3411982 0.000000 - >>>> 1 >>>>>> gene_id "Xkr4"; transcript_id "Xkr4"; >>>>>> chr1 mm9_refFlat exon 3411783 3411982 0.000000 - >>>> . >>>>>> gene_id "Xkr4"; transcript_id "Xkr4"; >>>>>> chr1 mm9_refFlat CDS 3660633 3661429 0.000000 - >>>> 0 >>>>>> gene_id "Xkr4"; transcript_id "Xkr4"; >>>>>> >>>>>> I have the feeling that this has something to do with the missing >>> exon >>>>>> rank information in my file. Is that true? Is there a way to import >>>> this >>>>>> file? All I want to do is to determine the gene lengths. >>>>>> >>>>> >>>>> It is most likely as the error says: some of your attributes are >>>> malformed. >>>>> Is that the entire file listed above, or is there more? If you could >>> get >>>> me >>>>> the file somehow I could diagnose the issue. >>>>> >>>>> >>>>>> >>>>>> Could anyone help? That would be awesome! >>>>>> Cheers, >>>>>> Katja >>>>>> >>>>>> >>>>>> sessionInfo() >>>>>> R version 3.1.0 (2014-04-10) >>>>>> Platform: x86_64-unknown-linux-gnu (64-bit) >>>>>> >>>>>> locale: >>>>>> [1] LC_CTYPE=de_DE.UTF-8 LC_NUMERIC=C >>>>>> [3] LC_TIME=de_DE.UTF-8 LC_COLLATE=de_DE.UTF-8 >>>>>> [5] LC_MONETARY=de_DE.UTF-8 LC_MESSAGES=de_DE.UTF-8 >>>>>> [7] LC_PAPER=de_DE.UTF-8 LC_NAME=C >>>>>> [9] LC_ADDRESS=C LC_TELEPHONE=C >>>>>> [11] LC_MEASUREMENT=de_DE.UTF-8 LC_IDENTIFICATION=C >>>>>> >>>>>> attached base packages: >>>>>> [1] parallel stats graphics grDevices utils datasets >>> methods >>>>>> [8] base >>>>>> >>>>>> other attached packages: >>>>>> [1] GenomicFeatures_1.16.0 AnnotationDbi_1.25.19 Biobase_2.23.6 >>>>>> [4] GenomicRanges_1.16.0 GenomeInfoDb_0.99.32 IRanges_1.21.45 >>>>>> [7] BiocGenerics_0.9.3 BiocInstaller_1.14.1 >>>>>> >>>>>> loaded via a namespace (and not attached): >>>>>> [1] BatchJobs_1.2 BBmisc_1.5 >>> BiocParallel_0.6.0 >>>>>> [4] biomaRt_2.20.0 Biostrings_2.32.0 bitops_1.0-6 >>>>>> [7] brew_1.0-6 BSgenome_1.32.0 >> codetools_0.2-8 >>>>>> [10] DBI_0.2-7 digest_0.6.4 fail_1.2 >>>>>> [13] foreach_1.4.2 GenomicAlignments_1.0.0 >> iterators_1.0.7 >>>>>> [16] plyr_1.8.1 Rcpp_0.11.1 RCurl_1.95-4.1 >>>>>> [19] Rsamtools_1.16.0 RSQLite_0.11.4 >>> rtracklayer_1.24.0 >>>>>> [22] sendmailR_1.1-2 stats4_3.1.0 stringr_0.6.2 >>>>>> [25] tools_3.1.0 XML_3.98-1.1 XVector_0.4.0 >>>>>> [28] zlibbioc_1.10.0 >>>>>> >>>>>> _______________________________________________ >>>>>> Bioconductor mailing list >>>>>> Bioconductor at r-project.org >>>>>> https://stat.ethz.ch/mailman/listinfo/bioconductor >>>>>> Search the archives: >>>>>> http://news.gmane.org/gmane.science.biology.informatics.conductor >>>>>> >>>>> >>>> >>>> _______________________________________________ >>>> Bioconductor mailing list >>>> Bioconductor at r-project.org >>>> https://stat.ethz.ch/mailman/listinfo/bioconductor >>>> Search the archives: >>>> http://news.gmane.org/gmane.science.biology.informatics.conductor >>>> >>> >> > > ____________________________________________________________________ ____ > devteam-bioc mailing list > To unsubscribe from this mailing list send a blank email to > devteam-bioc-leave at lists.fhcrc.org > You can also unsubscribe or change your personal options at > https://lists.fhcrc.org/mailman/listinfo/devteam-bioc > -- Computational Biology / Fred Hutchinson Cancer Research Center 1100 Fairview Ave. N. PO Box 19024 Seattle, WA 98109 Location: Arnold Building M1 B861 Phone: (206) 667-2793
ADD COMMENTlink written 4.2 years ago by Martin Morgan ♦♦ 21k
Thank you so much for pointing that out!! Unfortunately, there are a bunch of genes with exomes on different chromosomes. And this is also the case for newer versions of that .gtf file (refFlat from UCSC). Do you have an idea when you will fix that? Cheers, Katja ----- Original Message ----- From: "Martin Morgan" <mtmorgan@fhcrc.org> To: katjah at stanford.edu, bioconductor at r-project.org Sent: Tuesday, April 15, 2014 1:27:30 PM Subject: Re: [BioC] GenomicFeatures: Problem with makeTranscriptDbFromGFF On 04/15/2014 10:53 AM, Maintainer wrote: > Dear Bioconductor maintainers, > > I still have a problem using makeTranscriptDbFromGFF: > > txdb <- makeTranscriptDbFromGFF(file="Data/mm9_test_noSpace.gtf", > format="gtf") > extracting transcript information > Estimating transcript ranges. > Error in res[i, ] <- .deduceTranscriptRangeData(subs[[i]]) : > number of items to replace is not a multiple of replacement > If I options(error=recover) then > xx = makeTranscriptDbFromGFF(file="mm9_test_noSpace.gtf", format="gtf") extracting transcript information Estimating transcript ranges. Error in res[i, ] <- .deduceTranscriptRangeData(subs[[i]]) : number of items to replace is not a multiple of replacement length Enter a frame number, or 0 to exit 1: makeTranscriptDbFromGFF(file = "mm9_test_noSpace.gtf", format = "gtf") 2: .prepareGTFTables(gff, exonRankAttributeName) 3: .prepareGTFtranscripts(data) 4: .deduceTranscriptsFromGTF(transcripts) Selection: 4 Called from: .deduceTranscriptsFromGTF(transcripts) Browse[1]> then look around a little, e.g., Browse[1]> .deduceTranscriptsFromGTF ## ... I spot the offending line for (i in seq_len(length(trns))) { res[i, ] <- .deduceTranscriptRangeData(subs[[i]]) and look at subs[[i]] and .deduceTranscriptRangeData(subs[[i]]) Browse[1]> subs[[i]] seqnames start end strand type gene_id transcript_id exon_rank 47369 chr2 36245431 36245515 + exon Mir684-1 Mir684-1 NA 147043 chr5 23763346 23763430 + exon Mir684-1 Mir684-1 NA Browse[1]> .deduceTranscriptRangeData(subs[[i]]) [1] "Mir684-1" "chr2" "chr5" "+" "23763346" "36245515" "Mir684-1" I guess the problem is that the exons are coming from two separate chromosomes, and .deduceTranscriptRangeData does not handle that case successfully. Hopefully the code will be made more robust / efficient, but in the mean time you could edit your file to remove the troublesome Mir684-1. Martin > I uploaded the file here: > https://www.dropbox.com/s/x5ne8qz8sqbxvrp/mm9_test_noSpace.gtf > > I tried to reduce the file further, in order to get a smaller example file and to check which lines cause the problem, but: > mm9_test_noSpace.gtf is 200,000 lines long and it results in an error. > When I am using the first 100,000 (or, last 100,000 lines, respectively) it works! > > head -n 100000 mm9_test_noSpace.gtf > mm9_test_test_noSpace.gtf > --> works! > tail -n 100000 mm9_test_noSpace.gtf > mm9_test_test_noSpace.gtf > --> works! > > I have no idea what is going on. Also, the .gtf file is from UCSC. > > Do you have any idea what the problem is? > Thank you so much! > Katja > > > ----- Forwarded Message ----- > From: "Michael Lawrence" <lawrence.michael at="" gene.com=""> > To: "Katja Hebestreit" <katjah at="" stanford.edu=""> > Cc: "Michael Lawrence" <lawrence.michael at="" gene.com=""> > Sent: Monday, April 14, 2014 7:37:23 PM > Subject: Re: [BioC] GenomicFeatures: Problem with makeTranscriptDbFromGFF > > Sorry, I hadn't tested the makeTranscriptDbFromGFF function. I didn't write > that one, so the maintainers of GenomicFeatures will need to help now. > > > On Mon, Apr 14, 2014 at 5:31 PM, Katja Hebestreit <katjah at="" stanford.edu="">wrote: > >> Great! Thank you so much, Micheal! >> >> Unfortunately, I still have problems with this file: >> >> txdb <- makeTranscriptDbFromGFF(file="Data/mm9_test_noSpace.gtf", >> format="gtf") >> extracting transcript information >> Estimating transcript ranges. >> Error in res[i, ] <- .deduceTranscriptRangeData(subs[[i]]) : >> number of items to replace is not a multiple of replacement >> >> To determine the gene lengths I wanted to use the exact .gtf file we used >> for counting the reads per gene some time ago. >> >> >> ----- Original Message ----- >> From: "Michael Lawrence" <lawrence.michael at="" gene.com=""> >> To: "Katja Hebestreit" <katjah at="" stanford.edu=""> >> Cc: "Vincent Carey" <stvjc at="" channing.harvard.edu="">, "Michael Lawrence" < >> lawrence.michael at gene.com>, "Rsamtools Maintainer" < >> maintainer at bioconductor.org>, bioconductor at r-project.org >> Sent: Monday, April 14, 2014 2:55:06 PM >> Subject: Re: [BioC] GenomicFeatures: Problem with makeTranscriptDbFromGFF >> >> I've made rtracklayer 1.25.1 ignore whitespace after the last attribute. >> After that change, both the truncated and full file parse successfully. >> Btw, which genes are these? The UCSC genes are already available as a TxDb >> package. >> >> >> On Mon, Apr 14, 2014 at 2:36 PM, Katja Hebestreit <katjah at="" stanford.edu="">>> wrote: >> >>> Okay, I removed the whitspaces with >>> sed 's/.$//' mm9_test.gtf > mm9_test_noSpace.gtf >>> >>> Here is the file: >>> https://www.dropbox.com/s/x5ne8qz8sqbxvrp/mm9_test_noSpace.gtf >>> >>> But still, I get an error: >>> >>> txdb <- makeTranscriptDbFromGFF(file="Data/mm9_test_noSpace.gtf", >>> format="gtf") >>> extracting transcript information >>> Estimating transcript ranges. >>> Error in res[i, ] <- .deduceTranscriptRangeData(subs[[i]]) : >>> number of items to replace is not a multiple of replacement >>> >>> >>> I tried to reduce the file further, in order to get a smaller example >> file >>> and to check which lines cause the problem, but: >>> >>> mm9_test_noSpace.gtf is 200,000 lines long and it results in an error. >>> When I am using the first 100,000 (or, last 100,000 lines, respectively) >> it >>> works! >>> >>> head -n 100000 mm9_test_noSpace.gtf > mm9_test_test_noSpace.gtf >>> --> works! >>> tail -n 100000 mm9_test_noSpace.gtf > mm9_test_test_noSpace.gtf >>> --> works! >>> >>> I have no idea what is going on. Also, the .gtf file is from UCSC. >>> >>> Thanks again for helping with that annoying problem! >>> Katja >>> >>> >>> ----- Original Message ----- >>> From: "Vincent Carey" <stvjc at="" channing.harvard.ed="">>> To: "Katja Hebestreit" <katjah at="" stanford.edu=""> >>> Cc: "Michael Lawrence" <lawrence.michael at="" gene.com="">, "Rsamtools >>> Maintainer" <maintainer at="" bioconductor.org="">, bioconductor at r-project.org >>> Sent: Monday, April 14, 2014 11:45:30 AM >>> Subject: Re: [BioC] GenomicFeatures: Problem with makeTranscriptDbFromGFF >>> >>> remove the trailing whitespace at the end of every line >>> >>> >>> On Mon, Apr 14, 2014 at 2:24 PM, Katja Hebestreit <katjah at="" stanford.edu="">>>> wrote: >>> >>>> You can download the file here: >>>> >>>> https://www.dropbox.com/s/04nck83jq6r91bc/mm9_test.gtf >>>> >>>> Using file I get the error: >>>> >>>> txdb <- makeTranscriptDbFromGFF(file="Data/mm9_test.gtf", format="gtf") >>>> Error in .parse_attrCol(attrCol, file, colnames) : >>>> Some attributes do not conform to 'tag value' format >>>> >>>> Thank you so much for helping!! >>>> Katja >>>> >>>> >>>> ----- Original Message ----- >>>> From: "Michael Lawrence" <lawrence.michael at="" gene.com=""> >>>> To: "Katja Hebestreit" <katjah at="" stanford.edu=""> >>>> Cc: "Michael Lawrence" <lawrence.michael at="" gene.com="">, >>>> bioconductor at r-project.org, "Rsamtools Maintainer" < >>>> maintainer at bioconductor.org> >>>> Sent: Monday, April 14, 2014 7:27:26 AM >>>> Subject: Re: [BioC] GenomicFeatures: Problem with >> makeTranscriptDbFromGFF >>>> >>>> Well, I copied the text and replaced the spaces with tabs as >> appropriate >>>> and everything seemed to work fine, so you might to attach that >> fragment >>> of >>>> the file, just to be sure it isn't a formatting issue. >>>> >>>> Does import("file.gtf") work for you? If so, that should be good enough >>> for >>>> your use case. >>>> >>>> Michael >>>> >>>> >>>> On Sun, Apr 13, 2014 at 10:14 PM, Katja Hebestreit < >> katjah at stanford.edu >>>>> wrote: >>>> >>>>> Actually, the error was not reproducible with the lines I attached. >> But >>>> it >>>>> is reproducible with those lines (four additional lines): >>>>> >>>>> chr1 mm9_refFlat stop_codon 3206103 3206105 0.000000 >>> - >>>>> . gene_id "Xkr4"; transcript_id "Xkr4"; >>>>> chr1 mm9_refFlat CDS 3206106 3207049 0.000000 - >>> 2 >>>>> gene_id "Xkr4"; transcript_id "Xkr4"; >>>>> chr1 mm9_refFlat exon 3204563 3207049 0.000000 - >>> . >>>>> gene_id "Xkr4"; transcript_id "Xkr4"; >>>>> chr1 mm9_refFlat CDS 3411783 3411982 0.000000 - >>> 1 >>>>> gene_id "Xkr4"; transcript_id "Xkr4"; >>>>> chr1 mm9_refFlat exon 3411783 3411982 0.000000 - >>> . >>>>> gene_id "Xkr4"; transcript_id "Xkr4"; >>>>> chr1 mm9_refFlat CDS 3660633 3661429 0.000000 - >>> 0 >>>>> gene_id "Xkr4"; transcript_id "Xkr4"; >>>>> chr1 mm9_refFlat start_codon 3661427 3661429 0.000000 >>> - >>>>> . gene_id "Xkr4"; transcript_id "Xkr4"; >>>>> chr1 mm9_refFlat exon 3660633 3661579 0.000000 - >>> . >>>>> gene_id "Xkr4"; transcript_id "Xkr4"; >>>>> chr1 mm9_refFlat stop_codon 4283062 4283064 0.000000 >>> - >>>>> . gene_id "Rp1"; transcript_id "Rp1"; >>>>> chr1 mm9_refFlat CDS 4283065 4283093 0.000000 - >>> 2 >>>>> gene_id "Rp1"; transcript_id "Rp1"; >>>>> >>>>> Let me know if you like to get the entire file. >>>>> >>>>> Thank you!! >>>>> Katja >>>>> >>>>> ----- Original Message ----- >>>>> From: "Michael Lawrence" <lawrence.michael at="" gene.com=""> >>>>> To: "Katja Hebestreit" <katjah at="" stanford.edu=""> >>>>> Cc: bioconductor at r-project.org, "Rsamtools Maintainer" < >>>>> maintainer at bioconductor.org> >>>>> Sent: Sunday, April 13, 2014 10:02:13 PM >>>>> Subject: Re: [BioC] GenomicFeatures: Problem with >>> makeTranscriptDbFromGFF >>>>> >>>>> On Sun, Apr 13, 2014 at 7:18 PM, Katja Hebestreit < >> katjah at stanford.edu >>>>>> wrote: >>>>> >>>>>> Hello, >>>>>> >>>>>> I get an error when I try to import my gff file: >>>>>> >>>>>> txdb <- makeTranscriptDbFromGFF(file="file.gtf", format="gtf") >>>>>> >>>>>> Error in .parse_attrCol(attrCol, file, colnames) : >>>>>> Some attributes do not conform to 'tag value' format >>>>>> >>>>>> This is how my file looks like: >>>>>> >>>>>> chr1 mm9_refFlat stop_codon 3206103 3206105 0.000000 >>>> - >>>>>> . gene_id "Xkr4"; transcript_id "Xkr4"; >>>>>> chr1 mm9_refFlat CDS 3206106 3207049 0.000000 - >>>> 2 >>>>>> gene_id "Xkr4"; transcript_id "Xkr4"; >>>>>> chr1 mm9_refFlat exon 3204563 3207049 0.000000 - >>>> . >>>>>> gene_id "Xkr4"; transcript_id "Xkr4"; >>>>>> chr1 mm9_refFlat CDS 3411783 3411982 0.000000 - >>>> 1 >>>>>> gene_id "Xkr4"; transcript_id "Xkr4"; >>>>>> chr1 mm9_refFlat exon 3411783 3411982 0.000000 - >>>> . >>>>>> gene_id "Xkr4"; transcript_id "Xkr4"; >>>>>> chr1 mm9_refFlat CDS 3660633 3661429 0.000000 - >>>> 0 >>>>>> gene_id "Xkr4"; transcript_id "Xkr4"; >>>>>> >>>>>> I have the feeling that this has something to do with the missing >>> exon >>>>>> rank information in my file. Is that true? Is there a way to import >>>> this >>>>>> file? All I want to do is to determine the gene lengths. >>>>>> >>>>> >>>>> It is most likely as the error says: some of your attributes are >>>> malformed. >>>>> Is that the entire file listed above, or is there more? If you could >>> get >>>> me >>>>> the file somehow I could diagnose the issue. >>>>> >>>>> >>>>>> >>>>>> Could anyone help? That would be awesome! >>>>>> Cheers, >>>>>> Katja >>>>>> >>>>>> >>>>>> sessionInfo() >>>>>> R version 3.1.0 (2014-04-10) >>>>>> Platform: x86_64-unknown-linux-gnu (64-bit) >>>>>> >>>>>> locale: >>>>>> [1] LC_CTYPE=de_DE.UTF-8 LC_NUMERIC=C >>>>>> [3] LC_TIME=de_DE.UTF-8 LC_COLLATE=de_DE.UTF-8 >>>>>> [5] LC_MONETARY=de_DE.UTF-8 LC_MESSAGES=de_DE.UTF-8 >>>>>> [7] LC_PAPER=de_DE.UTF-8 LC_NAME=C >>>>>> [9] LC_ADDRESS=C LC_TELEPHONE=C >>>>>> [11] LC_MEASUREMENT=de_DE.UTF-8 LC_IDENTIFICATION=C >>>>>> >>>>>> attached base packages: >>>>>> [1] parallel stats graphics grDevices utils datasets >>> methods >>>>>> [8] base >>>>>> >>>>>> other attached packages: >>>>>> [1] GenomicFeatures_1.16.0 AnnotationDbi_1.25.19 Biobase_2.23.6 >>>>>> [4] GenomicRanges_1.16.0 GenomeInfoDb_0.99.32 IRanges_1.21.45 >>>>>> [7] BiocGenerics_0.9.3 BiocInstaller_1.14.1 >>>>>> >>>>>> loaded via a namespace (and not attached): >>>>>> [1] BatchJobs_1.2 BBmisc_1.5 >>> BiocParallel_0.6.0 >>>>>> [4] biomaRt_2.20.0 Biostrings_2.32.0 bitops_1.0-6 >>>>>> [7] brew_1.0-6 BSgenome_1.32.0 >> codetools_0.2-8 >>>>>> [10] DBI_0.2-7 digest_0.6.4 fail_1.2 >>>>>> [13] foreach_1.4.2 GenomicAlignments_1.0.0 >> iterators_1.0.7 >>>>>> [16] plyr_1.8.1 Rcpp_0.11.1 RCurl_1.95-4.1 >>>>>> [19] Rsamtools_1.16.0 RSQLite_0.11.4 >>> rtracklayer_1.24.0 >>>>>> [22] sendmailR_1.1-2 stats4_3.1.0 stringr_0.6.2 >>>>>> [25] tools_3.1.0 XML_3.98-1.1 XVector_0.4.0 >>>>>> [28] zlibbioc_1.10.0 >>>>>> >>>>>> _______________________________________________ >>>>>> Bioconductor mailing list >>>>>> Bioconductor at r-project.org >>>>>> https://stat.ethz.ch/mailman/listinfo/bioconductor >>>>>> Search the archives: >>>>>> http://news.gmane.org/gmane.science.biology.informatics.conductor >>>>>> >>>>> >>>> >>>> _______________________________________________ >>>> Bioconductor mailing list >>>> Bioconductor at r-project.org >>>> https://stat.ethz.ch/mailman/listinfo/bioconductor >>>> Search the archives: >>>> http://news.gmane.org/gmane.science.biology.informatics.conductor >>>> >>> >> > > ____________________________________________________________________ ____ > devteam-bioc mailing list > To unsubscribe from this mailing list send a blank email to > devteam-bioc-leave at lists.fhcrc.org > You can also unsubscribe or change your personal options at > https://lists.fhcrc.org/mailman/listinfo/devteam-bioc > -- Computational Biology / Fred Hutchinson Cancer Research Center 1100 Fairview Ave. N. PO Box 19024 Seattle, WA 98109 Location: Arnold Building M1 B861 Phone: (206) 667-2793
ADD REPLYlink written 4.2 years ago by Katja Hebestreit110
Hi Katja, Sorry about the earlier confusion about where you posted your question. Because you posted the same question to two different mailing lists (double post), my email client 'lost' the thread that went to our regular mailing list. Also thank you for mentioning this problem with deducing the transcripts. However my solution for this problem will not be very ideal since it is fundamentally impossible to guess the transcripts exon ordering if the exons are not even on the same chromosomes. So if your GFF file does not contain exon ordering for the transcripts then I fear that you are out of luck (as the information is clearly necessary and also not available). To make the code more robust I will look into catching this situation when it comes up and throwing out transcripts that are like that (with a warning). But this begs the question: do you really need a full TranscriptDb object? What were you planning to do with this? Marc On 04/15/2014 02:02 PM, Katja Hebestreit wrote: > Thank you so much for pointing that out!! > > Unfortunately, there are a bunch of genes with exomes on different chromosomes. And this is also the case for newer versions of that .gtf file (refFlat from UCSC). Do you have an idea when you will fix that? > > Cheers, > Katja > > ----- Original Message ----- > From: "Martin Morgan" <mtmorgan at="" fhcrc.org=""> > To: katjah at stanford.edu, bioconductor at r-project.org > Sent: Tuesday, April 15, 2014 1:27:30 PM > Subject: Re: [BioC] GenomicFeatures: Problem with makeTranscriptDbFromGFF > > On 04/15/2014 10:53 AM, Maintainer wrote: >> Dear Bioconductor maintainers, >> >> I still have a problem using makeTranscriptDbFromGFF: >> >> txdb <- makeTranscriptDbFromGFF(file="Data/mm9_test_noSpace.gtf", >> format="gtf") >> extracting transcript information >> Estimating transcript ranges. >> Error in res[i, ] <- .deduceTranscriptRangeData(subs[[i]]) : >> number of items to replace is not a multiple of replacement >> > If I > > options(error=recover) > > then > > > xx = makeTranscriptDbFromGFF(file="mm9_test_noSpace.gtf", format="gtf") > extracting transcript information > Estimating transcript ranges. > Error in res[i, ] <- .deduceTranscriptRangeData(subs[[i]]) : > number of items to replace is not a multiple of replacement length > > Enter a frame number, or 0 to exit > > 1: makeTranscriptDbFromGFF(file = "mm9_test_noSpace.gtf", format = "gtf") > 2: .prepareGTFTables(gff, exonRankAttributeName) > 3: .prepareGTFtranscripts(data) > 4: .deduceTranscriptsFromGTF(transcripts) > > Selection: 4 > Called from: .deduceTranscriptsFromGTF(transcripts) > Browse[1]> > > then look around a little, e.g., > > Browse[1]> .deduceTranscriptsFromGTF > ## ... > > I spot the offending line > > for (i in seq_len(length(trns))) { > res[i, ] <- .deduceTranscriptRangeData(subs[[i]]) > > and look at subs[[i]] and .deduceTranscriptRangeData(subs[[i]]) > > Browse[1]> subs[[i]] > seqnames start end strand type gene_id transcript_id exon_rank > 47369 chr2 36245431 36245515 + exon Mir684-1 Mir684-1 NA > 147043 chr5 23763346 23763430 + exon Mir684-1 Mir684-1 NA > Browse[1]> .deduceTranscriptRangeData(subs[[i]]) > [1] "Mir684-1" "chr2" "chr5" "+" "23763346" "36245515" "Mir684-1" > > I guess the problem is that the exons are coming from two separate chromosomes, > and .deduceTranscriptRangeData does not handle that case successfully. > > Hopefully the code will be made more robust / efficient, but in the mean time > you could edit your file to remove the troublesome Mir684-1. > > Martin > > >> I uploaded the file here: >> https://www.dropbox.com/s/x5ne8qz8sqbxvrp/mm9_test_noSpace.gtf >> >> I tried to reduce the file further, in order to get a smaller example file and to check which lines cause the problem, but: >> mm9_test_noSpace.gtf is 200,000 lines long and it results in an error. >> When I am using the first 100,000 (or, last 100,000 lines, respectively) it works! >> >> head -n 100000 mm9_test_noSpace.gtf > mm9_test_test_noSpace.gtf >> --> works! >> tail -n 100000 mm9_test_noSpace.gtf > mm9_test_test_noSpace.gtf >> --> works! >> >> I have no idea what is going on. Also, the .gtf file is from UCSC. >> >> Do you have any idea what the problem is? >> Thank you so much! >> Katja >> >> >> ----- Forwarded Message ----- >> From: "Michael Lawrence" <lawrence.michael at="" gene.com=""> >> To: "Katja Hebestreit" <katjah at="" stanford.edu=""> >> Cc: "Michael Lawrence" <lawrence.michael at="" gene.com=""> >> Sent: Monday, April 14, 2014 7:37:23 PM >> Subject: Re: [BioC] GenomicFeatures: Problem with makeTranscriptDbFromGFF >> >> Sorry, I hadn't tested the makeTranscriptDbFromGFF function. I didn't write >> that one, so the maintainers of GenomicFeatures will need to help now. >> >> >> On Mon, Apr 14, 2014 at 5:31 PM, Katja Hebestreit <katjah at="" stanford.edu="">wrote: >> >>> Great! Thank you so much, Micheal! >>> >>> Unfortunately, I still have problems with this file: >>> >>> txdb <- makeTranscriptDbFromGFF(file="Data/mm9_test_noSpace.gtf", >>> format="gtf") >>> extracting transcript information >>> Estimating transcript ranges. >>> Error in res[i, ] <- .deduceTranscriptRangeData(subs[[i]]) : >>> number of items to replace is not a multiple of replacement >>> >>> To determine the gene lengths I wanted to use the exact .gtf file we used >>> for counting the reads per gene some time ago. >>> >>> >>> ----- Original Message ----- >>> From: "Michael Lawrence" <lawrence.michael at="" gene.com=""> >>> To: "Katja Hebestreit" <katjah at="" stanford.edu=""> >>> Cc: "Vincent Carey" <stvjc at="" channing.harvard.edu="">, "Michael Lawrence" < >>> lawrence.michael at gene.com>, "Rsamtools Maintainer" < >>> maintainer at bioconductor.org>, bioconductor at r-project.org >>> Sent: Monday, April 14, 2014 2:55:06 PM >>> Subject: Re: [BioC] GenomicFeatures: Problem with makeTranscriptDbFromGFF >>> >>> I've made rtracklayer 1.25.1 ignore whitespace after the last attribute. >>> After that change, both the truncated and full file parse successfully. >>> Btw, which genes are these? The UCSC genes are already available as a TxDb >>> package. >>> >>> >>> On Mon, Apr 14, 2014 at 2:36 PM, Katja Hebestreit <katjah at="" stanford.edu="">>>> wrote: >>>> Okay, I removed the whitspaces with >>>> sed 's/.$//' mm9_test.gtf > mm9_test_noSpace.gtf >>>> >>>> Here is the file: >>>> https://www.dropbox.com/s/x5ne8qz8sqbxvrp/mm9_test_noSpace.gtf >>>> >>>> But still, I get an error: >>>> >>>> txdb <- makeTranscriptDbFromGFF(file="Data/mm9_test_noSpace.gtf", >>>> format="gtf") >>>> extracting transcript information >>>> Estimating transcript ranges. >>>> Error in res[i, ] <- .deduceTranscriptRangeData(subs[[i]]) : >>>> number of items to replace is not a multiple of replacement >>>> >>>> >>>> I tried to reduce the file further, in order to get a smaller example >>> file >>>> and to check which lines cause the problem, but: >>>> >>>> mm9_test_noSpace.gtf is 200,000 lines long and it results in an error. >>>> When I am using the first 100,000 (or, last 100,000 lines, respectively) >>> it >>>> works! >>>> >>>> head -n 100000 mm9_test_noSpace.gtf > mm9_test_test_noSpace.gtf >>>> --> works! >>>> tail -n 100000 mm9_test_noSpace.gtf > mm9_test_test_noSpace.gtf >>>> --> works! >>>> >>>> I have no idea what is going on. Also, the .gtf file is from UCSC. >>>> >>>> Thanks again for helping with that annoying problem! >>>> Katja >>>> >>>> >>>> ----- Original Message ----- >>>> From: "Vincent Carey" <stvjc at="" channing.harvard.ed="">>>> To: "Katja Hebestreit" <katjah at="" stanford.edu=""> >>>> Cc: "Michael Lawrence" <lawrence.michael at="" gene.com="">, "Rsamtools >>>> Maintainer" <maintainer at="" bioconductor.org="">, bioconductor at r-project.org >>>> Sent: Monday, April 14, 2014 11:45:30 AM >>>> Subject: Re: [BioC] GenomicFeatures: Problem with makeTranscriptDbFromGFF >>>> >>>> remove the trailing whitespace at the end of every line >>>> >>>> >>>> On Mon, Apr 14, 2014 at 2:24 PM, Katja Hebestreit <katjah at="" stanford.edu="">>>>> wrote: >>>>> You can download the file here: >>>>> >>>>> https://www.dropbox.com/s/04nck83jq6r91bc/mm9_test.gtf >>>>> >>>>> Using file I get the error: >>>>> >>>>> txdb <- makeTranscriptDbFromGFF(file="Data/mm9_test.gtf", format="gtf") >>>>> Error in .parse_attrCol(attrCol, file, colnames) : >>>>> Some attributes do not conform to 'tag value' format >>>>> >>>>> Thank you so much for helping!! >>>>> Katja >>>>> >>>>> >>>>> ----- Original Message ----- >>>>> From: "Michael Lawrence" <lawrence.michael at="" gene.com=""> >>>>> To: "Katja Hebestreit" <katjah at="" stanford.edu=""> >>>>> Cc: "Michael Lawrence" <lawrence.michael at="" gene.com="">, >>>>> bioconductor at r-project.org, "Rsamtools Maintainer" < >>>>> maintainer at bioconductor.org> >>>>> Sent: Monday, April 14, 2014 7:27:26 AM >>>>> Subject: Re: [BioC] GenomicFeatures: Problem with >>> makeTranscriptDbFromGFF >>>>> Well, I copied the text and replaced the spaces with tabs as >>> appropriate >>>>> and everything seemed to work fine, so you might to attach that >>> fragment >>>> of >>>>> the file, just to be sure it isn't a formatting issue. >>>>> >>>>> Does import("file.gtf") work for you? If so, that should be good enough >>>> for >>>>> your use case. >>>>> >>>>> Michael >>>>> >>>>> >>>>> On Sun, Apr 13, 2014 at 10:14 PM, Katja Hebestreit < >>> katjah at stanford.edu >>>>>> wrote: >>>>>> Actually, the error was not reproducible with the lines I attached. >>> But >>>>> it >>>>>> is reproducible with those lines (four additional lines): >>>>>> >>>>>> chr1 mm9_refFlat stop_codon 3206103 3206105 0.000000 >>>> - >>>>>> . gene_id "Xkr4"; transcript_id "Xkr4"; >>>>>> chr1 mm9_refFlat CDS 3206106 3207049 0.000000 - >>>> 2 >>>>>> gene_id "Xkr4"; transcript_id "Xkr4"; >>>>>> chr1 mm9_refFlat exon 3204563 3207049 0.000000 - >>>> . >>>>>> gene_id "Xkr4"; transcript_id "Xkr4"; >>>>>> chr1 mm9_refFlat CDS 3411783 3411982 0.000000 - >>>> 1 >>>>>> gene_id "Xkr4"; transcript_id "Xkr4"; >>>>>> chr1 mm9_refFlat exon 3411783 3411982 0.000000 - >>>> . >>>>>> gene_id "Xkr4"; transcript_id "Xkr4"; >>>>>> chr1 mm9_refFlat CDS 3660633 3661429 0.000000 - >>>> 0 >>>>>> gene_id "Xkr4"; transcript_id "Xkr4"; >>>>>> chr1 mm9_refFlat start_codon 3661427 3661429 0.000000 >>>> - >>>>>> . gene_id "Xkr4"; transcript_id "Xkr4"; >>>>>> chr1 mm9_refFlat exon 3660633 3661579 0.000000 - >>>> . >>>>>> gene_id "Xkr4"; transcript_id "Xkr4"; >>>>>> chr1 mm9_refFlat stop_codon 4283062 4283064 0.000000 >>>> - >>>>>> . gene_id "Rp1"; transcript_id "Rp1"; >>>>>> chr1 mm9_refFlat CDS 4283065 4283093 0.000000 - >>>> 2 >>>>>> gene_id "Rp1"; transcript_id "Rp1"; >>>>>> >>>>>> Let me know if you like to get the entire file. >>>>>> >>>>>> Thank you!! >>>>>> Katja >>>>>> >>>>>> ----- Original Message ----- >>>>>> From: "Michael Lawrence" <lawrence.michael at="" gene.com=""> >>>>>> To: "Katja Hebestreit" <katjah at="" stanford.edu=""> >>>>>> Cc: bioconductor at r-project.org, "Rsamtools Maintainer" < >>>>>> maintainer at bioconductor.org> >>>>>> Sent: Sunday, April 13, 2014 10:02:13 PM >>>>>> Subject: Re: [BioC] GenomicFeatures: Problem with >>>> makeTranscriptDbFromGFF >>>>>> On Sun, Apr 13, 2014 at 7:18 PM, Katja Hebestreit < >>> katjah at stanford.edu >>>>>>> wrote: >>>>>>> Hello, >>>>>>> >>>>>>> I get an error when I try to import my gff file: >>>>>>> >>>>>>> txdb <- makeTranscriptDbFromGFF(file="file.gtf", format="gtf") >>>>>>> >>>>>>> Error in .parse_attrCol(attrCol, file, colnames) : >>>>>>> Some attributes do not conform to 'tag value' format >>>>>>> >>>>>>> This is how my file looks like: >>>>>>> >>>>>>> chr1 mm9_refFlat stop_codon 3206103 3206105 0.000000 >>>>> - >>>>>>> . gene_id "Xkr4"; transcript_id "Xkr4"; >>>>>>> chr1 mm9_refFlat CDS 3206106 3207049 0.000000 - >>>>> 2 >>>>>>> gene_id "Xkr4"; transcript_id "Xkr4"; >>>>>>> chr1 mm9_refFlat exon 3204563 3207049 0.000000 - >>>>> . >>>>>>> gene_id "Xkr4"; transcript_id "Xkr4"; >>>>>>> chr1 mm9_refFlat CDS 3411783 3411982 0.000000 - >>>>> 1 >>>>>>> gene_id "Xkr4"; transcript_id "Xkr4"; >>>>>>> chr1 mm9_refFlat exon 3411783 3411982 0.000000 - >>>>> . >>>>>>> gene_id "Xkr4"; transcript_id "Xkr4"; >>>>>>> chr1 mm9_refFlat CDS 3660633 3661429 0.000000 - >>>>> 0 >>>>>>> gene_id "Xkr4"; transcript_id "Xkr4"; >>>>>>> >>>>>>> I have the feeling that this has something to do with the missing >>>> exon >>>>>>> rank information in my file. Is that true? Is there a way to import >>>>> this >>>>>>> file? All I want to do is to determine the gene lengths. >>>>>>> >>>>>> It is most likely as the error says: some of your attributes are >>>>> malformed. >>>>>> Is that the entire file listed above, or is there more? If you could >>>> get >>>>> me >>>>>> the file somehow I could diagnose the issue. >>>>>> >>>>>> >>>>>>> Could anyone help? That would be awesome! >>>>>>> Cheers, >>>>>>> Katja >>>>>>> >>>>>>> >>>>>>> sessionInfo() >>>>>>> R version 3.1.0 (2014-04-10) >>>>>>> Platform: x86_64-unknown-linux-gnu (64-bit) >>>>>>> >>>>>>> locale: >>>>>>> [1] LC_CTYPE=de_DE.UTF-8 LC_NUMERIC=C >>>>>>> [3] LC_TIME=de_DE.UTF-8 LC_COLLATE=de_DE.UTF-8 >>>>>>> [5] LC_MONETARY=de_DE.UTF-8 LC_MESSAGES=de_DE.UTF-8 >>>>>>> [7] LC_PAPER=de_DE.UTF-8 LC_NAME=C >>>>>>> [9] LC_ADDRESS=C LC_TELEPHONE=C >>>>>>> [11] LC_MEASUREMENT=de_DE.UTF-8 LC_IDENTIFICATION=C >>>>>>> >>>>>>> attached base packages: >>>>>>> [1] parallel stats graphics grDevices utils datasets >>>> methods >>>>>>> [8] base >>>>>>> >>>>>>> other attached packages: >>>>>>> [1] GenomicFeatures_1.16.0 AnnotationDbi_1.25.19 Biobase_2.23.6 >>>>>>> [4] GenomicRanges_1.16.0 GenomeInfoDb_0.99.32 IRanges_1.21.45 >>>>>>> [7] BiocGenerics_0.9.3 BiocInstaller_1.14.1 >>>>>>> >>>>>>> loaded via a namespace (and not attached): >>>>>>> [1] BatchJobs_1.2 BBmisc_1.5 >>>> BiocParallel_0.6.0 >>>>>>> [4] biomaRt_2.20.0 Biostrings_2.32.0 bitops_1.0-6 >>>>>>> [7] brew_1.0-6 BSgenome_1.32.0 >>> codetools_0.2-8 >>>>>>> [10] DBI_0.2-7 digest_0.6.4 fail_1.2 >>>>>>> [13] foreach_1.4.2 GenomicAlignments_1.0.0 >>> iterators_1.0.7 >>>>>>> [16] plyr_1.8.1 Rcpp_0.11.1 RCurl_1.95-4.1 >>>>>>> [19] Rsamtools_1.16.0 RSQLite_0.11.4 >>>> rtracklayer_1.24.0 >>>>>>> [22] sendmailR_1.1-2 stats4_3.1.0 stringr_0.6.2 >>>>>>> [25] tools_3.1.0 XML_3.98-1.1 XVector_0.4.0 >>>>>>> [28] zlibbioc_1.10.0 >>>>>>> >>>>>>> _______________________________________________ >>>>>>> Bioconductor mailing list >>>>>>> Bioconductor at r-project.org >>>>>>> https://stat.ethz.ch/mailman/listinfo/bioconductor >>>>>>> Search the archives: >>>>>>> http://news.gmane.org/gmane.science.biology.informatics.conductor >>>>>>> >>>>> _______________________________________________ >>>>> Bioconductor mailing list >>>>> Bioconductor at r-project.org >>>>> https://stat.ethz.ch/mailman/listinfo/bioconductor >>>>> Search the archives: >>>>> http://news.gmane.org/gmane.science.biology.informatics.conductor >>>>> >> ___________________________________________________________________ _____ >> devteam-bioc mailing list >> To unsubscribe from this mailing list send a blank email to >> devteam-bioc-leave at lists.fhcrc.org >> You can also unsubscribe or change your personal options at >> https://lists.fhcrc.org/mailman/listinfo/devteam-bioc >> >
ADD REPLYlink written 4.2 years ago by Marc Carlson7.2k
Please log in to add an answer.

Help
Access

Use of this site constitutes acceptance of our User Agreement and Privacy Policy.
Powered by Biostar version 2.2.0
Traffic: 119 users visited in the last hour