GTF file error when using easyRNAseq�

0

Entering edit mode

Guest User ★ 13k

@guest-user-4897

Last seen 11.4 years ago

Dear all, I would like to make a count table to use it in DESeq. I??ve tried to use easyRNAseq but I have a problem with the annotation file. I???ve downloaded the file Gallus_gallus.Galgal4.73.gtf from Ensembl. As I run into the problem Error in .doBasicCount(obj) : The genomicAnnotation slot is empty, I modified the file and added chr before the chromosome number. The next problem was this: Your gtf file: Gallus_gallus.Galgal4.73.gtf does not contain all the required fields: gene_id, transcript_id, exon_number, gene_name. To solve this problem: - I deleted all the entries without gene_name (first example): gene_id "ENSGALG00000009771"; transcript_id "ENSGALT00000015891"; exon_number "1"; gene_biotype "protein_coding"; exon_id "ENSGALE00000301221"; gene_id "ENSGALG00000009783"; transcript_id "ENSGALT00000015914"; exon_number "2"; gene_name "GOLGB1"; gene_biotype "protein_coding"; transcript_name "GOLGB1-201"; exon_id "ENSGALE00000105891"; - I checked the chromosome numbers and deleted the entries that didn???t match any chromosome from BSgenome.Ggallus.UCSC.galGal4 (I can???t find any entry corresponding to chr32 in the Gallus_gallus.Galgal4.73.gtf file, I don???t know if it is a problem): - I searched for semicolons and single quotes ??? in the gene names, but I didn???t find any on the final file. - I deleted all the columns after gene_name. So finally the annotation file entries look like this: chr1 protein_coding exon 19962541 19963992 . + . gene_id "ENSGALG00000000003"; transcript_id "ENSGALT00000000003"; exon_number "2"; gene_name "PANX2"; Nothing works; the error message is always the same. So, I don???t know what else I can do. Could you please help me? Thank you in advance! Cheers Natalia here is my code: > count.table <- easyRNASeq("/RNAseqGallus", organism="Ggallus", chrSizes="chrSizes", annotationMethod="gtf", annotationFile="Gallus_gallus.Galgal4.73.gtf", count="genes", summarization="geneModels", format="bam", gapped=TRUE, filenames=c("NS1gallus.bam","NS2gallus.bam"), outputFormat="DESeq", conditions=conditions) Checking arguments... Fetching annotations... Read 334620 records Error en .getGtfRange(organismName(obj), filename = filename, ignoreWarnings = ignoreWarnings, : Your gtf file: Gallus_gallus.Galgal4.73.gtf does not contain all the required fields: gene_id, transcript_id, exon_number, gene_name. Adem??s: Mensajes de aviso perdidos 1: In easyRNASeq("/RNAseqGallus", organism = "Ggallus", chrSizes = "chrSizes", : Your organism has no mapping defined to perform the validity check for the UCSC compliance of the chromosome name. Defined organism's mapping can be listed using the 'knownOrganisms' function. To benefit from the validity check, you can provide a 'chr.map' to your 'easyRNASeq' function call. As you did not do so, 'validity.check' is turned off 2: In .Method(..., deparse.level = deparse.level) : number of columns of result is not a multiple of vector length (arg 1) > traceback() 6: stop(paste("Your gtf file: ", filename, " does not contain all the required fields: ", paste(fields, collapse = ", "), ".", sep = "")) 5: .getGtfRange(organismName(obj), filename = filename, ignoreWarnings = ignoreWarnings, ...) 4: fetchAnnotation(obj, method = annotationMethod, filename = annotationFile, ignoreWarnings = ignoreWarnings, ...) 3: fetchAnnotation(obj, method = annotationMethod, filename = annotationFile, ignoreWarnings = ignoreWarnings, ...) 2: easyRNASeq("/RNAseqGallus", organism = "Ggallus", chrSizes = "chrSizes", annotationMethod = "gtf", annotationFile = " Gallus_gallus.Galgal4.73.gtf ", count = "genes", summarization = "geneModels", format = "bam", gapped = TRUE, filenames = c("NS1gallus.bam", "NS2gallus.bam"), outputFormat = "DESeq", conditions = conditions) 1: easyRNASeq("/RNAseqGallus", organism = "Ggallus", chrSizes = "chrSizes", annotationMethod = "gtf", annotationFile = " Gallus_gallus.Galgal4.73.gtf ", count = "genes", summarization = "geneModels", format = "bam", gapped = TRUE, filenames = c("NS1gallus.bam", "NS2gallus.bam"), outputFormat = "DESeq", conditions = conditions) -- output of sessionInfo(): > sessionInfo() R version 3.0.2 (2013-09-25) Platform: x86_64-w64-mingw32/x64 (64-bit) locale: [1] LC_COLLATE=Spanish_Spain.1252 LC_CTYPE=Spanish_Spain.1252 LC_MONETARY=Spanish_Spain.1252 LC_NUMERIC=C LC_TIME=Spanish_Spain.1252 attached base packages: [1] parallel stats graphics grDevices utils datasets methods base other attached packages: [1] BSgenome.Ggallus.UCSC.galGal4_1.3.18 BSgenome_1.30.0 easyRNASeq_1.8.1 ShortRead_1.20.0 [5] Rsamtools_1.14.1 GenomicRanges_1.14.3 DESeq_1.14.0 lattice_0.20-23 [9] locfit_1.5-9.1 Biostrings_2.30.0 XVector_0.2.0 IRanges_1.20.4 [13] edgeR_3.4.0 limma_3.18.2 biomaRt_2.18.0 Biobase_2.22.0 [17] genomeIntervals_1.18.0 BiocGenerics_0.8.0 intervals_0.14.0 BiocInstaller_1.12.0 loaded via a namespace (and not attached): [1] annotate_1.40.0 AnnotationDbi_1.24.0 bitops_1.0-6 DBI_0.2-7 genefilter_1.44.0 geneplotter_1.40.0 grid_3.0.2 [8] hwriter_1.3 latticeExtra_0.6-26 LSD_2.5 RColorBrewer_1.0-5 RCurl_1.95-4.1 RSQLite_0.11.4 splines_3.0.2 [15] stats4_3.0.2 survival_2.37-4 tools_3.0.2 XML_3.98-1.1 xtable_1.7-1 zlibbioc_1.8.0 -- Sent via the guest posting facility at bioconductor.org.

Annotation Organism BSgenome BSgenome DESeq easyRNASeq Annotation Organism BSgenome DESeq • 3.6k views

ADD COMMENT • link updated 12.2 years ago by delhomme@embl.de ★ 1.2k • written 12.2 years ago by Guest User ★ 13k

0

Entering edit mode

delhomme@embl.de ★ 1.2k

@delhommeemblde-3232

Last seen 11.4 years ago

Hej Natalia! This is not the first time that I?ve seen this error on the list, but I?ve not been able to reproduce it so far with my own data. Would you mind sharing some data offline, just an excerpt of your files would do. If that?s OK, I can create and give access to a folder on my box account. I had already relaxed the constraint on parsing a gtf file in a previous update but forgot to reflect the changes in the error message. Only the gene_id and transcript_id are actually mandatory. I would not expect any issue with the EnsEMBL gtf file, but I?ll have a look at why it fails for Gallus galls one and let you know asap. This as nothing to do with this error, but by looking at your command line, I saw that you provide a character vector to the chrSize argument. This is not necessary as this information is extracted from the bam file in your case, so you can just drop the chrSizes = ?chrSizes? from your command line. I?ve added some extra check in the method to detect this now. Thanks :-) Cheers, Nico --------------------------------------------------------------- Nicolas Delhomme Genome Biology Computational Support European Molecular Biology Laboratory Tel: +49 6221 387 8310 Email: nicolas.delhomme at embl.de Meyerhofstrasse 1 - Postfach 10.2209 69102 Heidelberg, Germany --------------------------------------------------------------- On 11 Nov 2013, at 16:31, Natalia [guest] <guest at="" bioconductor.org=""> wrote: > > Dear all, > I would like to make a count table to use it in DESeq. I?ve tried to use easyRNAseq but I have a problem with the annotation file. I?ve downloaded the file Gallus_gallus.Galgal4.73.gtf from Ensembl. As I run into the problem Error in .doBasicCount(obj) : The genomicAnnotation slot is empty, I modified the file and added chr before the chromosome number. The next problem was this: > > Your gtf file: Gallus_gallus.Galgal4.73.gtf does not contain all the required fields: gene_id, transcript_id, exon_number, gene_name. > > To solve this problem: > - I deleted all the entries without gene_name (first example): > > gene_id "ENSGALG00000009771"; transcript_id "ENSGALT00000015891"; exon_number "1"; gene_biotype "protein_coding"; exon_id "ENSGALE00000301221"; > > gene_id "ENSGALG00000009783"; transcript_id "ENSGALT00000015914"; exon_number "2"; gene_name "GOLGB1"; gene_biotype "protein_coding"; transcript_name "GOLGB1-201"; exon_id "ENSGALE00000105891"; > > - I checked the chromosome numbers and deleted the entries that didn?t match any chromosome from BSgenome.Ggallus.UCSC.galGal4 (I can?t find any entry corresponding to chr32 in the Gallus_gallus.Galgal4.73.gtf file, I don?t know if it is a problem): > > - I searched for semicolons and single quotes ? in the gene names, but I didn?t find any on the final file. > > - I deleted all the columns after gene_name. > > So finally the annotation file entries look like this: > chr1 protein_coding exon 19962541 19963992 > . + . gene_id "ENSGALG00000000003"; transcript_id "ENSGALT00000000003"; exon_number "2"; gene_name "PANX2"; > > Nothing works; the error message is always the same. So, I don?t know what else I can do. Could you please help me? > Thank you in advance! > Cheers > > Natalia > > > here is my code: >> count.table <- easyRNASeq("/RNAseqGallus", organism="Ggallus", chrSizes="chrSizes", annotationMethod="gtf", annotationFile="Gallus_gallus.Galgal4.73.gtf", count="genes", summarization="geneModels", format="bam", gapped=TRUE, filenames=c("NS1gallus.bam","NS2gallus.bam"), outputFormat="DESeq", conditions=conditions) > Checking arguments... > Fetching annotations... > Read 334620 records > Error en .getGtfRange(organismName(obj), filename = filename, ignoreWarnings = ignoreWarnings, : > Your gtf file: Gallus_gallus.Galgal4.73.gtf does not contain all the required fields: gene_id, transcript_id, exon_number, gene_name. > Adem?s: Mensajes de aviso perdidos > 1: In easyRNASeq("/RNAseqGallus", organism = "Ggallus", chrSizes = "chrSizes", : > Your organism has no mapping defined to perform the validity check for the UCSC compliance of the chromosome name. > Defined organism's mapping can be listed using the 'knownOrganisms' function. > To benefit from the validity check, you can provide a 'chr.map' to your 'easyRNASeq' function call. > As you did not do so, 'validity.check' is turned off > 2: In .Method(..., deparse.level = deparse.level) : > number of columns of result is not a multiple of vector length (arg 1) > >> traceback() > 6: stop(paste("Your gtf file: ", filename, " does not contain all the required fields: ", > paste(fields, collapse = ", "), ".", sep = "")) > 5: .getGtfRange(organismName(obj), filename = filename, ignoreWarnings = ignoreWarnings, > ...) > 4: fetchAnnotation(obj, method = annotationMethod, filename = annotationFile, > ignoreWarnings = ignoreWarnings, ...) > 3: fetchAnnotation(obj, method = annotationMethod, filename = annotationFile, > ignoreWarnings = ignoreWarnings, ...) > 2: easyRNASeq("/RNAseqGallus", organism = "Ggallus", chrSizes = "chrSizes", > annotationMethod = "gtf", annotationFile = " Gallus_gallus.Galgal4.73.gtf ", > count = "genes", summarization = "geneModels", format = "bam", > gapped = TRUE, filenames = c("NS1gallus.bam", "NS2gallus.bam"), > outputFormat = "DESeq", conditions = conditions) > 1: easyRNASeq("/RNAseqGallus", organism = "Ggallus", chrSizes = "chrSizes", > annotationMethod = "gtf", annotationFile = " Gallus_gallus.Galgal4.73.gtf ", > count = "genes", summarization = "geneModels", format = "bam", > gapped = TRUE, filenames = c("NS1gallus.bam", "NS2gallus.bam"), > outputFormat = "DESeq", conditions = conditions) > > > -- output of sessionInfo(): > >> sessionInfo() > R version 3.0.2 (2013-09-25) > Platform: x86_64-w64-mingw32/x64 (64-bit) > > locale: > [1] LC_COLLATE=Spanish_Spain.1252 LC_CTYPE=Spanish_Spain.1252 LC_MONETARY=Spanish_Spain.1252 LC_NUMERIC=C LC_TIME=Spanish_Spain.1252 > > attached base packages: > [1] parallel stats graphics grDevices utils datasets methods base > > other attached packages: > [1] BSgenome.Ggallus.UCSC.galGal4_1.3.18 BSgenome_1.30.0 easyRNASeq_1.8.1 ShortRead_1.20.0 > [5] Rsamtools_1.14.1 GenomicRanges_1.14.3 DESeq_1.14.0 lattice_0.20-23 > [9] locfit_1.5-9.1 Biostrings_2.30.0 XVector_0.2.0 IRanges_1.20.4 > [13] edgeR_3.4.0 limma_3.18.2 biomaRt_2.18.0 Biobase_2.22.0 > [17] genomeIntervals_1.18.0 BiocGenerics_0.8.0 intervals_0.14.0 BiocInstaller_1.12.0 > > loaded via a namespace (and not attached): > [1] annotate_1.40.0 AnnotationDbi_1.24.0 bitops_1.0-6 DBI_0.2-7 genefilter_1.44.0 geneplotter_1.40.0 grid_3.0.2 > [8] hwriter_1.3 latticeExtra_0.6-26 LSD_2.5 RColorBrewer_1.0-5 RCurl_1.95-4.1 RSQLite_0.11.4 splines_3.0.2 > [15] stats4_3.0.2 survival_2.37-4 tools_3.0.2 XML_3.98-1.1 xtable_1.7-1 zlibbioc_1.8.0 > > > -- > Sent via the guest posting facility at bioconductor.org.

ADD COMMENT • link 12.2 years ago delhomme@embl.de ★ 1.2k

0

Entering edit mode

Hej Natalia! There were a number of lines in that particular gtf that violated the assumptions I had about EnsEMBL gtf. Not all the fields in the attributes' column were always set and one of the gene name had a space character in it. I?ve made the parsing of gtf file annotation more flexible/lenient and that should resolve that particular issue you had. The changes should propagate in ~2 days to Bioc with easyRNASeq version 1.8.2. Rather than using the geneModel, which implementation is old and has gotten slow because of changes in the underlying architecture, I prefer an approach where I 1) filter the gtf / gff annotation file for only those lines I?m interested in (e.g. of type exon, mRNA and gene for a gff file) 2) collapse every exon of a gene into what I call now a ?synthetic transcript?. The reason for changing the naming from geneModel to synthetic transcript is that ?gene model? has different meaning depending on the field of application. 3) use easyRNASeq with the modified annotation and the ?transcript? count method. I?ve detailed that procedure in the vignette section 7.1 . Cheers, Nico --------------------------------------------------------------- Nicolas Delhomme Genome Biology Computational Support European Molecular Biology Laboratory Tel: +49 6221 387 8310 Email: nicolas.delhomme at embl.de Meyerhofstrasse 1 - Postfach 10.2209 69102 Heidelberg, Germany --------------------------------------------------------------- On 12 Nov 2013, at 14:21, Nicolas Delhomme <delhomme at="" embl.de=""> wrote: > Hej Natalia! > > This is not the first time that I?ve seen this error on the list, but I?ve not been able to reproduce it so far with my own data. Would you mind sharing some data offline, just an excerpt of your files would do. If that?s OK, I can create and give access to a folder on my box account. > > I had already relaxed the constraint on parsing a gtf file in a previous update but forgot to reflect the changes in the error message. Only the gene_id and transcript_id are actually mandatory. I would not expect any issue with the EnsEMBL gtf file, but I?ll have a look at why it fails for Gallus galls one and let you know asap. > > This as nothing to do with this error, but by looking at your command line, I saw that you provide a character vector to the chrSize argument. This is not necessary as this information is extracted from the bam file in your case, so you can just drop the chrSizes = ?chrSizes? from your command line. I?ve added some extra check in the method to detect this now. Thanks :-) > > Cheers, > > Nico > > --------------------------------------------------------------- > Nicolas Delhomme > > Genome Biology Computational Support > > European Molecular Biology Laboratory > > Tel: +49 6221 387 8310 > Email: nicolas.delhomme at embl.de > Meyerhofstrasse 1 - Postfach 10.2209 > 69102 Heidelberg, Germany > --------------------------------------------------------------- > > > > > > On 11 Nov 2013, at 16:31, Natalia [guest] <guest at="" bioconductor.org=""> wrote: > >> >> Dear all, >> I would like to make a count table to use it in DESeq. I?ve tried to use easyRNAseq but I have a problem with the annotation file. I?ve downloaded the file Gallus_gallus.Galgal4.73.gtf from Ensembl. As I run into the problem Error in .doBasicCount(obj) : The genomicAnnotation slot is empty, I modified the file and added chr before the chromosome number. The next problem was this: >> >> Your gtf file: Gallus_gallus.Galgal4.73.gtf does not contain all the required fields: gene_id, transcript_id, exon_number, gene_name. >> >> To solve this problem: >> - I deleted all the entries without gene_name (first example): >> >> gene_id "ENSGALG00000009771"; transcript_id "ENSGALT00000015891"; exon_number "1"; gene_biotype "protein_coding"; exon_id "ENSGALE00000301221"; >> >> gene_id "ENSGALG00000009783"; transcript_id "ENSGALT00000015914"; exon_number "2"; gene_name "GOLGB1"; gene_biotype "protein_coding"; transcript_name "GOLGB1-201"; exon_id "ENSGALE00000105891"; >> >> - I checked the chromosome numbers and deleted the entries that didn?t match any chromosome from BSgenome.Ggallus.UCSC.galGal4 (I can?t find any entry corresponding to chr32 in the Gallus_gallus.Galgal4.73.gtf file, I don?t know if it is a problem): >> >> - I searched for semicolons and single quotes ? in the gene names, but I didn?t find any on the final file. >> >> - I deleted all the columns after gene_name. >> >> So finally the annotation file entries look like this: >> chr1 protein_coding exon 19962541 19963992 >> . + . gene_id "ENSGALG00000000003"; transcript_id "ENSGALT00000000003"; exon_number "2"; gene_name "PANX2"; >> >> Nothing works; the error message is always the same. So, I don?t know what else I can do. Could you please help me? >> Thank you in advance! >> Cheers >> >> Natalia >> >> >> here is my code: >>> count.table <- easyRNASeq("/RNAseqGallus", organism="Ggallus", chrSizes="chrSizes", annotationMethod="gtf", annotationFile="Gallus_gallus.Galgal4.73.gtf", count="genes", summarization="geneModels", format="bam", gapped=TRUE, filenames=c("NS1gallus.bam","NS2gallus.bam"), outputFormat="DESeq", conditions=conditions) >> Checking arguments... >> Fetching annotations... >> Read 334620 records >> Error en .getGtfRange(organismName(obj), filename = filename, ignoreWarnings = ignoreWarnings, : >> Your gtf file: Gallus_gallus.Galgal4.73.gtf does not contain all the required fields: gene_id, transcript_id, exon_number, gene_name. >> Adem?s: Mensajes de aviso perdidos >> 1: In easyRNASeq("/RNAseqGallus", organism = "Ggallus", chrSizes = "chrSizes", : >> Your organism has no mapping defined to perform the validity check for the UCSC compliance of the chromosome name. >> Defined organism's mapping can be listed using the 'knownOrganisms' function. >> To benefit from the validity check, you can provide a 'chr.map' to your 'easyRNASeq' function call. >> As you did not do so, 'validity.check' is turned off >> 2: In .Method(..., deparse.level = deparse.level) : >> number of columns of result is not a multiple of vector length (arg 1) >> >>> traceback() >> 6: stop(paste("Your gtf file: ", filename, " does not contain all the required fields: ", >> paste(fields, collapse = ", "), ".", sep = "")) >> 5: .getGtfRange(organismName(obj), filename = filename, ignoreWarnings = ignoreWarnings, >> ...) >> 4: fetchAnnotation(obj, method = annotationMethod, filename = annotationFile, >> ignoreWarnings = ignoreWarnings, ...) >> 3: fetchAnnotation(obj, method = annotationMethod, filename = annotationFile, >> ignoreWarnings = ignoreWarnings, ...) >> 2: easyRNASeq("/RNAseqGallus", organism = "Ggallus", chrSizes = "chrSizes", >> annotationMethod = "gtf", annotationFile = " Gallus_gallus.Galgal4.73.gtf ", >> count = "genes", summarization = "geneModels", format = "bam", >> gapped = TRUE, filenames = c("NS1gallus.bam", "NS2gallus.bam"), >> outputFormat = "DESeq", conditions = conditions) >> 1: easyRNASeq("/RNAseqGallus", organism = "Ggallus", chrSizes = "chrSizes", >> annotationMethod = "gtf", annotationFile = " Gallus_gallus.Galgal4.73.gtf ", >> count = "genes", summarization = "geneModels", format = "bam", >> gapped = TRUE, filenames = c("NS1gallus.bam", "NS2gallus.bam"), >> outputFormat = "DESeq", conditions = conditions) >> >> >> -- output of sessionInfo(): >> >>> sessionInfo() >> R version 3.0.2 (2013-09-25) >> Platform: x86_64-w64-mingw32/x64 (64-bit) >> >> locale: >> [1] LC_COLLATE=Spanish_Spain.1252 LC_CTYPE=Spanish_Spain.1252 LC_MONETARY=Spanish_Spain.1252 LC_NUMERIC=C LC_TIME=Spanish_Spain.1252 >> >> attached base packages: >> [1] parallel stats graphics grDevices utils datasets methods base >> >> other attached packages: >> [1] BSgenome.Ggallus.UCSC.galGal4_1.3.18 BSgenome_1.30.0 easyRNASeq_1.8.1 ShortRead_1.20.0 >> [5] Rsamtools_1.14.1 GenomicRanges_1.14.3 DESeq_1.14.0 lattice_0.20-23 >> [9] locfit_1.5-9.1 Biostrings_2.30.0 XVector_0.2.0 IRanges_1.20.4 >> [13] edgeR_3.4.0 limma_3.18.2 biomaRt_2.18.0 Biobase_2.22.0 >> [17] genomeIntervals_1.18.0 BiocGenerics_0.8.0 intervals_0.14.0 BiocInstaller_1.12.0 >> >> loaded via a namespace (and not attached): >> [1] annotate_1.40.0 AnnotationDbi_1.24.0 bitops_1.0-6 DBI_0.2-7 genefilter_1.44.0 geneplotter_1.40.0 grid_3.0.2 >> [8] hwriter_1.3 latticeExtra_0.6-26 LSD_2.5 RColorBrewer_1.0-5 RCurl_1.95-4.1 RSQLite_0.11.4 splines_3.0.2 >> [15] stats4_3.0.2 survival_2.37-4 tools_3.0.2 XML_3.98-1.1 xtable_1.7-1 zlibbioc_1.8.0 >> >> >> -- >> Sent via the guest posting facility at bioconductor.org. > > _______________________________________________ > Bioconductor mailing list > Bioconductor at r-project.org > https://stat.ethz.ch/mailman/listinfo/bioconductor > Search the archives: http://news.gmane.org/gmane.science.biology.informatics.conductor

ADD REPLY • link 12.2 years ago delhomme@embl.de ★ 1.2k

0

Entering edit mode

Why not use rtracklayer / GenomicFeatures for parsing GTF? That format is tough; no reason for everyone to take it on by themselves. On Fri, Nov 15, 2013 at 2:40 AM, Nicolas Delhomme <delhomme@embl.de> wrote: > Hej Natalia! > > There were a number of lines in that particular gtf that violated the > assumptions I had about EnsEMBL gtf. Not all the fields in the attributes' > column were always set and one of the gene name had a space character in > it. Ive made the parsing of gtf file annotation more flexible/lenient and > that should resolve that particular issue you had. The changes should > propagate in ~2 days to Bioc with easyRNASeq version 1.8.2. > > Rather than using the geneModel, which implementation is old and has > gotten slow because of changes in the underlying architecture, I prefer an > approach where I > 1) filter the gtf / gff annotation file for only those lines Im > interested in (e.g. of type exon, mRNA and gene for a gff file) > 2) collapse every exon of a gene into what I call now a synthetic > transcript. The reason for changing the naming from geneModel to synthetic > transcript is that gene model has different meaning depending on the > field of application. > 3) use easyRNASeq with the modified annotation and the transcript count > method. > > Ive detailed that procedure in the vignette section 7.1 . > > Cheers, > > Nico > > --------------------------------------------------------------- > Nicolas Delhomme > > Genome Biology Computational Support > > European Molecular Biology Laboratory > > Tel: +49 6221 387 8310 > Email: nicolas.delhomme@embl.de > Meyerhofstrasse 1 - Postfach 10.2209 > 69102 Heidelberg, Germany > --------------------------------------------------------------- > > > > > > On 12 Nov 2013, at 14:21, Nicolas Delhomme <delhomme@embl.de> wrote: > > > Hej Natalia! > > > > This is not the first time that Ive seen this error on the list, but > Ive not been able to reproduce it so far with my own data. Would you mind > sharing some data offline, just an excerpt of your files would do. If > thats OK, I can create and give access to a folder on my box account. > > > > I had already relaxed the constraint on parsing a gtf file in a previous > update but forgot to reflect the changes in the error message. Only the > gene_id and transcript_id are actually mandatory. I would not expect any > issue with the EnsEMBL gtf file, but Ill have a look at why it fails for > Gallus galls one and let you know asap. > > > > This as nothing to do with this error, but by looking at your command > line, I saw that you provide a character vector to the chrSize argument. > This is not necessary as this information is extracted from the bam file in > your case, so you can just drop the chrSizes = chrSizes from your command > line. Ive added some extra check in the method to detect this now. Thanks > :-) > > > > Cheers, > > > > Nico > > > > --------------------------------------------------------------- > > Nicolas Delhomme > > > > Genome Biology Computational Support > > > > European Molecular Biology Laboratory > > > > Tel: +49 6221 387 8310 > > Email: nicolas.delhomme@embl.de > > Meyerhofstrasse 1 - Postfach 10.2209 > > 69102 Heidelberg, Germany > > --------------------------------------------------------------- > > > > > > > > > > > > On 11 Nov 2013, at 16:31, Natalia [guest] <guest@bioconductor.org> > wrote: > > > >> > >> Dear all, > >> I would like to make a count table to use it in DESeq. I´ve tried to > use easyRNAseq but I have a problem with the annotation file. Ive > downloaded the file Gallus_gallus.Galgal4.73.gtf from Ensembl. As I run > into the problem Error in .doBasicCount(obj) : The genomicAnnotation slot > is empty, I modified the file and added chr before the chromosome number. > The next problem was this: > >> > >> Your gtf file: Gallus_gallus.Galgal4.73.gtf does not contain all the > required fields: gene_id, transcript_id, exon_number, gene_name. > >> > >> To solve this problem: > >> - I deleted all the entries without gene_name (first example): > >> > >> gene_id "ENSGALG00000009771"; transcript_id "ENSGALT00000015891"; > exon_number "1"; gene_biotype "protein_coding"; exon_id > "ENSGALE00000301221"; > >> > >> gene_id "ENSGALG00000009783"; transcript_id "ENSGALT00000015914"; > exon_number "2"; gene_name "GOLGB1"; gene_biotype "protein_coding"; > transcript_name "GOLGB1-201"; exon_id "ENSGALE00000105891"; > >> > >> - I checked the chromosome numbers and deleted the entries that didnt > match any chromosome from BSgenome.Ggallus.UCSC.galGal4 (I cant find any > entry corresponding to chr32 in the Gallus_gallus.Galgal4.73.gtf file, I > dont know if it is a problem): > >> > >> - I searched for semicolons and single quotes in the gene names, but > I didnt find any on the final file. > >> > >> - I deleted all the columns after gene_name. > >> > >> So finally the annotation file entries look like this: > >> chr1 protein_coding exon 19962541 19963992 > >> . + . gene_id "ENSGALG00000000003"; > transcript_id "ENSGALT00000000003"; exon_number "2"; gene_name "PANX2"; > >> > >> Nothing works; the error message is always the same. So, I dont know > what else I can do. Could you please help me? > >> Thank you in advance! > >> Cheers > >> > >> Natalia > >> > >> > >> here is my code: > >>> count.table <- easyRNASeq("/RNAseqGallus", organism="Ggallus", > chrSizes="chrSizes", annotationMethod="gtf", > annotationFile="Gallus_gallus.Galgal4.73.gtf", count="genes", > summarization="geneModels", format="bam", gapped=TRUE, > filenames=c("NS1gallus.bam","NS2gallus.bam"), outputFormat="DESeq", > conditions=conditions) > >> Checking arguments... > >> Fetching annotations... > >> Read 334620 records > >> Error en .getGtfRange(organismName(obj), filename = filename, > ignoreWarnings = ignoreWarnings, : > >> Your gtf file: Gallus_gallus.Galgal4.73.gtf does not contain all the > required fields: gene_id, transcript_id, exon_number, gene_name. > >> Además: Mensajes de aviso perdidos > >> 1: In easyRNASeq("/RNAseqGallus", organism = "Ggallus", chrSizes = > "chrSizes", : > >> Your organism has no mapping defined to perform the validity check for > the UCSC compliance of the chromosome name. > >> Defined organism's mapping can be listed using the 'knownOrganisms' > function. > >> To benefit from the validity check, you can provide a 'chr.map' to your > 'easyRNASeq' function call. > >> As you did not do so, 'validity.check' is turned off > >> 2: In .Method(..., deparse.level = deparse.level) : > >> number of columns of result is not a multiple of vector length (arg 1) > >> > >>> traceback() > >> 6: stop(paste("Your gtf file: ", filename, " does not contain all the > required fields: ", > >> paste(fields, collapse = ", "), ".", sep = "")) > >> 5: .getGtfRange(organismName(obj), filename = filename, ignoreWarnings > = ignoreWarnings, > >> ...) > >> 4: fetchAnnotation(obj, method = annotationMethod, filename = > annotationFile, > >> ignoreWarnings = ignoreWarnings, ...) > >> 3: fetchAnnotation(obj, method = annotationMethod, filename = > annotationFile, > >> ignoreWarnings = ignoreWarnings, ...) > >> 2: easyRNASeq("/RNAseqGallus", organism = "Ggallus", chrSizes = > "chrSizes", > >> annotationMethod = "gtf", annotationFile = " > Gallus_gallus.Galgal4.73.gtf ", > >> count = "genes", summarization = "geneModels", format = "bam", > >> gapped = TRUE, filenames = c("NS1gallus.bam", "NS2gallus.bam"), > >> outputFormat = "DESeq", conditions = conditions) > >> 1: easyRNASeq("/RNAseqGallus", organism = "Ggallus", chrSizes = > "chrSizes", > >> annotationMethod = "gtf", annotationFile = " > Gallus_gallus.Galgal4.73.gtf ", > >> count = "genes", summarization = "geneModels", format = "bam", > >> gapped = TRUE, filenames = c("NS1gallus.bam", "NS2gallus.bam"), > >> outputFormat = "DESeq", conditions = conditions) > >> > >> > >> -- output of sessionInfo(): > >> > >>> sessionInfo() > >> R version 3.0.2 (2013-09-25) > >> Platform: x86_64-w64-mingw32/x64 (64-bit) > >> > >> locale: > >> [1] LC_COLLATE=Spanish_Spain.1252 LC_CTYPE=Spanish_Spain.1252 > LC_MONETARY=Spanish_Spain.1252 LC_NUMERIC=C > LC_TIME=Spanish_Spain.1252 > >> > >> attached base packages: > >> [1] parallel stats graphics grDevices utils datasets methods > base > >> > >> other attached packages: > >> [1] BSgenome.Ggallus.UCSC.galGal4_1.3.18 BSgenome_1.30.0 > easyRNASeq_1.8.1 ShortRead_1.20.0 > >> [5] Rsamtools_1.14.1 GenomicRanges_1.14.3 > DESeq_1.14.0 lattice_0.20-23 > >> [9] locfit_1.5-9.1 Biostrings_2.30.0 > XVector_0.2.0 IRanges_1.20.4 > >> [13] edgeR_3.4.0 limma_3.18.2 > biomaRt_2.18.0 Biobase_2.22.0 > >> [17] genomeIntervals_1.18.0 BiocGenerics_0.8.0 > intervals_0.14.0 BiocInstaller_1.12.0 > >> > >> loaded via a namespace (and not attached): > >> [1] annotate_1.40.0 AnnotationDbi_1.24.0 bitops_1.0-6 > DBI_0.2-7 genefilter_1.44.0 geneplotter_1.40.0 grid_3.0.2 > >> [8] hwriter_1.3 latticeExtra_0.6-26 LSD_2.5 > RColorBrewer_1.0-5 RCurl_1.95-4.1 RSQLite_0.11.4 > splines_3.0.2 > >> [15] stats4_3.0.2 survival_2.37-4 tools_3.0.2 > XML_3.98-1.1 xtable_1.7-1 zlibbioc_1.8.0 > >> > >> > >> -- > >> Sent via the guest posting facility at bioconductor.org. > > > > _______________________________________________ > > Bioconductor mailing list > > Bioconductor@r-project.org > > https://stat.ethz.ch/mailman/listinfo/bioconductor > > Search the archives: > http://news.gmane.org/gmane.science.biology.informatics.conductor > > _______________________________________________ > Bioconductor mailing list > Bioconductor@r-project.org > https://stat.ethz.ch/mailman/listinfo/bioconductor > Search the archives: > http://news.gmane.org/gmane.science.biology.informatics.conductor > [[alternative HTML version deleted]]

ADD REPLY • link 12.2 years ago Michael Lawrence ★ 11k

0

Entering edit mode

Hej Michael, Good question really. I have a number of reason for this: 1) I?ve been using the genomeIntervals readGff3 function for that - for years now - and I?ve always been satisfied by its performance, especially when parsing the gff/gtf ninth column. The parseGffAttribute and getGffAttribute functions are extremely convenient. I honestly haven?t checked if there was any recent development in rtracklayer / GenomicFeatures similar to these functions. If there were not, I think they would be a great addition to either package. 2) As you might guess it?s essentially historical, back when I started that package in 2009, there was not today?s fantastic set of packages. 3) As you painfully know, there?s about as many gff format as they are gff files, and because my package is a pipeline I really want to make sure that it?s output is consistent, hence I have strict requirement with regards to the gff/gtf format I accept. Which means that times and again, I have to do slight adjustment but I prefer that over outputting garbage. 4) RNA-Seq analyses are filled with pitfalls, hence I think it is essential that users understand the data formats they handle and actually what these analyses are all about. I don?t want them to use my package as they would use a black box. 5) It?s educational. There?s a vignette that describes how to parse and convert gff/gtf annotation in the minimal gff/gtf formatted file that would suit my package Well, I suppose it?s more than you asked for, but here are my reasons ;-) You?re welcome to comment and I?d be happy to look again at rtracklayer (been through GenomicFeatures recently and I like it much) if you would advise me so. Have a nice WE, Cheers, Nico --------------------------------------------------------------- Nicolas Delhomme Genome Biology Computational Support European Molecular Biology Laboratory Tel: +49 6221 387 8310 Email: nicolas.delhomme at embl.de Meyerhofstrasse 1 - Postfach 10.2209 69102 Heidelberg, Germany --------------------------------------------------------------- On 15 Nov 2013, at 12:44, Michael Lawrence <lawrence.michael at="" gene.com=""> wrote: > Why not use rtracklayer / GenomicFeatures for parsing GTF? That format is tough; no reason for everyone to take it on by themselves. > > > > > On Fri, Nov 15, 2013 at 2:40 AM, Nicolas Delhomme <delhomme at="" embl.de=""> wrote: > Hej Natalia! > > There were a number of lines in that particular gtf that violated the assumptions I had about EnsEMBL gtf. Not all the fields in the attributes' column were always set and one of the gene name had a space character in it. I?ve made the parsing of gtf file annotation more flexible/lenient and that should resolve that particular issue you had. The changes should propagate in ~2 days to Bioc with easyRNASeq version 1.8.2. > > Rather than using the geneModel, which implementation is old and has gotten slow because of changes in the underlying architecture, I prefer an approach where I > 1) filter the gtf / gff annotation file for only those lines I?m interested in (e.g. of type exon, mRNA and gene for a gff file) > 2) collapse every exon of a gene into what I call now a ?synthetic transcript?. The reason for changing the naming from geneModel to synthetic transcript is that ?gene model? has different meaning depending on the field of application. > 3) use easyRNASeq with the modified annotation and the ?transcript? count method. > > I?ve detailed that procedure in the vignette section 7.1 . > > Cheers, > > Nico > > --------------------------------------------------------------- > Nicolas Delhomme > > Genome Biology Computational Support > > European Molecular Biology Laboratory > > Tel: +49 6221 387 8310 > Email: nicolas.delhomme at embl.de > Meyerhofstrasse 1 - Postfach 10.2209 > 69102 Heidelberg, Germany > --------------------------------------------------------------- > > > > > > On 12 Nov 2013, at 14:21, Nicolas Delhomme <delhomme at="" embl.de=""> wrote: > > > Hej Natalia! > > > > This is not the first time that I?ve seen this error on the list, but I?ve not been able to reproduce it so far with my own data. Would you mind sharing some data offline, just an excerpt of your files would do. If that?s OK, I can create and give access to a folder on my box account. > > > > I had already relaxed the constraint on parsing a gtf file in a previous update but forgot to reflect the changes in the error message. Only the gene_id and transcript_id are actually mandatory. I would not expect any issue with the EnsEMBL gtf file, but I?ll have a look at why it fails for Gallus galls one and let you know asap. > > > > This as nothing to do with this error, but by looking at your command line, I saw that you provide a character vector to the chrSize argument. This is not necessary as this information is extracted from the bam file in your case, so you can just drop the chrSizes = ?chrSizes? from your command line. I?ve added some extra check in the method to detect this now. Thanks :-) > > > > Cheers, > > > > Nico > > > > --------------------------------------------------------------- > > Nicolas Delhomme > > > > Genome Biology Computational Support > > > > European Molecular Biology Laboratory > > > > Tel: +49 6221 387 8310 > > Email: nicolas.delhomme at embl.de > > Meyerhofstrasse 1 - Postfach 10.2209 > > 69102 Heidelberg, Germany > > --------------------------------------------------------------- > > > > > > > > > > > > On 11 Nov 2013, at 16:31, Natalia [guest] <guest at="" bioconductor.org=""> wrote: > > > >> > >> Dear all, > >> I would like to make a count table to use it in DESeq. I?ve tried to use easyRNAseq but I have a problem with the annotation file. I?ve downloaded the file Gallus_gallus.Galgal4.73.gtf from Ensembl. As I run into the problem Error in .doBasicCount(obj) : The genomicAnnotation slot is empty, I modified the file and added chr before the chromosome number. The next problem was this: > >> > >> Your gtf file: Gallus_gallus.Galgal4.73.gtf does not contain all the required fields: gene_id, transcript_id, exon_number, gene_name. > >> > >> To solve this problem: > >> - I deleted all the entries without gene_name (first example): > >> > >> gene_id "ENSGALG00000009771"; transcript_id "ENSGALT00000015891"; exon_number "1"; gene_biotype "protein_coding"; exon_id "ENSGALE00000301221"; > >> > >> gene_id "ENSGALG00000009783"; transcript_id "ENSGALT00000015914"; exon_number "2"; gene_name "GOLGB1"; gene_biotype "protein_coding"; transcript_name "GOLGB1-201"; exon_id "ENSGALE00000105891"; > >> > >> - I checked the chromosome numbers and deleted the entries that didn?t match any chromosome from BSgenome.Ggallus.UCSC.galGal4 (I can?t find any entry corresponding to chr32 in the Gallus_gallus.Galgal4.73.gtf file, I don?t know if it is a problem): > >> > >> - I searched for semicolons and single quotes ? in the gene names, but I didn?t find any on the final file. > >> > >> - I deleted all the columns after gene_name. > >> > >> So finally the annotation file entries look like this: > >> chr1 protein_coding exon 19962541 19963992 > >> . + . gene_id "ENSGALG00000000003"; transcript_id "ENSGALT00000000003"; exon_number "2"; gene_name "PANX2"; > >> > >> Nothing works; the error message is always the same. So, I don?t know what else I can do. Could you please help me? > >> Thank you in advance! > >> Cheers > >> > >> Natalia > >> > >> > >> here is my code: > >>> count.table <- easyRNASeq("/RNAseqGallus", organism="Ggallus", chrSizes="chrSizes", annotationMethod="gtf", annotationFile="Gallus_gallus.Galgal4.73.gtf", count="genes", summarization="geneModels", format="bam", gapped=TRUE, filenames=c("NS1gallus.bam","NS2gallus.bam"), outputFormat="DESeq", conditions=conditions) > >> Checking arguments... > >> Fetching annotations... > >> Read 334620 records > >> Error en .getGtfRange(organismName(obj), filename = filename, ignoreWarnings = ignoreWarnings, : > >> Your gtf file: Gallus_gallus.Galgal4.73.gtf does not contain all the required fields: gene_id, transcript_id, exon_number, gene_name. > >> Adem?s: Mensajes de aviso perdidos > >> 1: In easyRNASeq("/RNAseqGallus", organism = "Ggallus", chrSizes = "chrSizes", : > >> Your organism has no mapping defined to perform the validity check for the UCSC compliance of the chromosome name. > >> Defined organism's mapping can be listed using the 'knownOrganisms' function. > >> To benefit from the validity check, you can provide a 'chr.map' to your 'easyRNASeq' function call. > >> As you did not do so, 'validity.check' is turned off > >> 2: In .Method(..., deparse.level = deparse.level) : > >> number of columns of result is not a multiple of vector length (arg 1) > >> > >>> traceback() > >> 6: stop(paste("Your gtf file: ", filename, " does not contain all the required fields: ", > >> paste(fields, collapse = ", "), ".", sep = "")) > >> 5: .getGtfRange(organismName(obj), filename = filename, ignoreWarnings = ignoreWarnings, > >> ...) > >> 4: fetchAnnotation(obj, method = annotationMethod, filename = annotationFile, > >> ignoreWarnings = ignoreWarnings, ...) > >> 3: fetchAnnotation(obj, method = annotationMethod, filename = annotationFile, > >> ignoreWarnings = ignoreWarnings, ...) > >> 2: easyRNASeq("/RNAseqGallus", organism = "Ggallus", chrSizes = "chrSizes", > >> annotationMethod = "gtf", annotationFile = " Gallus_gallus.Galgal4.73.gtf ", > >> count = "genes", summarization = "geneModels", format = "bam", > >> gapped = TRUE, filenames = c("NS1gallus.bam", "NS2gallus.bam"), > >> outputFormat = "DESeq", conditions = conditions) > >> 1: easyRNASeq("/RNAseqGallus", organism = "Ggallus", chrSizes = "chrSizes", > >> annotationMethod = "gtf", annotationFile = " Gallus_gallus.Galgal4.73.gtf ", > >> count = "genes", summarization = "geneModels", format = "bam", > >> gapped = TRUE, filenames = c("NS1gallus.bam", "NS2gallus.bam"), > >> outputFormat = "DESeq", conditions = conditions) > >> > >> > >> -- output of sessionInfo(): > >> > >>> sessionInfo() > >> R version 3.0.2 (2013-09-25) > >> Platform: x86_64-w64-mingw32/x64 (64-bit) > >> > >> locale: > >> [1] LC_COLLATE=Spanish_Spain.1252 LC_CTYPE=Spanish_Spain.1252 LC_MONETARY=Spanish_Spain.1252 LC_NUMERIC=C LC_TIME=Spanish_Spain.1252 > >> > >> attached base packages: > >> [1] parallel stats graphics grDevices utils datasets methods base > >> > >> other attached packages: > >> [1] BSgenome.Ggallus.UCSC.galGal4_1.3.18 BSgenome_1.30.0 easyRNASeq_1.8.1 ShortRead_1.20.0 > >> [5] Rsamtools_1.14.1 GenomicRanges_1.14.3 DESeq_1.14.0 lattice_0.20-23 > >> [9] locfit_1.5-9.1 Biostrings_2.30.0 XVector_0.2.0 IRanges_1.20.4 > >> [13] edgeR_3.4.0 limma_3.18.2 biomaRt_2.18.0 Biobase_2.22.0 > >> [17] genomeIntervals_1.18.0 BiocGenerics_0.8.0 intervals_0.14.0 BiocInstaller_1.12.0 > >> > >> loaded via a namespace (and not attached): > >> [1] annotate_1.40.0 AnnotationDbi_1.24.0 bitops_1.0-6 DBI_0.2-7 genefilter_1.44.0 geneplotter_1.40.0 grid_3.0.2 > >> [8] hwriter_1.3 latticeExtra_0.6-26 LSD_2.5 RColorBrewer_1.0-5 RCurl_1.95-4.1 RSQLite_0.11.4 splines_3.0.2 > >> [15] stats4_3.0.2 survival_2.37-4 tools_3.0.2 XML_3.98-1.1 xtable_1.7-1 zlibbioc_1.8.0 > >> > >> > >> -- > >> Sent via the guest posting facility at bioconductor.org. > > > > _______________________________________________ > > Bioconductor mailing list > > Bioconductor at r-project.org > > https://stat.ethz.ch/mailman/listinfo/bioconductor > > Search the archives: http://news.gmane.org/gmane.science.biology.informatics.conductor > > _______________________________________________ > Bioconductor mailing list > Bioconductor at r-project.org > https://stat.ethz.ch/mailman/listinfo/bioconductor > Search the archives: http://news.gmane.org/gmane.science.biology.informatics.conductor >

ADD REPLY • link 12.2 years ago delhomme@embl.de ★ 1.2k

0

Entering edit mode

It might be worth taking a look at rtracklayer and the TranscriptDb stuff in GenomicFeatures. It could save you time, and if you notice any deficiencies in rtracklayer, it would help me. For example, if the attribute parsing is a bottleneck, I can push it down to C. Michael On Fri, Nov 15, 2013 at 8:23 AM, Nicolas Delhomme <delhomme@embl.de> wrote: > Hej Michael, > > Good question really. I have a number of reason for this: > > 1) Ive been using the genomeIntervals readGff3 function for that - for > years now - and Ive always been satisfied by its performance, especially > when parsing the gff/gtf ninth column. The parseGffAttribute and > getGffAttribute functions are extremely convenient. I honestly havent > checked if there was any recent development in rtracklayer / > GenomicFeatures similar to these functions. If there were not, I think they > would be a great addition to either package. > > 2) As you might guess its essentially historical, back when I started > that package in 2009, there was not todays fantastic set of packages. > > 3) As you painfully know, theres about as many gff format as they are gff > files, and because my package is a pipeline I really want to make sure that > its output is consistent, hence I have strict requirement with regards to > the gff/gtf format I accept. Which means that times and again, I have to do > slight adjustment but I prefer that over outputting garbage. > > 4) RNA-Seq analyses are filled with pitfalls, hence I think it is > essential that users understand the data formats they handle and actually > what these analyses are all about. I dont want them to use my package as > they would use a black box. > > 5) Its educational. Theres a vignette that describes how to parse and > convert gff/gtf annotation in the minimal gff/gtf formatted file that would > suit my package > > Well, I suppose its more than you asked for, but here are my reasons ;-) > Youre welcome to comment and Id be happy to look again at rtracklayer > (been through GenomicFeatures recently and I like it much) if you would > advise me so. > > Have a nice WE, > > Cheers, > > Nico > > > --------------------------------------------------------------- > Nicolas Delhomme > > Genome Biology Computational Support > > European Molecular Biology Laboratory > > Tel: +49 6221 387 8310 > Email: nicolas.delhomme@embl.de > Meyerhofstrasse 1 - Postfach 10.2209 > 69102 Heidelberg, Germany > --------------------------------------------------------------- > > > > > > On 15 Nov 2013, at 12:44, Michael Lawrence <lawrence.michael@gene.com> > wrote: > > > Why not use rtracklayer / GenomicFeatures for parsing GTF? That format > is tough; no reason for everyone to take it on by themselves. > > > > > > > > > > On Fri, Nov 15, 2013 at 2:40 AM, Nicolas Delhomme <delhomme@embl.de> > wrote: > > Hej Natalia! > > > > There were a number of lines in that particular gtf that violated the > assumptions I had about EnsEMBL gtf. Not all the fields in the attributes' > column were always set and one of the gene name had a space character in > it. Ive made the parsing of gtf file annotation more flexible/lenient and > that should resolve that particular issue you had. The changes should > propagate in ~2 days to Bioc with easyRNASeq version 1.8.2. > > > > Rather than using the geneModel, which implementation is old and has > gotten slow because of changes in the underlying architecture, I prefer an > approach where I > > 1) filter the gtf / gff annotation file for only those lines Im > interested in (e.g. of type exon, mRNA and gene for a gff file) > > 2) collapse every exon of a gene into what I call now a synthetic > transcript. The reason for changing the naming from geneModel to synthetic > transcript is that gene model has different meaning depending on the > field of application. > > 3) use easyRNASeq with the modified annotation and the transcript > count method. > > > > Ive detailed that procedure in the vignette section 7.1 . > > > > Cheers, > > > > Nico > > > > --------------------------------------------------------------- > > Nicolas Delhomme > > > > Genome Biology Computational Support > > > > European Molecular Biology Laboratory > > > > Tel: +49 6221 387 8310 > > Email: nicolas.delhomme@embl.de > > Meyerhofstrasse 1 - Postfach 10.2209 > > 69102 Heidelberg, Germany > > --------------------------------------------------------------- > > > > > > > > > > > > On 12 Nov 2013, at 14:21, Nicolas Delhomme <delhomme@embl.de> wrote: > > > > > Hej Natalia! > > > > > > This is not the first time that Ive seen this error on the list, but > Ive not been able to reproduce it so far with my own data. Would you mind > sharing some data offline, just an excerpt of your files would do. If > thats OK, I can create and give access to a folder on my box account. > > > > > > I had already relaxed the constraint on parsing a gtf file in a > previous update but forgot to reflect the changes in the error message. > Only the gene_id and transcript_id are actually mandatory. I would not > expect any issue with the EnsEMBL gtf file, but Ill have a look at why it > fails for Gallus galls one and let you know asap. > > > > > > This as nothing to do with this error, but by looking at your command > line, I saw that you provide a character vector to the chrSize argument. > This is not necessary as this information is extracted from the bam file in > your case, so you can just drop the chrSizes = chrSizes from your command > line. Ive added some extra check in the method to detect this now. Thanks > :-) > > > > > > Cheers, > > > > > > Nico > > > > > > --------------------------------------------------------------- > > > Nicolas Delhomme > > > > > > Genome Biology Computational Support > > > > > > European Molecular Biology Laboratory > > > > > > Tel: +49 6221 387 8310 > > > Email: nicolas.delhomme@embl.de > > > Meyerhofstrasse 1 - Postfach 10.2209 > > > 69102 Heidelberg, Germany > > > --------------------------------------------------------------- > > > > > > > > > > > > > > > > > > On 11 Nov 2013, at 16:31, Natalia [guest] <guest@bioconductor.org> > wrote: > > > > > >> > > >> Dear all, > > >> I would like to make a count table to use it in DESeq. I´ve tried to > use easyRNAseq but I have a problem with the annotation file. Ive > downloaded the file Gallus_gallus.Galgal4.73.gtf from Ensembl. As I run > into the problem Error in .doBasicCount(obj) : The genomicAnnotation slot > is empty, I modified the file and added chr before the chromosome number. > The next problem was this: > > >> > > >> Your gtf file: Gallus_gallus.Galgal4.73.gtf does not contain all the > required fields: gene_id, transcript_id, exon_number, gene_name. > > >> > > >> To solve this problem: > > >> - I deleted all the entries without gene_name (first example): > > >> > > >> gene_id "ENSGALG00000009771"; transcript_id "ENSGALT00000015891"; > exon_number "1"; gene_biotype "protein_coding"; exon_id > "ENSGALE00000301221"; > > >> > > >> gene_id "ENSGALG00000009783"; transcript_id "ENSGALT00000015914"; > exon_number "2"; gene_name "GOLGB1"; gene_biotype "protein_coding"; > transcript_name "GOLGB1-201"; exon_id "ENSGALE00000105891"; > > >> > > >> - I checked the chromosome numbers and deleted the entries that > didnt match any chromosome from BSgenome.Ggallus.UCSC.galGal4 (I cant > find any entry corresponding to chr32 in the Gallus_gallus.Galgal4.73.gtf > file, I dont know if it is a problem): > > >> > > >> - I searched for semicolons and single quotes in the gene names, > but I didnt find any on the final file. > > >> > > >> - I deleted all the columns after gene_name. > > >> > > >> So finally the annotation file entries look like this: > > >> chr1 protein_coding exon 19962541 19963992 > > >> . + . gene_id "ENSGALG00000000003"; > transcript_id "ENSGALT00000000003"; exon_number "2"; gene_name "PANX2"; > > >> > > >> Nothing works; the error message is always the same. So, I dont know > what else I can do. Could you please help me? > > >> Thank you in advance! > > >> Cheers > > >> > > >> Natalia > > >> > > >> > > >> here is my code: > > >>> count.table <- easyRNASeq("/RNAseqGallus", organism="Ggallus", > chrSizes="chrSizes", annotationMethod="gtf", > annotationFile="Gallus_gallus.Galgal4.73.gtf", count="genes", > summarization="geneModels", format="bam", gapped=TRUE, > filenames=c("NS1gallus.bam","NS2gallus.bam"), outputFormat="DESeq", > conditions=conditions) > > >> Checking arguments... > > >> Fetching annotations... > > >> Read 334620 records > > >> Error en .getGtfRange(organismName(obj), filename = filename, > ignoreWarnings = ignoreWarnings, : > > >> Your gtf file: Gallus_gallus.Galgal4.73.gtf does not contain all the > required fields: gene_id, transcript_id, exon_number, gene_name. > > >> Además: Mensajes de aviso perdidos > > >> 1: In easyRNASeq("/RNAseqGallus", organism = "Ggallus", chrSizes = > "chrSizes", : > > >> Your organism has no mapping defined to perform the validity check > for the UCSC compliance of the chromosome name. > > >> Defined organism's mapping can be listed using the 'knownOrganisms' > function. > > >> To benefit from the validity check, you can provide a 'chr.map' to > your 'easyRNASeq' function call. > > >> As you did not do so, 'validity.check' is turned off > > >> 2: In .Method(..., deparse.level = deparse.level) : > > >> number of columns of result is not a multiple of vector length (arg 1) > > >> > > >>> traceback() > > >> 6: stop(paste("Your gtf file: ", filename, " does not contain all the > required fields: ", > > >> paste(fields, collapse = ", "), ".", sep = "")) > > >> 5: .getGtfRange(organismName(obj), filename = filename, > ignoreWarnings = ignoreWarnings, > > >> ...) > > >> 4: fetchAnnotation(obj, method = annotationMethod, filename = > annotationFile, > > >> ignoreWarnings = ignoreWarnings, ...) > > >> 3: fetchAnnotation(obj, method = annotationMethod, filename = > annotationFile, > > >> ignoreWarnings = ignoreWarnings, ...) > > >> 2: easyRNASeq("/RNAseqGallus", organism = "Ggallus", chrSizes = > "chrSizes", > > >> annotationMethod = "gtf", annotationFile = " > Gallus_gallus.Galgal4.73.gtf ", > > >> count = "genes", summarization = "geneModels", format = "bam", > > >> gapped = TRUE, filenames = c("NS1gallus.bam", "NS2gallus.bam"), > > >> outputFormat = "DESeq", conditions = conditions) > > >> 1: easyRNASeq("/RNAseqGallus", organism = "Ggallus", chrSizes = > "chrSizes", > > >> annotationMethod = "gtf", annotationFile = " > Gallus_gallus.Galgal4.73.gtf ", > > >> count = "genes", summarization = "geneModels", format = "bam", > > >> gapped = TRUE, filenames = c("NS1gallus.bam", "NS2gallus.bam"), > > >> outputFormat = "DESeq", conditions = conditions) > > >> > > >> > > >> -- output of sessionInfo(): > > >> > > >>> sessionInfo() > > >> R version 3.0.2 (2013-09-25) > > >> Platform: x86_64-w64-mingw32/x64 (64-bit) > > >> > > >> locale: > > >> [1] LC_COLLATE=Spanish_Spain.1252 LC_CTYPE=Spanish_Spain.1252 > LC_MONETARY=Spanish_Spain.1252 LC_NUMERIC=C > LC_TIME=Spanish_Spain.1252 > > >> > > >> attached base packages: > > >> [1] parallel stats graphics grDevices utils datasets > methods base > > >> > > >> other attached packages: > > >> [1] BSgenome.Ggallus.UCSC.galGal4_1.3.18 BSgenome_1.30.0 > easyRNASeq_1.8.1 ShortRead_1.20.0 > > >> [5] Rsamtools_1.14.1 GenomicRanges_1.14.3 > DESeq_1.14.0 lattice_0.20-23 > > >> [9] locfit_1.5-9.1 Biostrings_2.30.0 > XVector_0.2.0 IRanges_1.20.4 > > >> [13] edgeR_3.4.0 limma_3.18.2 > biomaRt_2.18.0 Biobase_2.22.0 > > >> [17] genomeIntervals_1.18.0 BiocGenerics_0.8.0 > intervals_0.14.0 BiocInstaller_1.12.0 > > >> > > >> loaded via a namespace (and not attached): > > >> [1] annotate_1.40.0 AnnotationDbi_1.24.0 bitops_1.0-6 > DBI_0.2-7 genefilter_1.44.0 geneplotter_1.40.0 grid_3.0.2 > > >> [8] hwriter_1.3 latticeExtra_0.6-26 LSD_2.5 > RColorBrewer_1.0-5 RCurl_1.95-4.1 RSQLite_0.11.4 > splines_3.0.2 > > >> [15] stats4_3.0.2 survival_2.37-4 tools_3.0.2 > XML_3.98-1.1 xtable_1.7-1 zlibbioc_1.8.0 > > >> > > >> > > >> -- > > >> Sent via the guest posting facility at bioconductor.org. > > > > > > _______________________________________________ > > > Bioconductor mailing list > > > Bioconductor@r-project.org > > > https://stat.ethz.ch/mailman/listinfo/bioconductor > > > Search the archives: > http://news.gmane.org/gmane.science.biology.informatics.conductor > > > > _______________________________________________ > > Bioconductor mailing list > > Bioconductor@r-project.org > > https://stat.ethz.ch/mailman/listinfo/bioconductor > > Search the archives: > http://news.gmane.org/gmane.science.biology.informatics.conductor > > > > [[alternative HTML version deleted]]

ADD REPLY • link 12.2 years ago Michael Lawrence ★ 11k

0

Entering edit mode

Hi Nico! Im sorry to bother you again but Im still having the same error with easyRNASeq. The only difference is that the error message doesnt include the exon_number and gene_name as required fields. Your gtf file: Gallus_gallus.Galgal4.73.gtf does not contain all the required fields: gene_id, transcript_id. I havent tried rtracklayer yet. Thank you for your help! Cheers Natalia Date: Fri, 15 Nov 2013 09:58:13 -0800 From: lawrence.michael@gene.com To: delhomme@embl.de CC: lawrence.michael@gene.com; bioconductor@r-project.org Subject: Re: [BioC] GTF file error when using easyRNAseq It might be worth taking a look at rtracklayer and the TranscriptDb stuff in GenomicFeatures. It could save you time, and if you notice any deficiencies in rtracklayer, it would help me. For example, if the attribute parsing is a bottleneck, I can push it down to C. Michael On Fri, Nov 15, 2013 at 8:23 AM, Nicolas Delhomme <delhomme@embl.de> wrote: > Hej Michael, > > Good question really. I have a number of reason for this: > > 1) Ive been using the genomeIntervals readGff3 function for that - for > years now - and Ive always been satisfied by its performance, especially > when parsing the gff/gtf ninth column. The parseGffAttribute and > getGffAttribute functions are extremely convenient. I honestly havent > checked if there was any recent development in rtracklayer / > GenomicFeatures similar to these functions. If there were not, I think they > would be a great addition to either package. > > 2) As you might guess its essentially historical, back when I started > that package in 2009, there was not todays fantastic set of packages. > > 3) As you painfully know, theres about as many gff format as they are gff > files, and because my package is a pipeline I really want to make sure that > its output is consistent, hence I have strict requirement with regards to > the gff/gtf format I accept. Which means that times and again, I have to do > slight adjustment but I prefer that over outputting garbage. > > 4) RNA-Seq analyses are filled with pitfalls, hence I think it is > essential that users understand the data formats they handle and actually > what these analyses are all about. I dont want them to use my package as > they would use a black box. > > 5) Its educational. Theres a vignette that describes how to parse and > convert gff/gtf annotation in the minimal gff/gtf formatted file that would > suit my package > > Well, I suppose its more than you asked for, but here are my reasons ;-) > Youre welcome to comment and Id be happy to look again at rtracklayer > (been through GenomicFeatures recently and I like it much) if you would > advise me so. > > Have a nice WE, > > Cheers, > > Nico > > > --------------------------------------------------------------- > Nicolas Delhomme > > Genome Biology Computational Support > > European Molecular Biology Laboratory > > Tel: +49 6221 387 8310 > Email: nicolas.delhomme@embl.de > Meyerhofstrasse 1 - Postfach 10.2209 > 69102 Heidelberg, Germany > --------------------------------------------------------------- > > > > > > On 15 Nov 2013, at 12:44, Michael Lawrence <lawrence.michael@gene.com> > wrote: > > > Why not use rtracklayer / GenomicFeatures for parsing GTF? That format > is tough; no reason for everyone to take it on by themselves. > > > > > > > > > > On Fri, Nov 15, 2013 at 2:40 AM, Nicolas Delhomme <delhomme@embl.de> > wrote: > > Hej Natalia! > > > > There were a number of lines in that particular gtf that violated the > assumptions I had about EnsEMBL gtf. Not all the fields in the attributes' > column were always set and one of the gene name had a space character in > it. Ive made the parsing of gtf file annotation more flexible/lenient and > that should resolve that particular issue you had. The changes should > propagate in ~2 days to Bioc with easyRNASeq version 1.8.2. > > > > Rather than using the geneModel, which implementation is old and has > gotten slow because of changes in the underlying architecture, I prefer an > approach where I > > 1) filter the gtf / gff annotation file for only those lines Im > interested in (e.g. of type exon, mRNA and gene for a gff file) > > 2) collapse every exon of a gene into what I call now a synthetic > transcript. The reason for changing the naming from geneModel to synthetic > transcript is that gene model has different meaning depending on the > field of application. > > 3) use easyRNASeq with the modified annotation and the transcript > count method. > > > > Ive detailed that procedure in the vignette section 7.1 . > > > > Cheers, > > > > Nico > > > > --------------------------------------------------------------- > > Nicolas Delhomme > > > > Genome Biology Computational Support > > > > European Molecular Biology Laboratory > > > > Tel: +49 6221 387 8310 > > Email: nicolas.delhomme@embl.de > > Meyerhofstrasse 1 - Postfach 10.2209 > > 69102 Heidelberg, Germany > > --------------------------------------------------------------- > > > > > > > > > > > > On 12 Nov 2013, at 14:21, Nicolas Delhomme <delhomme@embl.de> wrote: > > > > > Hej Natalia! > > > > > > This is not the first time that Ive seen this error on the list, but > Ive not been able to reproduce it so far with my own data. Would you mind > sharing some data offline, just an excerpt of your files would do. If > thats OK, I can create and give access to a folder on my box account. > > > > > > I had already relaxed the constraint on parsing a gtf file in a > previous update but forgot to reflect the changes in the error message. > Only the gene_id and transcript_id are actually mandatory. I would not > expect any issue with the EnsEMBL gtf file, but Ill have a look at why it > fails for Gallus galls one and let you know asap. > > > > > > This as nothing to do with this error, but by looking at your command > line, I saw that you provide a character vector to the chrSize argument. > This is not necessary as this information is extracted from the bam file in > your case, so you can just drop the chrSizes = chrSizes from your command > line. Ive added some extra check in the method to detect this now. Thanks > :-) > > > > > > Cheers, > > > > > > Nico > > > > > > --------------------------------------------------------------- > > > Nicolas Delhomme > > > > > > Genome Biology Computational Support > > > > > > European Molecular Biology Laboratory > > > > > > Tel: +49 6221 387 8310 > > > Email: nicolas.delhomme@embl.de > > > Meyerhofstrasse 1 - Postfach 10.2209 > > > 69102 Heidelberg, Germany > > > --------------------------------------------------------------- > > > > > > > > > > > > > > > > > > On 11 Nov 2013, at 16:31, Natalia [guest] <guest@bioconductor.org> > wrote: > > > > > >> > > >> Dear all, > > >> I would like to make a count table to use it in DESeq. I´ve tried to > use easyRNAseq but I have a problem with the annotation file. Ive > downloaded the file Gallus_gallus.Galgal4.73.gtf from Ensembl. As I run > into the problem Error in .doBasicCount(obj) : The genomicAnnotation slot > is empty, I modified the file and added chr before the chromosome number. > The next problem was this: > > >> > > >> Your gtf file: Gallus_gallus.Galgal4.73.gtf does not contain all the > required fields: gene_id, transcript_id, exon_number, gene_name. > > >> > > >> To solve this problem: > > >> - I deleted all the entries without gene_name (first example): > > >> > > >> gene_id "ENSGALG00000009771"; transcript_id "ENSGALT00000015891"; > exon_number "1"; gene_biotype "protein_coding"; exon_id > "ENSGALE00000301221"; > > >> > > >> gene_id "ENSGALG00000009783"; transcript_id "ENSGALT00000015914"; > exon_number "2"; gene_name "GOLGB1"; gene_biotype "protein_coding"; > transcript_name "GOLGB1-201"; exon_id "ENSGALE00000105891"; > > >> > > >> - I checked the chromosome numbers and deleted the entries that > didnt match any chromosome from BSgenome.Ggallus.UCSC.galGal4 (I cant > find any entry corresponding to chr32 in the Gallus_gallus.Galgal4.73.gtf > file, I dont know if it is a problem): > > >> > > >> - I searched for semicolons and single quotes in the gene names, > but I didnt find any on the final file. > > >> > > >> - I deleted all the columns after gene_name. > > >> > > >> So finally the annotation file entries look like this: > > >> chr1 protein_coding exon 19962541 19963992 > > >> . + . gene_id "ENSGALG00000000003"; > transcript_id "ENSGALT00000000003"; exon_number "2"; gene_name "PANX2"; > > >> > > >> Nothing works; the error message is always the same. So, I dont know > what else I can do. Could you please help me? > > >> Thank you in advance! > > >> Cheers > > >> > > >> Natalia > > >> > > >> > > >> here is my code: > > >>> count.table <- easyRNASeq("/RNAseqGallus", organism="Ggallus", > chrSizes="chrSizes", annotationMethod="gtf", > annotationFile="Gallus_gallus.Galgal4.73.gtf", count="genes", > summarization="geneModels", format="bam", gapped=TRUE, > filenames=c("NS1gallus.bam","NS2gallus.bam"), outputFormat="DESeq", > conditions=conditions) > > >> Checking arguments... > > >> Fetching annotations... > > >> Read 334620 records > > >> Error en .getGtfRange(organismName(obj), filename = filename, > ignoreWarnings = ignoreWarnings, : > > >> Your gtf file: Gallus_gallus.Galgal4.73.gtf does not contain all the > required fields: gene_id, transcript_id, exon_number, gene_name. > > >> Además: Mensajes de aviso perdidos > > >> 1: In easyRNASeq("/RNAseqGallus", organism = "Ggallus", chrSizes = > "chrSizes", : > > >> Your organism has no mapping defined to perform the validity check > for the UCSC compliance of the chromosome name. > > >> Defined organism's mapping can be listed using the 'knownOrganisms' > function. > > >> To benefit from the validity check, you can provide a 'chr.map' to > your 'easyRNASeq' function call. > > >> As you did not do so, 'validity.check' is turned off > > >> 2: In .Method(..., deparse.level = deparse.level) : > > >> number of columns of result is not a multiple of vector length (arg 1) > > >> > > >>> traceback() > > >> 6: stop(paste("Your gtf file: ", filename, " does not contain all the > required fields: ", > > >> paste(fields, collapse = ", "), ".", sep = "")) > > >> 5: .getGtfRange(organismName(obj), filename = filename, > ignoreWarnings = ignoreWarnings, > > >> ...) > > >> 4: fetchAnnotation(obj, method = annotationMethod, filename = > annotationFile, > > >> ignoreWarnings = ignoreWarnings, ...) > > >> 3: fetchAnnotation(obj, method = annotationMethod, filename = > annotationFile, > > >> ignoreWarnings = ignoreWarnings, ...) > > >> 2: easyRNASeq("/RNAseqGallus", organism = "Ggallus", chrSizes = > "chrSizes", > > >> annotationMethod = "gtf", annotationFile = " > Gallus_gallus.Galgal4.73.gtf ", > > >> count = "genes", summarization = "geneModels", format = "bam", > > >> gapped = TRUE, filenames = c("NS1gallus.bam", "NS2gallus.bam"), > > >> outputFormat = "DESeq", conditions = conditions) > > >> 1: easyRNASeq("/RNAseqGallus", organism = "Ggallus", chrSizes = > "chrSizes", > > >> annotationMethod = "gtf", annotationFile = " > Gallus_gallus.Galgal4.73.gtf ", > > >> count = "genes", summarization = "geneModels", format = "bam", > > >> gapped = TRUE, filenames = c("NS1gallus.bam", "NS2gallus.bam"), > > >> outputFormat = "DESeq", conditions = conditions) > > >> > > >> > > >> -- output of sessionInfo(): > > >> > > >>> sessionInfo() > > >> R version 3.0.2 (2013-09-25) > > >> Platform: x86_64-w64-mingw32/x64 (64-bit) > > >> > > >> locale: > > >> [1] LC_COLLATE=Spanish_Spain.1252 LC_CTYPE=Spanish_Spain.1252 > LC_MONETARY=Spanish_Spain.1252 LC_NUMERIC=C > LC_TIME=Spanish_Spain.1252 > > >> > > >> attached base packages: > > >> [1] parallel stats graphics grDevices utils datasets > methods base > > >> > > >> other attached packages: > > >> [1] BSgenome.Ggallus.UCSC.galGal4_1.3.18 BSgenome_1.30.0 > easyRNASeq_1.8.1 ShortRead_1.20.0 > > >> [5] Rsamtools_1.14.1 GenomicRanges_1.14.3 > DESeq_1.14.0 lattice_0.20-23 > > >> [9] locfit_1.5-9.1 Biostrings_2.30.0 > XVector_0.2.0 IRanges_1.20.4 > > >> [13] edgeR_3.4.0 limma_3.18.2 > biomaRt_2.18.0 Biobase_2.22.0 > > >> [17] genomeIntervals_1.18.0 BiocGenerics_0.8.0 > intervals_0.14.0 BiocInstaller_1.12.0 > > >> > > >> loaded via a namespace (and not attached): > > >> [1] annotate_1.40.0 AnnotationDbi_1.24.0 bitops_1.0-6 > DBI_0.2-7 genefilter_1.44.0 geneplotter_1.40.0 grid_3.0.2 > > >> [8] hwriter_1.3 latticeExtra_0.6-26 LSD_2.5 > RColorBrewer_1.0-5 RCurl_1.95-4.1 RSQLite_0.11.4 > splines_3.0.2 > > >> [15] stats4_3.0.2 survival_2.37-4 tools_3.0.2 > XML_3.98-1.1 xtable_1.7-1 zlibbioc_1.8.0 > > >> > > >> > > >> -- > > >> Sent via the guest posting facility at bioconductor.org. > > > > > > _______________________________________________ > > > Bioconductor mailing list > > > Bioconductor@r-project.org > > > https://stat.ethz.ch/mailman/listinfo/bioconductor > > > Search the archives: > http://news.gmane.org/gmane.science.biology.informatics.conductor > > > > _______________________________________________ > > Bioconductor mailing list > > Bioconductor@r-project.org > > https://stat.ethz.ch/mailman/listinfo/bioconductor > > Search the archives: > http://news.gmane.org/gmane.science.biology.informatics.conductor > > > > [[alternative HTML version deleted]] _______________________________________________ Bioconductor mailing list Bioconductor@r-project.org https://stat.ethz.ch/mailman/listinfo/bioconductor Search the archives: http://news.gmane.org/gmane.science.biology.informatics.conductor [[alternative HTML version deleted]]

ADD REPLY • link 12.2 years ago Natalia Sevane ▴ 20

0

Entering edit mode

Hej Natalia! I can?t reproduce that error. Are you using the gtf file you had modified? I just downloaded it from ensembl, unzipped it and ran the bit of code that complains in your case and it went smoothly through. If you had done the same, then it?s very strange. In that case, could you run and paste the output of: easyRNASeq:::.getGtfRange(filename="Gallus_gallus.Galgal4.73.gtf") and easyRNASeq:::.getGtfRange Cheers, Nico --------------------------------------------------------------- Nicolas Delhomme Nathaniel Street Lab Department of Plant Physiology Ume? Plant Science Center Tel: +46 90 786 7989 Email: nicolas.delhomme at umu.se SLU - Ume? universitet Ume? S-901 87 Sweden --------------------------------------------------------------- On 19 Nov 2013, at 08:10, Natalia Sevane <nata_sf at="" hotmail.com=""> wrote: > > > Hi Nico! > > I?m sorry to bother you again but I?m > still having the same error with easyRNASeq. The only difference is that the > error message doesn?t include the exon_number and gene_name as required fields. > > > > Your gtf file: Gallus_gallus.Galgal4.73.gtf does not contain all the > required fields: gene_id, transcript_id. > > > > I haven?t tried rtracklayer yet. > > Thank you for your help! > > Cheers > > > > Natalia > > > > Date: Fri, 15 Nov 2013 09:58:13 -0800 > From: lawrence.michael at gene.com > To: delhomme at embl.de > CC: lawrence.michael at gene.com; bioconductor at r-project.org > Subject: Re: [BioC] GTF file error when using easyRNAseq > > It might be worth taking a look at rtracklayer and the TranscriptDb stuff > in GenomicFeatures. It could save you time, and if you notice any > deficiencies in rtracklayer, it would help me. For example, if the > attribute parsing is a bottleneck, I can push it down to C. > > Michael > > On Fri, Nov 15, 2013 at 8:23 AM, Nicolas Delhomme <delhomme at="" embl.de=""> wrote: > >> Hej Michael, >> >> Good question really. I have a number of reason for this: >> >> 1) I?ve been using the genomeIntervals readGff3 function for that - for >> years now - and I?ve always been satisfied by its performance, especially >> when parsing the gff/gtf ninth column. The parseGffAttribute and >> getGffAttribute functions are extremely convenient. I honestly haven?t >> checked if there was any recent development in rtracklayer / >> GenomicFeatures similar to these functions. If there were not, I think they >> would be a great addition to either package. >> >> 2) As you might guess it?s essentially historical, back when I started >> that package in 2009, there was not today?s fantastic set of packages. >> >> 3) As you painfully know, there?s about as many gff format as they are gff >> files, and because my package is a pipeline I really want to make sure that >> it?s output is consistent, hence I have strict requirement with regards to >> the gff/gtf format I accept. Which means that times and again, I have to do >> slight adjustment but I prefer that over outputting garbage. >> >> 4) RNA-Seq analyses are filled with pitfalls, hence I think it is >> essential that users understand the data formats they handle and actually >> what these analyses are all about. I don?t want them to use my package as >> they would use a black box. >> >> 5) It?s educational. There?s a vignette that describes how to parse and >> convert gff/gtf annotation in the minimal gff/gtf formatted file that would >> suit my package >> >> Well, I suppose it?s more than you asked for, but here are my reasons ;-) >> You?re welcome to comment and I?d be happy to look again at rtracklayer >> (been through GenomicFeatures recently and I like it much) if you would >> advise me so. >> >> Have a nice WE, >> >> Cheers, >> >> Nico >> >> >> --------------------------------------------------------------- >> Nicolas Delhomme >> >> Genome Biology Computational Support >> >> European Molecular Biology Laboratory >> >> Tel: +49 6221 387 8310 >> Email: nicolas.delhomme at embl.de >> Meyerhofstrasse 1 - Postfach 10.2209 >> 69102 Heidelberg, Germany >> --------------------------------------------------------------- >> >> >> >> >> >> On 15 Nov 2013, at 12:44, Michael Lawrence <lawrence.michael at="" gene.com=""> >> wrote: >> >>> Why not use rtracklayer / GenomicFeatures for parsing GTF? That format >> is tough; no reason for everyone to take it on by themselves. >>> >>> >>> >>> >>> On Fri, Nov 15, 2013 at 2:40 AM, Nicolas Delhomme <delhomme at="" embl.de=""> >> wrote: >>> Hej Natalia! >>> >>> There were a number of lines in that particular gtf that violated the >> assumptions I had about EnsEMBL gtf. Not all the fields in the attributes' >> column were always set and one of the gene name had a space character in >> it. I?ve made the parsing of gtf file annotation more flexible/lenient and >> that should resolve that particular issue you had. The changes should >> propagate in ~2 days to Bioc with easyRNASeq version 1.8.2. >>> >>> Rather than using the geneModel, which implementation is old and has >> gotten slow because of changes in the underlying architecture, I prefer an >> approach where I >>> 1) filter the gtf / gff annotation file for only those lines I?m >> interested in (e.g. of type exon, mRNA and gene for a gff file) >>> 2) collapse every exon of a gene into what I call now a ?synthetic >> transcript?. The reason for changing the naming from geneModel to synthetic >> transcript is that ?gene model? has different meaning depending on the >> field of application. >>> 3) use easyRNASeq with the modified annotation and the ?transcript? >> count method. >>> >>> I?ve detailed that procedure in the vignette section 7.1 . >>> >>> Cheers, >>> >>> Nico >>> >>> --------------------------------------------------------------- >>> Nicolas Delhomme >>> >>> Genome Biology Computational Support >>> >>> European Molecular Biology Laboratory >>> >>> Tel: +49 6221 387 8310 >>> Email: nicolas.delhomme at embl.de >>> Meyerhofstrasse 1 - Postfach 10.2209 >>> 69102 Heidelberg, Germany >>> --------------------------------------------------------------- >>> >>> >>> >>> >>> >>> On 12 Nov 2013, at 14:21, Nicolas Delhomme <delhomme at="" embl.de=""> wrote: >>> >>>> Hej Natalia! >>>> >>>> This is not the first time that I?ve seen this error on the list, but >> I?ve not been able to reproduce it so far with my own data. Would you mind >> sharing some data offline, just an excerpt of your files would do. If >> that?s OK, I can create and give access to a folder on my box account. >>>> >>>> I had already relaxed the constraint on parsing a gtf file in a >> previous update but forgot to reflect the changes in the error message. >> Only the gene_id and transcript_id are actually mandatory. I would not >> expect any issue with the EnsEMBL gtf file, but I?ll have a look at why it >> fails for Gallus galls one and let you know asap. >>>> >>>> This as nothing to do with this error, but by looking at your command >> line, I saw that you provide a character vector to the chrSize argument. >> This is not necessary as this information is extracted from the bam file in >> your case, so you can just drop the chrSizes = ?chrSizes? from your command >> line. I?ve added some extra check in the method to detect this now. Thanks >> :-) >>>> >>>> Cheers, >>>> >>>> Nico >>>> >>>> --------------------------------------------------------------- >>>> Nicolas Delhomme >>>> >>>> Genome Biology Computational Support >>>> >>>> European Molecular Biology Laboratory >>>> >>>> Tel: +49 6221 387 8310 >>>> Email: nicolas.delhomme at embl.de >>>> Meyerhofstrasse 1 - Postfach 10.2209 >>>> 69102 Heidelberg, Germany >>>> --------------------------------------------------------------- >>>> >>>> >>>> >>>> >>>> >>>> On 11 Nov 2013, at 16:31, Natalia [guest] <guest at="" bioconductor.org=""> >> wrote: >>>> >>>>> >>>>> Dear all, >>>>> I would like to make a count table to use it in DESeq. I?ve tried to >> use easyRNAseq but I have a problem with the annotation file. I?ve >> downloaded the file Gallus_gallus.Galgal4.73.gtf from Ensembl. As I run >> into the problem Error in .doBasicCount(obj) : The genomicAnnotation slot >> is empty, I modified the file and added chr before the chromosome number. >> The next problem was this: >>>>> >>>>> Your gtf file: Gallus_gallus.Galgal4.73.gtf does not contain all the >> required fields: gene_id, transcript_id, exon_number, gene_name. >>>>> >>>>> To solve this problem: >>>>> - I deleted all the entries without gene_name (first example): >>>>> >>>>> gene_id "ENSGALG00000009771"; transcript_id "ENSGALT00000015891"; >> exon_number "1"; gene_biotype "protein_coding"; exon_id >> "ENSGALE00000301221"; >>>>> >>>>> gene_id "ENSGALG00000009783"; transcript_id "ENSGALT00000015914"; >> exon_number "2"; gene_name "GOLGB1"; gene_biotype "protein_coding"; >> transcript_name "GOLGB1-201"; exon_id "ENSGALE00000105891"; >>>>> >>>>> - I checked the chromosome numbers and deleted the entries that >> didn?t match any chromosome from BSgenome.Ggallus.UCSC.galGal4 (I can?t >> find any entry corresponding to chr32 in the Gallus_gallus.Galgal4.73.gtf >> file, I don?t know if it is a problem): >>>>> >>>>> - I searched for semicolons and single quotes ? in the gene names, >> but I didn?t find any on the final file. >>>>> >>>>> - I deleted all the columns after gene_name. >>>>> >>>>> So finally the annotation file entries look like this: >>>>> chr1 protein_coding exon 19962541 19963992 >>>>> . + . gene_id "ENSGALG00000000003"; >> transcript_id "ENSGALT00000000003"; exon_number "2"; gene_name "PANX2"; >>>>> >>>>> Nothing works; the error message is always the same. So, I don?t know >> what else I can do. Could you please help me? >>>>> Thank you in advance! >>>>> Cheers >>>>> >>>>> Natalia >>>>> >>>>> >>>>> here is my code: >>>>>> count.table <- easyRNASeq("/RNAseqGallus", organism="Ggallus", >> chrSizes="chrSizes", annotationMethod="gtf", >> annotationFile="Gallus_gallus.Galgal4.73.gtf", count="genes", >> summarization="geneModels", format="bam", gapped=TRUE, >> filenames=c("NS1gallus.bam","NS2gallus.bam"), outputFormat="DESeq", >> conditions=conditions) >>>>> Checking arguments... >>>>> Fetching annotations... >>>>> Read 334620 records >>>>> Error en .getGtfRange(organismName(obj), filename = filename, >> ignoreWarnings = ignoreWarnings, : >>>>> Your gtf file: Gallus_gallus.Galgal4.73.gtf does not contain all the >> required fields: gene_id, transcript_id, exon_number, gene_name. >>>>> Adem?s: Mensajes de aviso perdidos >>>>> 1: In easyRNASeq("/RNAseqGallus", organism = "Ggallus", chrSizes = >> "chrSizes", : >>>>> Your organism has no mapping defined to perform the validity check >> for the UCSC compliance of the chromosome name. >>>>> Defined organism's mapping can be listed using the 'knownOrganisms' >> function. >>>>> To benefit from the validity check, you can provide a 'chr.map' to >> your 'easyRNASeq' function call. >>>>> As you did not do so, 'validity.check' is turned off >>>>> 2: In .Method(..., deparse.level = deparse.level) : >>>>> number of columns of result is not a multiple of vector length (arg 1) >>>>> >>>>>> traceback() >>>>> 6: stop(paste("Your gtf file: ", filename, " does not contain all the >> required fields: ", >>>>> paste(fields, collapse = ", "), ".", sep = "")) >>>>> 5: .getGtfRange(organismName(obj), filename = filename, >> ignoreWarnings = ignoreWarnings, >>>>> ...) >>>>> 4: fetchAnnotation(obj, method = annotationMethod, filename = >> annotationFile, >>>>> ignoreWarnings = ignoreWarnings, ...) >>>>> 3: fetchAnnotation(obj, method = annotationMethod, filename = >> annotationFile, >>>>> ignoreWarnings = ignoreWarnings, ...) >>>>> 2: easyRNASeq("/RNAseqGallus", organism = "Ggallus", chrSizes = >> "chrSizes", >>>>> annotationMethod = "gtf", annotationFile = " >> Gallus_gallus.Galgal4.73.gtf ", >>>>> count = "genes", summarization = "geneModels", format = "bam", >>>>> gapped = TRUE, filenames = c("NS1gallus.bam", "NS2gallus.bam"), >>>>> outputFormat = "DESeq", conditions = conditions) >>>>> 1: easyRNASeq("/RNAseqGallus", organism = "Ggallus", chrSizes = >> "chrSizes", >>>>> annotationMethod = "gtf", annotationFile = " >> Gallus_gallus.Galgal4.73.gtf ", >>>>> count = "genes", summarization = "geneModels", format = "bam", >>>>> gapped = TRUE, filenames = c("NS1gallus.bam", "NS2gallus.bam"), >>>>> outputFormat = "DESeq", conditions = conditions) >>>>> >>>>> >>>>> -- output of sessionInfo(): >>>>> >>>>>> sessionInfo() >>>>> R version 3.0.2 (2013-09-25) >>>>> Platform: x86_64-w64-mingw32/x64 (64-bit) >>>>> >>>>> locale: >>>>> [1] LC_COLLATE=Spanish_Spain.1252 LC_CTYPE=Spanish_Spain.1252 >> LC_MONETARY=Spanish_Spain.1252 LC_NUMERIC=C >> LC_TIME=Spanish_Spain.1252 >>>>> >>>>> attached base packages: >>>>> [1] parallel stats graphics grDevices utils datasets >> methods base >>>>> >>>>> other attached packages: >>>>> [1] BSgenome.Ggallus.UCSC.galGal4_1.3.18 BSgenome_1.30.0 >> easyRNASeq_1.8.1 ShortRead_1.20.0 >>>>> [5] Rsamtools_1.14.1 GenomicRanges_1.14.3 >> DESeq_1.14.0 lattice_0.20-23 >>>>> [9] locfit_1.5-9.1 Biostrings_2.30.0 >> XVector_0.2.0 IRanges_1.20.4 >>>>> [13] edgeR_3.4.0 limma_3.18.2 >> biomaRt_2.18.0 Biobase_2.22.0 >>>>> [17] genomeIntervals_1.18.0 BiocGenerics_0.8.0 >> intervals_0.14.0 BiocInstaller_1.12.0 >>>>> >>>>> loaded via a namespace (and not attached): >>>>> [1] annotate_1.40.0 AnnotationDbi_1.24.0 bitops_1.0-6 >> DBI_0.2-7 genefilter_1.44.0 geneplotter_1.40.0 grid_3.0.2 >>>>> [8] hwriter_1.3 latticeExtra_0.6-26 LSD_2.5 >> RColorBrewer_1.0-5 RCurl_1.95-4.1 RSQLite_0.11.4 >> splines_3.0.2 >>>>> [15] stats4_3.0.2 survival_2.37-4 tools_3.0.2 >> XML_3.98-1.1 xtable_1.7-1 zlibbioc_1.8.0 >>>>> >>>>> >>>>> -- >>>>> Sent via the guest posting facility at bioconductor.org. >>>> >>>> _______________________________________________ >>>> Bioconductor mailing list >>>> Bioconductor at r-project.org >>>> https://stat.ethz.ch/mailman/listinfo/bioconductor >>>> Search the archives: >> http://news.gmane.org/gmane.science.biology.informatics.conductor >>> >>> _______________________________________________ >>> Bioconductor mailing list >>> Bioconductor at r-project.org >>> https://stat.ethz.ch/mailman/listinfo/bioconductor >>> Search the archives: >> http://news.gmane.org/gmane.science.biology.informatics.conductor >>> >> >> > > [[alternative HTML version deleted]] > > > _______________________________________________ > Bioconductor mailing list > Bioconductor at r-project.org > https://stat.ethz.ch/mailman/listinfo/bioconductor > Search the archives: http://news.gmane.org/gmane.science.biology.informatics.conductor > [[alternative HTML version deleted]] > > _______________________________________________ > Bioconductor mailing list > Bioconductor at r-project.org > https://stat.ethz.ch/mailman/listinfo/bioconductor > Search the archives: http://news.gmane.org/gmane.science.biology.informatics.conductor

ADD REPLY • link 12.2 years ago Nicolas Delhomme ▴ 320

Login before adding your answer.