Search
Question: MakeTranscriptDbFromGFF
0
gravatar for Ugo Borello
4.6 years ago by
Ugo Borello340
France
Ugo Borello340 wrote:
Good morning, I have a little problem creating a TranscriptDb object using the function makeTranscriptDbFromGFF. I want to use this annotation to count the overlaps of my genomic alignments with genes. I ran: > txdb <-makeTranscriptDbFromGFF(file = annFile, format = "gtf", + exonRankAttributeName = "exon_number", + chrominfo = chrominfo, + dataSource = "UCSC", + species = "Mus musculus") And I got this error message: Error in data.frame(..., check.names = FALSE) : arguments imply differing number of rows: 541775, 0 "chrominfo" was (info retrieved from the fasta genome file): > chrominfo chrom length is_circular 1 chr10 129993255 FALSE 2 chr11 121843856 FALSE 3 chr12 121257530 FALSE 4 chr13 120284312 FALSE 5 chr14 125194864 FALSE 6 chr15 103494974 FALSE 7 chr16 98319150 FALSE 8 chr17 95272651 FALSE 9 chr18 90772031 FALSE 10 chr19 61342430 FALSE 11 chr1 197195432 FALSE 12 chr2 181748087 FALSE 13 chr3 159599783 FALSE 14 chr4 155630120 FALSE 15 chr5 152537259 FALSE 16 chr6 149517037 FALSE 17 chr7 152524553 FALSE 18 chr8 131738871 FALSE 19 chr9 124076172 FALSE 20 chrM 16299 TRUE 21 chrX 166650296 FALSE 22 chrY 15902555 FALSE I ran it again without the "exonRankAttributeName" argument and I got: > txdb <-makeTranscriptDbFromGFF(file = annFile, format = "gtf", + chrominfo = chrominfo, + dataSource = "UCSC", + species = "Mus musculus") extracting transcript information Estimating transcript ranges. Extracting gene IDs Processing splicing information for gtf file. Deducing exon rank from relative coordinates provided Prepare the 'metadata' data frame ... metadata: OK Error in .checkForeignKey(transcripts_tx_chrom, NA, "transcripts$tx_chrom", : all the values in 'transcripts$tx_chrom' must be present in 'chrominfo$chrom' In addition: Warning message: In .deduceExonRankings(exs, format = "gtf") : Infering Exon Rankings. If this is not what you expected, then please be sure that you have provided a valid attribute for exonRankAttributeName Without the "chrominfo" argument I got the same error message as the first time: > txdb <-makeTranscriptDbFromGFF(file = annFile, format = "gtf", + exonRankAttributeName = "exon_number", + dataSource = "UCSC", + species = "Mus musculus") Error in data.frame(..., check.names = FALSE) : arguments imply differing number of rows: 541775, 0 Finally when I eliminated both the "exonRankAttributeName" and the "chrominfo" arguments it worked but the warning reminded me of the "exonRankAttributeName" argument and the chromosome names are now different from the ones in the genome file and there is no info on their length > txdb <-makeTranscriptDbFromGFF(file = annFile, format = "gtf", + dataSource = "UCSC", + species = "Mus musculus") extracting transcript information Estimating transcript ranges. Extracting gene IDs Processing splicing information for gtf file. Deducing exon rank from relative coordinates provided Prepare the 'metadata' data frame ... metadata: OK Now generating chrominfo from available sequence names. No chromosome length information is available. Warning messages: 1: In .deduceExonRankings(exs, format = "gtf") : Infering Exon Rankings. If this is not what you expected, then please be sure that you have provided a valid attribute for exonRankAttributeName 2: In matchCircularity(chroms, circ_seqs) : None of the strings in your circ_seqs argument match your seqnames. > seqinfo(txdb) Seqinfo of length 32 seqnames seqlengths isCircular genome chr13 <na> FALSE <na> chr9 <na> FALSE <na> chr6 <na> FALSE <na> chrX <na> FALSE <na> chr17 <na> FALSE <na> chr2 <na> FALSE <na> chr7 <na> FALSE <na> chr18 <na> FALSE <na> chr8 <na> FALSE <na> ... ... ... ... chrY_random <na> FALSE <na> chrX_random <na> FALSE <na> chr5_random <na> FALSE <na> chr4_random <na> FALSE <na> chrY <na> FALSE <na> chr7_random <na> FALSE <na> chr17_random <na> FALSE <na> chr13_random <na> FALSE <na> chr1_random <na> FALSE <na> What am I doing wrong in my original call to makeTranscriptDbFromGFF? txdb <-makeTranscriptDbFromGFF(file = annFile, format = "gtf", exonRankAttributeName = "exon_number", chrominfo = chrominfo, dataSource = "UCSC", species = "Mus musculus") Why am I getting this unfair error message? Thank you for your help Ugo FYI, this is my annFile (My gtf annotation file was downloaded together with a fasta file containing the mouse genome from UCSC): > annFile GRanges with 595632 ranges and 9 metadata columns: seqnames ranges strand | source type score phase <rle> <iranges> <rle> | <factor> <factor> <numeric> <integer> [1] chr1 [3204563, 3207049] - | unknown exon <na> <na> [2] chr1 [3206103, 3206105] - | unknown stop_codon <na> <na> [3] chr1 [3206106, 3207049] - | unknown CDS <na> 2 [4] chr1 [3411783, 3411982] - | unknown CDS <na> 1 [5] chr1 [3411783, 3411982] - | unknown exon <na> <na> ... ... ... ... ... ... ... ... ... [595628] chrY_random [54422360, 54422362] + | unknown stop_codon <na> <na> [595629] chrY_random [58501955, 58502946] + | unknown exon <na> <na> [595630] chrY_random [58502132, 58502812] + | unknown CDS <na> 0 [595631] chrY_random [58502132, 58502134] + | unknown start_codon <na> <na> [595632] chrY_random [58502813, 58502815] + | unknown stop_codon <na> <na> gene_id transcript_id gene_name p_id tss_id <character> <character> <character> <character> <character> [1] Xkr4 NM_001011874 Xkr4 P2739 TSS1881 [2] Xkr4 NM_001011874 Xkr4 P2739 TSS1881 [3] Xkr4 NM_001011874 Xkr4 P2739 TSS1881 [4] Xkr4 NM_001011874 Xkr4 P2739 TSS1881 [5] Xkr4 NM_001011874 Xkr4 P2739 TSS1881 ... ... ... ... ... ... [595628] LOC100039753 NM_001017394 LOC100039753 P10196 TSS19491 [595629] LOC100039614 NM_001160137_4 LOC100039614 P22060 TSS4342 [595630] LOC100039614 NM_001160137_4 LOC100039614 P22060 TSS4342 [595631] LOC100039614 NM_001160137_4 LOC100039614 P22060 TSS4342 [595632] LOC100039614 NM_001160137_4 LOC100039614 P22060 TSS4342 --- seqlengths: chr1 chr10 chr11 chr12 ... chrX_random chrY chrY_random NA NA NA NA ... NA NA NA > sessionInfo() R version 3.0.0 (2013-04-03) Platform: x86_64-apple-darwin10.8.0 (64-bit) locale: [1] en_US.UTF-8/en_US.UTF-8/en_US.UTF-8/C/en_US.UTF-8/en_US.UTF-8 attached base packages: [1] parallel stats graphics grDevices utils datasets methods base other attached packages: [1] rtracklayer_1.20.1 Rbowtie_1.0.2 Rsamtools_1.12.2 Biostrings_2.28.0 [5] GenomicFeatures_1.12.0 AnnotationDbi_1.22.1 Biobase_2.20.0 GenomicRanges_1.12.2 [9] IRanges_1.18.0 BiocGenerics_0.6.0 loaded via a namespace (and not attached): [1] BiocInstaller_1.10.0 biomaRt_2.16.0 bitops_1.0-5 BSgenome_1.28.0 [5] DBI_0.2-5 grid_3.0.0 hwriter_1.3 lattice_0.20-15 [9] RCurl_1.95-4.1 RSQLite_0.11.3 ShortRead_1.18.0 stats4_3.0.0 [13] tools_3.0.0 XML_3.95-0.2 zlibbioc_1.6.0
ADD COMMENTlink modified 4.6 years ago by Marc Carlson7.2k • written 4.6 years ago by Ugo Borello340
0
gravatar for Marc Carlson
4.6 years ago by
Marc Carlson7.2k
United States
Marc Carlson7.2k wrote:
Hi Ugo, Which UCSC file was it that you were trying to process? Marc On 05/01/2013 02:21 AM, Ugo Borello wrote: > Good morning, > > I have a little problem creating a TranscriptDb object using the function > makeTranscriptDbFromGFF. I want to use this annotation to count the overlaps > of my genomic alignments with genes. > > > I ran: > >> txdb <-makeTranscriptDbFromGFF(file = annFile, format = "gtf", > + exonRankAttributeName = "exon_number", > + chrominfo = chrominfo, > + dataSource = "UCSC", > + species = "Mus musculus") > > And I got this error message: > Error in data.frame(..., check.names = FALSE) : > arguments imply differing number of rows: 541775, 0 > > > "chrominfo" was (info retrieved from the fasta genome file): > >> chrominfo > chrom length is_circular > 1 chr10 129993255 FALSE > 2 chr11 121843856 FALSE > 3 chr12 121257530 FALSE > 4 chr13 120284312 FALSE > 5 chr14 125194864 FALSE > 6 chr15 103494974 FALSE > 7 chr16 98319150 FALSE > 8 chr17 95272651 FALSE > 9 chr18 90772031 FALSE > 10 chr19 61342430 FALSE > 11 chr1 197195432 FALSE > 12 chr2 181748087 FALSE > 13 chr3 159599783 FALSE > 14 chr4 155630120 FALSE > 15 chr5 152537259 FALSE > 16 chr6 149517037 FALSE > 17 chr7 152524553 FALSE > 18 chr8 131738871 FALSE > 19 chr9 124076172 FALSE > 20 chrM 16299 TRUE > 21 chrX 166650296 FALSE > 22 chrY 15902555 FALSE > > > I ran it again without the "exonRankAttributeName" argument and I got: > >> txdb <-makeTranscriptDbFromGFF(file = annFile, format = "gtf", > + chrominfo = chrominfo, > + dataSource = "UCSC", > + species = "Mus musculus") > extracting transcript information > Estimating transcript ranges. > Extracting gene IDs > Processing splicing information for gtf file. > Deducing exon rank from relative coordinates provided > Prepare the 'metadata' data frame ... metadata: OK > Error in .checkForeignKey(transcripts_tx_chrom, NA, "transcripts$tx_chrom", > : > all the values in 'transcripts$tx_chrom' must be present in > 'chrominfo$chrom' > In addition: Warning message: > In .deduceExonRankings(exs, format = "gtf") : > Infering Exon Rankings. If this is not what you expected, then please be > sure that you have provided a valid attribute for exonRankAttributeName > > > Without the "chrominfo" argument I got the same error message as the first > time: > >> txdb <-makeTranscriptDbFromGFF(file = annFile, format = "gtf", > + exonRankAttributeName = "exon_number", > + dataSource = "UCSC", > + species = "Mus musculus") > > Error in data.frame(..., check.names = FALSE) : > arguments imply differing number of rows: 541775, 0 > > > Finally when I eliminated both the "exonRankAttributeName" and the > "chrominfo" arguments it worked but the warning reminded me of the > "exonRankAttributeName" argument and the chromosome names are now different > from the ones in the genome file and there is no info on their length > >> txdb <-makeTranscriptDbFromGFF(file = annFile, format = "gtf", > + dataSource = "UCSC", > + species = "Mus musculus") > extracting transcript information > Estimating transcript ranges. > Extracting gene IDs > Processing splicing information for gtf file. > Deducing exon rank from relative coordinates provided > Prepare the 'metadata' data frame ... metadata: OK > Now generating chrominfo from available sequence names. No chromosome length > information is available. > Warning messages: > 1: In .deduceExonRankings(exs, format = "gtf") : > Infering Exon Rankings. If this is not what you expected, then please be > sure that you have provided a valid attribute for exonRankAttributeName > 2: In matchCircularity(chroms, circ_seqs) : > None of the strings in your circ_seqs argument match your seqnames. > >> seqinfo(txdb) > Seqinfo of length 32 > seqnames seqlengths isCircular genome > chr13 <na> FALSE <na> > chr9 <na> FALSE <na> > chr6 <na> FALSE <na> > chrX <na> FALSE <na> > chr17 <na> FALSE <na> > chr2 <na> FALSE <na> > chr7 <na> FALSE <na> > chr18 <na> FALSE <na> > chr8 <na> FALSE <na> > ... ... ... ... > chrY_random <na> FALSE <na> > chrX_random <na> FALSE <na> > chr5_random <na> FALSE <na> > chr4_random <na> FALSE <na> > chrY <na> FALSE <na> > chr7_random <na> FALSE <na> > chr17_random <na> FALSE <na> > chr13_random <na> FALSE <na> > chr1_random <na> FALSE <na> > > > > > What am I doing wrong in my original call to makeTranscriptDbFromGFF? > > txdb <-makeTranscriptDbFromGFF(file = annFile, format = "gtf", > exonRankAttributeName = "exon_number", > chrominfo = chrominfo, > dataSource = "UCSC", > species = "Mus musculus") > > Why am I getting this unfair error message? > Thank you for your help > Ugo > > > > > > > > > FYI, this is my annFile (My gtf annotation file was downloaded together > with a fasta file containing the mouse genome from UCSC): > >> annFile > GRanges with 595632 ranges and 9 metadata columns: > seqnames ranges strand | source type > score phase > <rle> <iranges> <rle> | <factor> <factor> > <numeric> <integer> > [1] chr1 [3204563, 3207049] - | unknown exon > <na> <na> > [2] chr1 [3206103, 3206105] - | unknown stop_codon > <na> <na> > [3] chr1 [3206106, 3207049] - | unknown CDS > <na> 2 > [4] chr1 [3411783, 3411982] - | unknown CDS > <na> 1 > [5] chr1 [3411783, 3411982] - | unknown exon > <na> <na> > ... ... ... ... ... ... ... > ... ... > [595628] chrY_random [54422360, 54422362] + | unknown stop_codon > <na> <na> > [595629] chrY_random [58501955, 58502946] + | unknown exon > <na> <na> > [595630] chrY_random [58502132, 58502812] + | unknown CDS > <na> 0 > [595631] chrY_random [58502132, 58502134] + | unknown start_codon > <na> <na> > [595632] chrY_random [58502813, 58502815] + | unknown stop_codon > <na> <na> > gene_id transcript_id gene_name p_id tss_id > <character> <character> <character> <character> <character> > [1] Xkr4 NM_001011874 Xkr4 P2739 TSS1881 > [2] Xkr4 NM_001011874 Xkr4 P2739 TSS1881 > [3] Xkr4 NM_001011874 Xkr4 P2739 TSS1881 > [4] Xkr4 NM_001011874 Xkr4 P2739 TSS1881 > [5] Xkr4 NM_001011874 Xkr4 P2739 TSS1881 > ... ... ... ... ... ... > [595628] LOC100039753 NM_001017394 LOC100039753 P10196 TSS19491 > [595629] LOC100039614 NM_001160137_4 LOC100039614 P22060 TSS4342 > [595630] LOC100039614 NM_001160137_4 LOC100039614 P22060 TSS4342 > [595631] LOC100039614 NM_001160137_4 LOC100039614 P22060 TSS4342 > [595632] LOC100039614 NM_001160137_4 LOC100039614 P22060 TSS4342 > --- > seqlengths: > chr1 chr10 chr11 chr12 ... chrX_random > chrY chrY_random > NA NA NA NA ... NA > NA NA > > >> sessionInfo() > R version 3.0.0 (2013-04-03) > Platform: x86_64-apple-darwin10.8.0 (64-bit) > > locale: > [1] en_US.UTF-8/en_US.UTF-8/en_US.UTF-8/C/en_US.UTF-8/en_US.UTF-8 > > attached base packages: > [1] parallel stats graphics grDevices utils datasets methods > base > > other attached packages: > [1] rtracklayer_1.20.1 Rbowtie_1.0.2 Rsamtools_1.12.2 > Biostrings_2.28.0 > [5] GenomicFeatures_1.12.0 AnnotationDbi_1.22.1 Biobase_2.20.0 > GenomicRanges_1.12.2 > [9] IRanges_1.18.0 BiocGenerics_0.6.0 > > loaded via a namespace (and not attached): > [1] BiocInstaller_1.10.0 biomaRt_2.16.0 bitops_1.0-5 > BSgenome_1.28.0 > [5] DBI_0.2-5 grid_3.0.0 hwriter_1.3 > lattice_0.20-15 > [9] RCurl_1.95-4.1 RSQLite_0.11.3 ShortRead_1.18.0 > stats4_3.0.0 > [13] tools_3.0.0 XML_3.95-0.2 zlibbioc_1.6.0 > > _______________________________________________ > Bioconductor mailing list > Bioconductor at r-project.org > https://stat.ethz.ch/mailman/listinfo/bioconductor > Search the archives: http://news.gmane.org/gmane.science.biology.informatics.conductor
ADD COMMENTlink written 4.6 years ago by Marc Carlson7.2k
Dear Marc Sorry I was not precise on the origin of the gtf annotation file; I got the gtf file from here: http://tophat.cbcb.umd.edu/igenomes.shtml And more precisely from the Mus musculus/UCSC/mm9 folder Here the description of the content of the folder: ftp://igenome:G3nom3s4u at ussd-ftp.illumina.com/README.txt I realized reading the README.txt file that actually the genes.gtf file I used is the Ensembl annotation of the mm9 release. So, I changed dataSource = "Ensembl" in the function call and I got the same error message: Error in data.frame(..., check.names = FALSE) : arguments imply differing number of rows: 541775, 0 At the end of my previous email you have the result of calling: annFile<- import.gff('genes.gtf', format='gtf', asRangedData=FALSE) Thank you Ugo > From: Marc Carlson <mcarlson at="" fhcrc.org=""> > Date: Wed, 01 May 2013 15:33:02 -0700 > To: <bioconductor at="" r-project.org=""> > Subject: Re: [BioC] MakeTranscriptDbFromGFF > > Hi Ugo, > > Which UCSC file was it that you were trying to process? > > > Marc > > > > On 05/01/2013 02:21 AM, Ugo Borello wrote: >> Good morning, >> >> I have a little problem creating a TranscriptDb object using the function >> makeTranscriptDbFromGFF. I want to use this annotation to count the overlaps >> of my genomic alignments with genes. >> >> >> I ran: >> >>> txdb <-makeTranscriptDbFromGFF(file = annFile, format = "gtf", >> + exonRankAttributeName = "exon_number", >> + chrominfo = chrominfo, >> + dataSource = "UCSC", >> + species = "Mus musculus") >> >> And I got this error message: >> Error in data.frame(..., check.names = FALSE) : >> arguments imply differing number of rows: 541775, 0 >> >> >> "chrominfo" was (info retrieved from the fasta genome file): >> >>> chrominfo >> chrom length is_circular >> 1 chr10 129993255 FALSE >> 2 chr11 121843856 FALSE >> 3 chr12 121257530 FALSE >> 4 chr13 120284312 FALSE >> 5 chr14 125194864 FALSE >> 6 chr15 103494974 FALSE >> 7 chr16 98319150 FALSE >> 8 chr17 95272651 FALSE >> 9 chr18 90772031 FALSE >> 10 chr19 61342430 FALSE >> 11 chr1 197195432 FALSE >> 12 chr2 181748087 FALSE >> 13 chr3 159599783 FALSE >> 14 chr4 155630120 FALSE >> 15 chr5 152537259 FALSE >> 16 chr6 149517037 FALSE >> 17 chr7 152524553 FALSE >> 18 chr8 131738871 FALSE >> 19 chr9 124076172 FALSE >> 20 chrM 16299 TRUE >> 21 chrX 166650296 FALSE >> 22 chrY 15902555 FALSE >> >> >> I ran it again without the "exonRankAttributeName" argument and I got: >> >>> txdb <-makeTranscriptDbFromGFF(file = annFile, format = "gtf", >> + chrominfo = chrominfo, >> + dataSource = "UCSC", >> + species = "Mus musculus") >> extracting transcript information >> Estimating transcript ranges. >> Extracting gene IDs >> Processing splicing information for gtf file. >> Deducing exon rank from relative coordinates provided >> Prepare the 'metadata' data frame ... metadata: OK >> Error in .checkForeignKey(transcripts_tx_chrom, NA, "transcripts$tx_chrom", >> : >> all the values in 'transcripts$tx_chrom' must be present in >> 'chrominfo$chrom' >> In addition: Warning message: >> In .deduceExonRankings(exs, format = "gtf") : >> Infering Exon Rankings. If this is not what you expected, then please be >> sure that you have provided a valid attribute for exonRankAttributeName >> >> >> Without the "chrominfo" argument I got the same error message as the first >> time: >> >>> txdb <-makeTranscriptDbFromGFF(file = annFile, format = "gtf", >> + exonRankAttributeName = "exon_number", >> + dataSource = "UCSC", >> + species = "Mus musculus") >> >> Error in data.frame(..., check.names = FALSE) : >> arguments imply differing number of rows: 541775, 0 >> >> >> Finally when I eliminated both the "exonRankAttributeName" and the >> "chrominfo" arguments it worked but the warning reminded me of the >> "exonRankAttributeName" argument and the chromosome names are now different >> from the ones in the genome file and there is no info on their length >> >>> txdb <-makeTranscriptDbFromGFF(file = annFile, format = "gtf", >> + dataSource = "UCSC", >> + species = "Mus musculus") >> extracting transcript information >> Estimating transcript ranges. >> Extracting gene IDs >> Processing splicing information for gtf file. >> Deducing exon rank from relative coordinates provided >> Prepare the 'metadata' data frame ... metadata: OK >> Now generating chrominfo from available sequence names. No chromosome length >> information is available. >> Warning messages: >> 1: In .deduceExonRankings(exs, format = "gtf") : >> Infering Exon Rankings. If this is not what you expected, then please be >> sure that you have provided a valid attribute for exonRankAttributeName >> 2: In matchCircularity(chroms, circ_seqs) : >> None of the strings in your circ_seqs argument match your seqnames. >> >>> seqinfo(txdb) >> Seqinfo of length 32 >> seqnames seqlengths isCircular genome >> chr13 <na> FALSE <na> >> chr9 <na> FALSE <na> >> chr6 <na> FALSE <na> >> chrX <na> FALSE <na> >> chr17 <na> FALSE <na> >> chr2 <na> FALSE <na> >> chr7 <na> FALSE <na> >> chr18 <na> FALSE <na> >> chr8 <na> FALSE <na> >> ... ... ... ... >> chrY_random <na> FALSE <na> >> chrX_random <na> FALSE <na> >> chr5_random <na> FALSE <na> >> chr4_random <na> FALSE <na> >> chrY <na> FALSE <na> >> chr7_random <na> FALSE <na> >> chr17_random <na> FALSE <na> >> chr13_random <na> FALSE <na> >> chr1_random <na> FALSE <na> >> >> >> >> >> What am I doing wrong in my original call to makeTranscriptDbFromGFF? >> >> txdb <-makeTranscriptDbFromGFF(file = annFile, format = "gtf", >> exonRankAttributeName = "exon_number", >> chrominfo = chrominfo, >> dataSource = "UCSC", >> species = "Mus musculus") >> >> Why am I getting this unfair error message? >> Thank you for your help >> Ugo >> >> >> >> >> >> >> >> >> FYI, this is my annFile (My gtf annotation file was downloaded together >> with a fasta file containing the mouse genome from UCSC): >> >>> annFile >> GRanges with 595632 ranges and 9 metadata columns: >> seqnames ranges strand | source type >> score phase >> <rle> <iranges> <rle> | <factor> <factor> >> <numeric> <integer> >> [1] chr1 [3204563, 3207049] - | unknown exon >> <na> <na> >> [2] chr1 [3206103, 3206105] - | unknown stop_codon >> <na> <na> >> [3] chr1 [3206106, 3207049] - | unknown CDS >> <na> 2 >> [4] chr1 [3411783, 3411982] - | unknown CDS >> <na> 1 >> [5] chr1 [3411783, 3411982] - | unknown exon >> <na> <na> >> ... ... ... ... ... ... ... >> ... ... >> [595628] chrY_random [54422360, 54422362] + | unknown stop_codon >> <na> <na> >> [595629] chrY_random [58501955, 58502946] + | unknown exon >> <na> <na> >> [595630] chrY_random [58502132, 58502812] + | unknown CDS >> <na> 0 >> [595631] chrY_random [58502132, 58502134] + | unknown start_codon >> <na> <na> >> [595632] chrY_random [58502813, 58502815] + | unknown stop_codon >> <na> <na> >> gene_id transcript_id gene_name p_id tss_id >> <character> <character> <character> <character> <character> >> [1] Xkr4 NM_001011874 Xkr4 P2739 TSS1881 >> [2] Xkr4 NM_001011874 Xkr4 P2739 TSS1881 >> [3] Xkr4 NM_001011874 Xkr4 P2739 TSS1881 >> [4] Xkr4 NM_001011874 Xkr4 P2739 TSS1881 >> [5] Xkr4 NM_001011874 Xkr4 P2739 TSS1881 >> ... ... ... ... ... ... >> [595628] LOC100039753 NM_001017394 LOC100039753 P10196 TSS19491 >> [595629] LOC100039614 NM_001160137_4 LOC100039614 P22060 TSS4342 >> [595630] LOC100039614 NM_001160137_4 LOC100039614 P22060 TSS4342 >> [595631] LOC100039614 NM_001160137_4 LOC100039614 P22060 TSS4342 >> [595632] LOC100039614 NM_001160137_4 LOC100039614 P22060 TSS4342 >> --- >> seqlengths: >> chr1 chr10 chr11 chr12 ... chrX_random >> chrY chrY_random >> NA NA NA NA ... NA >> NA NA >> >> >>> sessionInfo() >> R version 3.0.0 (2013-04-03) >> Platform: x86_64-apple-darwin10.8.0 (64-bit) >> >> locale: >> [1] en_US.UTF-8/en_US.UTF-8/en_US.UTF-8/C/en_US.UTF-8/en_US.UTF-8 >> >> attached base packages: >> [1] parallel stats graphics grDevices utils datasets methods >> base >> >> other attached packages: >> [1] rtracklayer_1.20.1 Rbowtie_1.0.2 Rsamtools_1.12.2 >> Biostrings_2.28.0 >> [5] GenomicFeatures_1.12.0 AnnotationDbi_1.22.1 Biobase_2.20.0 >> GenomicRanges_1.12.2 >> [9] IRanges_1.18.0 BiocGenerics_0.6.0 >> >> loaded via a namespace (and not attached): >> [1] BiocInstaller_1.10.0 biomaRt_2.16.0 bitops_1.0-5 >> BSgenome_1.28.0 >> [5] DBI_0.2-5 grid_3.0.0 hwriter_1.3 >> lattice_0.20-15 >> [9] RCurl_1.95-4.1 RSQLite_0.11.3 ShortRead_1.18.0 >> stats4_3.0.0 >> [13] tools_3.0.0 XML_3.95-0.2 zlibbioc_1.6.0 >> >> _______________________________________________ >> Bioconductor mailing list >> Bioconductor at r-project.org >> https://stat.ethz.ch/mailman/listinfo/bioconductor >> Search the archives: >> http://news.gmane.org/gmane.science.biology.informatics.conductor > > _______________________________________________ > Bioconductor mailing list > Bioconductor at r-project.org > https://stat.ethz.ch/mailman/listinfo/bioconductor > Search the archives: > http://news.gmane.org/gmane.science.biology.informatics.conductor
ADD REPLYlink written 4.6 years ago by Ugo Borello340
Hi Ugo, The 15 GB tarball you sent me to contains several different GTF files for genes. I grabbed this one as it seemed to be the most recent: Mus_musculus/UCSC/mm9/Annotation/Archives/archive-2013-03-06-15-01-24/ Genes/genes.gtf So looking at this file I can reproduce the problem you mentioned. And it shows me two problems. The 1st problem is that the only field that seems to contain any information about exon positions is called phase (and not "exon_number" as was in the code I see from before). There is not actually any field called "exon_number" in this file. Either way, one thing you can check is to make sure that the string you give here is the same as the appropriate field name that is used by the file. There is no way to know this information in advance since GTF does not specify how to encode this information (and in fact the information is entirely optional). The second problem is that even "phase" can't work right now since the authors of this gtf file have decided to only associate the exon rank information only with CDS and never with exons features. So there is not any actual 'exon' position information in this file, only information for CDS positions. Now that I see people doing these files in this way, I plan to enhance the parser so that it can process files of this kind. Is there a reason why you wanted to use this file and not the data contained in this package here? http://www.bioconductor.org/packages/2.12/data/annotation/html/TxDb.Mm usculus.UCSC.mm9.knownGene.html Marc On 05/02/2013 02:59 AM, Ugo Borello wrote: > Dear Marc > Sorry I was not precise on the origin of the gtf annotation file; I got the > gtf file from here: > http://tophat.cbcb.umd.edu/igenomes.shtml > > And more precisely from the Mus musculus/UCSC/mm9 folder > Here the description of the content of the folder: > ftp://igenome:G3nom3s4u at ussd-ftp.illumina.com/README.txt > > I realized reading the README.txt file that actually the genes.gtf file I > used is the Ensembl annotation of the mm9 release. > So, I changed dataSource = "Ensembl" in the function call and I got the same > error message: > Error in data.frame(..., check.names = FALSE) : > arguments imply differing number of rows: 541775, 0 > > At the end of my previous email you have the result of calling: > annFile<- import.gff('genes.gtf', format='gtf', asRangedData=FALSE) > > > Thank you > > Ugo > >> From: Marc Carlson <mcarlson at="" fhcrc.org=""> >> Date: Wed, 01 May 2013 15:33:02 -0700 >> To: <bioconductor at="" r-project.org=""> >> Subject: Re: [BioC] MakeTranscriptDbFromGFF >> >> Hi Ugo, >> >> Which UCSC file was it that you were trying to process? >> >> >> Marc >> >> >> >> On 05/01/2013 02:21 AM, Ugo Borello wrote: >>> Good morning, >>> >>> I have a little problem creating a TranscriptDb object using the function >>> makeTranscriptDbFromGFF. I want to use this annotation to count the overlaps >>> of my genomic alignments with genes. >>> >>> >>> I ran: >>> >>>> txdb <-makeTranscriptDbFromGFF(file = annFile, format = "gtf", >>> + exonRankAttributeName = "exon_number", >>> + chrominfo = chrominfo, >>> + dataSource = "UCSC", >>> + species = "Mus musculus") >>> >>> And I got this error message: >>> Error in data.frame(..., check.names = FALSE) : >>> arguments imply differing number of rows: 541775, 0 >>> >>> >>> "chrominfo" was (info retrieved from the fasta genome file): >>> >>>> chrominfo >>> chrom length is_circular >>> 1 chr10 129993255 FALSE >>> 2 chr11 121843856 FALSE >>> 3 chr12 121257530 FALSE >>> 4 chr13 120284312 FALSE >>> 5 chr14 125194864 FALSE >>> 6 chr15 103494974 FALSE >>> 7 chr16 98319150 FALSE >>> 8 chr17 95272651 FALSE >>> 9 chr18 90772031 FALSE >>> 10 chr19 61342430 FALSE >>> 11 chr1 197195432 FALSE >>> 12 chr2 181748087 FALSE >>> 13 chr3 159599783 FALSE >>> 14 chr4 155630120 FALSE >>> 15 chr5 152537259 FALSE >>> 16 chr6 149517037 FALSE >>> 17 chr7 152524553 FALSE >>> 18 chr8 131738871 FALSE >>> 19 chr9 124076172 FALSE >>> 20 chrM 16299 TRUE >>> 21 chrX 166650296 FALSE >>> 22 chrY 15902555 FALSE >>> >>> >>> I ran it again without the "exonRankAttributeName" argument and I got: >>> >>>> txdb <-makeTranscriptDbFromGFF(file = annFile, format = "gtf", >>> + chrominfo = chrominfo, >>> + dataSource = "UCSC", >>> + species = "Mus musculus") >>> extracting transcript information >>> Estimating transcript ranges. >>> Extracting gene IDs >>> Processing splicing information for gtf file. >>> Deducing exon rank from relative coordinates provided >>> Prepare the 'metadata' data frame ... metadata: OK >>> Error in .checkForeignKey(transcripts_tx_chrom, NA, "transcripts$tx_chrom", >>> : >>> all the values in 'transcripts$tx_chrom' must be present in >>> 'chrominfo$chrom' >>> In addition: Warning message: >>> In .deduceExonRankings(exs, format = "gtf") : >>> Infering Exon Rankings. If this is not what you expected, then please be >>> sure that you have provided a valid attribute for exonRankAttributeName >>> >>> >>> Without the "chrominfo" argument I got the same error message as the first >>> time: >>> >>>> txdb <-makeTranscriptDbFromGFF(file = annFile, format = "gtf", >>> + exonRankAttributeName = "exon_number", >>> + dataSource = "UCSC", >>> + species = "Mus musculus") >>> >>> Error in data.frame(..., check.names = FALSE) : >>> arguments imply differing number of rows: 541775, 0 >>> >>> >>> Finally when I eliminated both the "exonRankAttributeName" and the >>> "chrominfo" arguments it worked but the warning reminded me of the >>> "exonRankAttributeName" argument and the chromosome names are now different >>> from the ones in the genome file and there is no info on their length >>> >>>> txdb <-makeTranscriptDbFromGFF(file = annFile, format = "gtf", >>> + dataSource = "UCSC", >>> + species = "Mus musculus") >>> extracting transcript information >>> Estimating transcript ranges. >>> Extracting gene IDs >>> Processing splicing information for gtf file. >>> Deducing exon rank from relative coordinates provided >>> Prepare the 'metadata' data frame ... metadata: OK >>> Now generating chrominfo from available sequence names. No chromosome length >>> information is available. >>> Warning messages: >>> 1: In .deduceExonRankings(exs, format = "gtf") : >>> Infering Exon Rankings. If this is not what you expected, then please be >>> sure that you have provided a valid attribute for exonRankAttributeName >>> 2: In matchCircularity(chroms, circ_seqs) : >>> None of the strings in your circ_seqs argument match your seqnames. >>> >>>> seqinfo(txdb) >>> Seqinfo of length 32 >>> seqnames seqlengths isCircular genome >>> chr13 <na> FALSE <na> >>> chr9 <na> FALSE <na> >>> chr6 <na> FALSE <na> >>> chrX <na> FALSE <na> >>> chr17 <na> FALSE <na> >>> chr2 <na> FALSE <na> >>> chr7 <na> FALSE <na> >>> chr18 <na> FALSE <na> >>> chr8 <na> FALSE <na> >>> ... ... ... ... >>> chrY_random <na> FALSE <na> >>> chrX_random <na> FALSE <na> >>> chr5_random <na> FALSE <na> >>> chr4_random <na> FALSE <na> >>> chrY <na> FALSE <na> >>> chr7_random <na> FALSE <na> >>> chr17_random <na> FALSE <na> >>> chr13_random <na> FALSE <na> >>> chr1_random <na> FALSE <na> >>> >>> >>> >>> >>> What am I doing wrong in my original call to makeTranscriptDbFromGFF? >>> >>> txdb <-makeTranscriptDbFromGFF(file = annFile, format = "gtf", >>> exonRankAttributeName = "exon_number", >>> chrominfo = chrominfo, >>> dataSource = "UCSC", >>> species = "Mus musculus") >>> >>> Why am I getting this unfair error message? >>> Thank you for your help >>> Ugo >>> >>> >>> >>> >>> >>> >>> >>> >>> FYI, this is my annFile (My gtf annotation file was downloaded together >>> with a fasta file containing the mouse genome from UCSC): >>> >>>> annFile >>> GRanges with 595632 ranges and 9 metadata columns: >>> seqnames ranges strand | source type >>> score phase >>> <rle> <iranges> <rle> | <factor> <factor> >>> <numeric> <integer> >>> [1] chr1 [3204563, 3207049] - | unknown exon >>> <na> <na> >>> [2] chr1 [3206103, 3206105] - | unknown stop_codon >>> <na> <na> >>> [3] chr1 [3206106, 3207049] - | unknown CDS >>> <na> 2 >>> [4] chr1 [3411783, 3411982] - | unknown CDS >>> <na> 1 >>> [5] chr1 [3411783, 3411982] - | unknown exon >>> <na> <na> >>> ... ... ... ... ... ... ... >>> ... ... >>> [595628] chrY_random [54422360, 54422362] + | unknown stop_codon >>> <na> <na> >>> [595629] chrY_random [58501955, 58502946] + | unknown exon >>> <na> <na> >>> [595630] chrY_random [58502132, 58502812] + | unknown CDS >>> <na> 0 >>> [595631] chrY_random [58502132, 58502134] + | unknown start_codon >>> <na> <na> >>> [595632] chrY_random [58502813, 58502815] + | unknown stop_codon >>> <na> <na> >>> gene_id transcript_id gene_name p_id tss_id >>> <character> <character> <character> <character> <character> >>> [1] Xkr4 NM_001011874 Xkr4 P2739 TSS1881 >>> [2] Xkr4 NM_001011874 Xkr4 P2739 TSS1881 >>> [3] Xkr4 NM_001011874 Xkr4 P2739 TSS1881 >>> [4] Xkr4 NM_001011874 Xkr4 P2739 TSS1881 >>> [5] Xkr4 NM_001011874 Xkr4 P2739 TSS1881 >>> ... ... ... ... ... ... >>> [595628] LOC100039753 NM_001017394 LOC100039753 P10196 TSS19491 >>> [595629] LOC100039614 NM_001160137_4 LOC100039614 P22060 TSS4342 >>> [595630] LOC100039614 NM_001160137_4 LOC100039614 P22060 TSS4342 >>> [595631] LOC100039614 NM_001160137_4 LOC100039614 P22060 TSS4342 >>> [595632] LOC100039614 NM_001160137_4 LOC100039614 P22060 TSS4342 >>> --- >>> seqlengths: >>> chr1 chr10 chr11 chr12 ... chrX_random >>> chrY chrY_random >>> NA NA NA NA ... NA >>> NA NA >>> >>> >>>> sessionInfo() >>> R version 3.0.0 (2013-04-03) >>> Platform: x86_64-apple-darwin10.8.0 (64-bit) >>> >>> locale: >>> [1] en_US.UTF-8/en_US.UTF-8/en_US.UTF-8/C/en_US.UTF-8/en_US.UTF-8 >>> >>> attached base packages: >>> [1] parallel stats graphics grDevices utils datasets methods >>> base >>> >>> other attached packages: >>> [1] rtracklayer_1.20.1 Rbowtie_1.0.2 Rsamtools_1.12.2 >>> Biostrings_2.28.0 >>> [5] GenomicFeatures_1.12.0 AnnotationDbi_1.22.1 Biobase_2.20.0 >>> GenomicRanges_1.12.2 >>> [9] IRanges_1.18.0 BiocGenerics_0.6.0 >>> >>> loaded via a namespace (and not attached): >>> [1] BiocInstaller_1.10.0 biomaRt_2.16.0 bitops_1.0-5 >>> BSgenome_1.28.0 >>> [5] DBI_0.2-5 grid_3.0.0 hwriter_1.3 >>> lattice_0.20-15 >>> [9] RCurl_1.95-4.1 RSQLite_0.11.3 ShortRead_1.18.0 >>> stats4_3.0.0 >>> [13] tools_3.0.0 XML_3.95-0.2 zlibbioc_1.6.0 >>> >>> _______________________________________________ >>> Bioconductor mailing list >>> Bioconductor at r-project.org >>> https://stat.ethz.ch/mailman/listinfo/bioconductor >>> Search the archives: >>> http://news.gmane.org/gmane.science.biology.informatics.conductor >> _______________________________________________ >> Bioconductor mailing list >> Bioconductor at r-project.org >> https://stat.ethz.ch/mailman/listinfo/bioconductor >> Search the archives: >> http://news.gmane.org/gmane.science.biology.informatics.conductor >
ADD REPLYlink written 4.6 years ago by Marc Carlson7.2k
Dear Carl, Thank you very much; it makes sense now. To quantify gene expression of my RNASeq samples I use your TxDb.Mmusculus.UCSC.mm9.knownGene annotation together with the alignments to the BSgenome.Mmusculus.UCSC.mm9 genome. Now that I used the UCSC mm9 mouse genome from the Illumina iGenomes I wanted to use their annotation. Anyway, I have now a general question. For the mouse mm9 genome (genome.fa) obtained from Illumina iGenomes > genome<- scanFaIndex('genome.fa') > seqlengths(genome) chr10 chr11 chr12 chr13 chr14 chr15 chr16 chr17 chr18 chr19 129993255 121843856 121257530 120284312 125194864 103494974 98319150 95272651 90772031 61342430 chr1 chr2 chr3 chr4 chr5 chr6 chr7 chr8 chr9 chrM 197195432 181748087 159599783 155630120 152537259 149517037 152524553 131738871 124076172 16299 chrX chrY 166650296 15902555 For your TxDb.Mmusculus.UCSC.mm9.knownGene > seqlengths( TxDb.Mmusculus.UCSC.mm9.knownGene) chr1 chr2 chr3 chr4 chr5 chr6 chr7 chr8 chr9 197195432 181748087 159599783 155630120 152537259 149517037 152524553 131738871 124076172 chr10 chr11 chr12 chr13 chr14 chr15 chr16 chr17 chr18 129993255 121843856 121257530 120284312 125194864 103494974 98319150 95272651 90772031 chr19 chrX chrY chrM chr1_random chr3_random chr4_random chr5_random chr7_random 61342430 166650296 15902555 16299 1231697 41899 160594 357350 362490 chr8_random chr9_random chr13_random chr16_random chr17_random chrX_random chrY_random chrUn_random 849593 449403 400311 3994 628739 1785075 58682461 5900358 So. I want to filter out the '_random' stuff; is this the only and right way to do it? > ann<- TxDb.Mmusculus.UCSC.mm9.knownGene > isActiveSeq(ann)[seqlevels(ann)] <- FALSE > isActiveSeq(ann) <- c("chr10"=TRUE, "chr11"=TRUE, + "chr12"=TRUE,"chr13"=TRUE, + "chr14"=TRUE,"chr15"=TRUE, + "chr16"=TRUE, "chr17"=TRUE, + "chr18"=TRUE, "chr19"=TRUE, + "chr1"=TRUE, "chr2"=TRUE, + "chr3"=TRUE, "chr4"=TRUE, + "chr5"=TRUE, "chr6"=TRUE, + "chr7"=TRUE, "chr8"=TRUE, + "chr9"=TRUE, "chrM"=TRUE, + "chrX"=TRUE, "chrY"=TRUE) > Then get my gene info this way? > genesInfo<- exons(ann, columns='gene_id') How can I add also gene names or symbols? Thank you again for your help. Ugo > From: Marc Carlson <mcarlson at="" fhcrc.org=""> > Date: Thu, 02 May 2013 15:52:38 -0700 > To: Ugo Borello <ugo.borello at="" inserm.fr=""> > Cc: <bioconductor at="" r-project.org=""> > Subject: Re: [BioC] MakeTranscriptDbFromGFF > > Hi Ugo, > > The 15 GB tarball you sent me to contains several different GTF files > for genes. I grabbed this one as it seemed to be the most recent: > > Mus_musculus/UCSC/mm9/Annotation/Archives/archive-2013-03-06-15-01-2 4/Genes/ge > nes.gtf > > So looking at this file I can reproduce the problem you mentioned. And > it shows me two problems. The 1st problem is that the only field that > seems to contain any information about exon positions is called phase > (and not "exon_number" as was in the code I see from before). There is > not actually any field called "exon_number" in this file. Either way, > one thing you can check is to make sure that the string you give here is > the same as the appropriate field name that is used by the file. There > is no way to know this information in advance since GTF does not specify > how to encode this information (and in fact the information is entirely > optional). > > The second problem is that even "phase" can't work right now since the > authors of this gtf file have decided to only associate the exon rank > information only with CDS and never with exons features. So there is > not any actual 'exon' position information in this file, only > information for CDS positions. Now that I see people doing these files > in this way, I plan to enhance the parser so that it can process files > of this kind. > > > Is there a reason why you wanted to use this file and not the data > contained in this package here? > > http://www.bioconductor.org/packages/2.12/data/annotation/html/TxDb. Mmusculus. > UCSC.mm9.knownGene.html > > > Marc > > > > On 05/02/2013 02:59 AM, Ugo Borello wrote: >> Dear Marc >> Sorry I was not precise on the origin of the gtf annotation file; I got the >> gtf file from here: >> http://tophat.cbcb.umd.edu/igenomes.shtml >> >> And more precisely from the Mus musculus/UCSC/mm9 folder >> Here the description of the content of the folder: >> ftp://igenome:G3nom3s4u at ussd-ftp.illumina.com/README.txt >> >> I realized reading the README.txt file that actually the genes.gtf file I >> used is the Ensembl annotation of the mm9 release. >> So, I changed dataSource = "Ensembl" in the function call and I got the same >> error message: >> Error in data.frame(..., check.names = FALSE) : >> arguments imply differing number of rows: 541775, 0 >> >> At the end of my previous email you have the result of calling: >> annFile<- import.gff('genes.gtf', format='gtf', asRangedData=FALSE) >> >> >> Thank you >> >> Ugo >> >>> From: Marc Carlson <mcarlson at="" fhcrc.org=""> >>> Date: Wed, 01 May 2013 15:33:02 -0700 >>> To: <bioconductor at="" r-project.org=""> >>> Subject: Re: [BioC] MakeTranscriptDbFromGFF >>> >>> Hi Ugo, >>> >>> Which UCSC file was it that you were trying to process? >>> >>> >>> Marc >>> >>> >>> >>> On 05/01/2013 02:21 AM, Ugo Borello wrote: >>>> Good morning, >>>> >>>> I have a little problem creating a TranscriptDb object using the function >>>> makeTranscriptDbFromGFF. I want to use this annotation to count the >>>> overlaps >>>> of my genomic alignments with genes. >>>> >>>> >>>> I ran: >>>> >>>>> txdb <-makeTranscriptDbFromGFF(file = annFile, format = "gtf", >>>> + exonRankAttributeName = "exon_number", >>>> + chrominfo = chrominfo, >>>> + dataSource = "UCSC", >>>> + species = "Mus musculus") >>>> >>>> And I got this error message: >>>> Error in data.frame(..., check.names = FALSE) : >>>> arguments imply differing number of rows: 541775, 0 >>>> >>>> >>>> "chrominfo" was (info retrieved from the fasta genome file): >>>> >>>>> chrominfo >>>> chrom length is_circular >>>> 1 chr10 129993255 FALSE >>>> 2 chr11 121843856 FALSE >>>> 3 chr12 121257530 FALSE >>>> 4 chr13 120284312 FALSE >>>> 5 chr14 125194864 FALSE >>>> 6 chr15 103494974 FALSE >>>> 7 chr16 98319150 FALSE >>>> 8 chr17 95272651 FALSE >>>> 9 chr18 90772031 FALSE >>>> 10 chr19 61342430 FALSE >>>> 11 chr1 197195432 FALSE >>>> 12 chr2 181748087 FALSE >>>> 13 chr3 159599783 FALSE >>>> 14 chr4 155630120 FALSE >>>> 15 chr5 152537259 FALSE >>>> 16 chr6 149517037 FALSE >>>> 17 chr7 152524553 FALSE >>>> 18 chr8 131738871 FALSE >>>> 19 chr9 124076172 FALSE >>>> 20 chrM 16299 TRUE >>>> 21 chrX 166650296 FALSE >>>> 22 chrY 15902555 FALSE >>>> >>>> >>>> I ran it again without the "exonRankAttributeName" argument and I got: >>>> >>>>> txdb <-makeTranscriptDbFromGFF(file = annFile, format = "gtf", >>>> + chrominfo = chrominfo, >>>> + dataSource = "UCSC", >>>> + species = "Mus musculus") >>>> extracting transcript information >>>> Estimating transcript ranges. >>>> Extracting gene IDs >>>> Processing splicing information for gtf file. >>>> Deducing exon rank from relative coordinates provided >>>> Prepare the 'metadata' data frame ... metadata: OK >>>> Error in .checkForeignKey(transcripts_tx_chrom, NA, "transcripts$tx_chrom", >>>> : >>>> all the values in 'transcripts$tx_chrom' must be present in >>>> 'chrominfo$chrom' >>>> In addition: Warning message: >>>> In .deduceExonRankings(exs, format = "gtf") : >>>> Infering Exon Rankings. If this is not what you expected, then please >>>> be >>>> sure that you have provided a valid attribute for exonRankAttributeName >>>> >>>> >>>> Without the "chrominfo" argument I got the same error message as the first >>>> time: >>>> >>>>> txdb <-makeTranscriptDbFromGFF(file = annFile, format = "gtf", >>>> + exonRankAttributeName = "exon_number", >>>> + dataSource = "UCSC", >>>> + species = "Mus musculus") >>>> >>>> Error in data.frame(..., check.names = FALSE) : >>>> arguments imply differing number of rows: 541775, 0 >>>> >>>> >>>> Finally when I eliminated both the "exonRankAttributeName" and the >>>> "chrominfo" arguments it worked but the warning reminded me of the >>>> "exonRankAttributeName" argument and the chromosome names are now different >>>> from the ones in the genome file and there is no info on their length >>>> >>>>> txdb <-makeTranscriptDbFromGFF(file = annFile, format = "gtf", >>>> + dataSource = "UCSC", >>>> + species = "Mus musculus") >>>> extracting transcript information >>>> Estimating transcript ranges. >>>> Extracting gene IDs >>>> Processing splicing information for gtf file. >>>> Deducing exon rank from relative coordinates provided >>>> Prepare the 'metadata' data frame ... metadata: OK >>>> Now generating chrominfo from available sequence names. No chromosome >>>> length >>>> information is available. >>>> Warning messages: >>>> 1: In .deduceExonRankings(exs, format = "gtf") : >>>> Infering Exon Rankings. If this is not what you expected, then please >>>> be >>>> sure that you have provided a valid attribute for exonRankAttributeName >>>> 2: In matchCircularity(chroms, circ_seqs) : >>>> None of the strings in your circ_seqs argument match your seqnames. >>>> >>>>> seqinfo(txdb) >>>> Seqinfo of length 32 >>>> seqnames seqlengths isCircular genome >>>> chr13 <na> FALSE <na> >>>> chr9 <na> FALSE <na> >>>> chr6 <na> FALSE <na> >>>> chrX <na> FALSE <na> >>>> chr17 <na> FALSE <na> >>>> chr2 <na> FALSE <na> >>>> chr7 <na> FALSE <na> >>>> chr18 <na> FALSE <na> >>>> chr8 <na> FALSE <na> >>>> ... ... ... ... >>>> chrY_random <na> FALSE <na> >>>> chrX_random <na> FALSE <na> >>>> chr5_random <na> FALSE <na> >>>> chr4_random <na> FALSE <na> >>>> chrY <na> FALSE <na> >>>> chr7_random <na> FALSE <na> >>>> chr17_random <na> FALSE <na> >>>> chr13_random <na> FALSE <na> >>>> chr1_random <na> FALSE <na> >>>> >>>> >>>> >>>> >>>> What am I doing wrong in my original call to makeTranscriptDbFromGFF? >>>> >>>> txdb <-makeTranscriptDbFromGFF(file = annFile, format = "gtf", >>>> exonRankAttributeName = "exon_number", >>>> chrominfo = chrominfo, >>>> dataSource = "UCSC", >>>> species = "Mus musculus") >>>> >>>> Why am I getting this unfair error message? >>>> Thank you for your help >>>> Ugo >>>> >>>> >>>> >>>> >>>> >>>> >>>> >>>> >>>> FYI, this is my annFile (My gtf annotation file was downloaded together >>>> with a fasta file containing the mouse genome from UCSC): >>>> >>>>> annFile >>>> GRanges with 595632 ranges and 9 metadata columns: >>>> seqnames ranges strand | source >>>> type >>>> score phase >>>> <rle> <iranges> <rle> | <factor> >>>> <factor> >>>> <numeric> <integer> >>>> [1] chr1 [3204563, 3207049] - | unknown >>>> exon >>>> <na> <na> >>>> [2] chr1 [3206103, 3206105] - | unknown >>>> stop_codon >>>> <na> <na> >>>> [3] chr1 [3206106, 3207049] - | unknown >>>> CDS >>>> <na> 2 >>>> [4] chr1 [3411783, 3411982] - | unknown >>>> CDS >>>> <na> 1 >>>> [5] chr1 [3411783, 3411982] - | unknown >>>> exon >>>> <na> <na> >>>> ... ... ... ... ... ... >>>> ... >>>> ... ... >>>> [595628] chrY_random [54422360, 54422362] + | unknown >>>> stop_codon >>>> <na> <na> >>>> [595629] chrY_random [58501955, 58502946] + | unknown >>>> exon >>>> <na> <na> >>>> [595630] chrY_random [58502132, 58502812] + | unknown >>>> CDS >>>> <na> 0 >>>> [595631] chrY_random [58502132, 58502134] + | unknown >>>> start_codon >>>> <na> <na> >>>> [595632] chrY_random [58502813, 58502815] + | unknown >>>> stop_codon >>>> <na> <na> >>>> gene_id transcript_id gene_name p_id >>>> tss_id >>>> <character> <character> <character> <character> >>>> <character> >>>> [1] Xkr4 NM_001011874 Xkr4 P2739 >>>> TSS1881 >>>> [2] Xkr4 NM_001011874 Xkr4 P2739 >>>> TSS1881 >>>> [3] Xkr4 NM_001011874 Xkr4 P2739 >>>> TSS1881 >>>> [4] Xkr4 NM_001011874 Xkr4 P2739 >>>> TSS1881 >>>> [5] Xkr4 NM_001011874 Xkr4 P2739 >>>> TSS1881 >>>> ... ... ... ... ... >>>> ... >>>> [595628] LOC100039753 NM_001017394 LOC100039753 P10196 >>>> TSS19491 >>>> [595629] LOC100039614 NM_001160137_4 LOC100039614 P22060 >>>> TSS4342 >>>> [595630] LOC100039614 NM_001160137_4 LOC100039614 P22060 >>>> TSS4342 >>>> [595631] LOC100039614 NM_001160137_4 LOC100039614 P22060 >>>> TSS4342 >>>> [595632] LOC100039614 NM_001160137_4 LOC100039614 P22060 >>>> TSS4342 >>>> --- >>>> seqlengths: >>>> chr1 chr10 chr11 chr12 ... chrX_random >>>> chrY chrY_random >>>> NA NA NA NA ... NA >>>> NA NA >>>> >>>> >>>>> sessionInfo() >>>> R version 3.0.0 (2013-04-03) >>>> Platform: x86_64-apple-darwin10.8.0 (64-bit) >>>> >>>> locale: >>>> [1] en_US.UTF-8/en_US.UTF-8/en_US.UTF-8/C/en_US.UTF-8/en_US.UTF-8 >>>> >>>> attached base packages: >>>> [1] parallel stats graphics grDevices utils datasets methods >>>> base >>>> >>>> other attached packages: >>>> [1] rtracklayer_1.20.1 Rbowtie_1.0.2 Rsamtools_1.12.2 >>>> Biostrings_2.28.0 >>>> [5] GenomicFeatures_1.12.0 AnnotationDbi_1.22.1 Biobase_2.20.0 >>>> GenomicRanges_1.12.2 >>>> [9] IRanges_1.18.0 BiocGenerics_0.6.0 >>>> >>>> loaded via a namespace (and not attached): >>>> [1] BiocInstaller_1.10.0 biomaRt_2.16.0 bitops_1.0-5 >>>> BSgenome_1.28.0 >>>> [5] DBI_0.2-5 grid_3.0.0 hwriter_1.3 >>>> lattice_0.20-15 >>>> [9] RCurl_1.95-4.1 RSQLite_0.11.3 ShortRead_1.18.0 >>>> stats4_3.0.0 >>>> [13] tools_3.0.0 XML_3.95-0.2 zlibbioc_1.6.0 >>>> >>>> _______________________________________________ >>>> Bioconductor mailing list >>>> Bioconductor at r-project.org >>>> https://stat.ethz.ch/mailman/listinfo/bioconductor >>>> Search the archives: >>>> http://news.gmane.org/gmane.science.biology.informatics.conductor >>> _______________________________________________ >>> Bioconductor mailing list >>> Bioconductor at r-project.org >>> https://stat.ethz.ch/mailman/listinfo/bioconductor >>> Search the archives: >>> http://news.gmane.org/gmane.science.biology.informatics.conductor >> >
ADD REPLYlink written 4.6 years ago by Ugo Borello340
Hi Ugo, On 05/03/2013 04:00 AM, Ugo Borello wrote: > Dear Carl, > Thank you very much; it makes sense now. > > To quantify gene expression of my RNASeq samples I use your > TxDb.Mmusculus.UCSC.mm9.knownGene annotation together with the alignments to > the BSgenome.Mmusculus.UCSC.mm9 genome. > Now that I used the UCSC mm9 mouse genome from the Illumina iGenomes I > wanted to use their annotation. > > > Anyway, I have now a general question. > For the mouse mm9 genome (genome.fa) obtained from Illumina iGenomes >> genome<- scanFaIndex('genome.fa') >> seqlengths(genome) > chr10 chr11 chr12 chr13 chr14 chr15 chr16 > chr17 chr18 chr19 > 129993255 121843856 121257530 120284312 125194864 103494974 98319150 > 95272651 90772031 61342430 > chr1 chr2 chr3 chr4 chr5 chr6 chr7 > chr8 chr9 chrM > 197195432 181748087 159599783 155630120 152537259 149517037 152524553 > 131738871 124076172 16299 > chrX chrY > 166650296 15902555 > > For your TxDb.Mmusculus.UCSC.mm9.knownGene >> seqlengths( TxDb.Mmusculus.UCSC.mm9.knownGene) > chr1 chr2 chr3 chr4 chr5 > chr6 chr7 chr8 chr9 > 197195432 181748087 159599783 155630120 152537259 > 149517037 152524553 131738871 124076172 > chr10 chr11 chr12 chr13 chr14 > chr15 chr16 chr17 chr18 > 129993255 121843856 121257530 120284312 125194864 > 103494974 98319150 95272651 90772031 > chr19 chrX chrY chrM chr1_random > chr3_random chr4_random chr5_random chr7_random > 61342430 166650296 15902555 16299 1231697 > 41899 160594 357350 362490 > chr8_random chr9_random chr13_random chr16_random chr17_random > chrX_random chrY_random chrUn_random > 849593 449403 400311 3994 628739 > 1785075 58682461 5900358 > > > So. I want to filter out the '_random' stuff; is this the only and right way > to do it? It depends on whether you want to use the _random stuff. My impression of how people use this is that most people don't. So they would use isActiveSeq() to toggle those off. >> ann<- TxDb.Mmusculus.UCSC.mm9.knownGene >> isActiveSeq(ann)[seqlevels(ann)] <- FALSE >> isActiveSeq(ann) <- c("chr10"=TRUE, "chr11"=TRUE, > + "chr12"=TRUE,"chr13"=TRUE, > + "chr14"=TRUE,"chr15"=TRUE, > + "chr16"=TRUE, "chr17"=TRUE, > + "chr18"=TRUE, "chr19"=TRUE, > + "chr1"=TRUE, "chr2"=TRUE, > + "chr3"=TRUE, "chr4"=TRUE, > + "chr5"=TRUE, "chr6"=TRUE, > + "chr7"=TRUE, "chr8"=TRUE, > + "chr9"=TRUE, "chrM"=TRUE, > + "chrX"=TRUE, "chrY"=TRUE) > Then get my gene info this way? >> genesInfo<- exons(ann, columns='gene_id') > How can I add also gene names or symbols? Again, it depends on what you want to do. If you are dealing with just a TranscriptDb like this, then there are not gene symbols attached already. So for that object yes, you would need to do it like that. Now if you wanted gene symbols they are pretty easy to get. You can go and get gene symbols by loading an org package like this: library(org.Mm.eg.db) cols(org.Mm.eg.db) And then you could use select() to retrieve those (by using the gene IDs as keys from your TranscriptDb object). But some people find this extra step to be inconvenient, so I have created another route. And that is to use a OrganismDb object. There is a package for this already that I will use to demo here: library(Mus.musculus) cols(Mus.musculus) You will notice that this kind of object has all the stuff you want in one place. I suspect that you will find that to be a lot more convenient: This basically means that you can do the thing you were interested in like this: genesInfo <- exons(Mus.musculus, columns=c("GENEID","SYMBOL")) *But* in your case, you are using mm9, so you will want to make a custom object (the Mus.musculus object is based on mm10). This is not hard to do, but it is an extra step. You can read how to do it in section 2 of the following vignette: http://www.bioconductor.org/packages/2.12/bioc/vignettes/OrganismDbi/i nst/doc/OrganismDbi.pdf > > > > Thank you again for your help. > > Ugo > > > > >> From: Marc Carlson <mcarlson at="" fhcrc.org=""> >> Date: Thu, 02 May 2013 15:52:38 -0700 >> To: Ugo Borello <ugo.borello at="" inserm.fr=""> >> Cc: <bioconductor at="" r-project.org=""> >> Subject: Re: [BioC] MakeTranscriptDbFromGFF >> >> Hi Ugo, >> >> The 15 GB tarball you sent me to contains several different GTF files >> for genes. I grabbed this one as it seemed to be the most recent: >> >> Mus_musculus/UCSC/mm9/Annotation/Archives/archive-2013-03-06-15-01- 24/Genes/ge >> nes.gtf >> >> So looking at this file I can reproduce the problem you mentioned. And >> it shows me two problems. The 1st problem is that the only field that >> seems to contain any information about exon positions is called phase >> (and not "exon_number" as was in the code I see from before). There is >> not actually any field called "exon_number" in this file. Either way, >> one thing you can check is to make sure that the string you give here is >> the same as the appropriate field name that is used by the file. There >> is no way to know this information in advance since GTF does not specify >> how to encode this information (and in fact the information is entirely >> optional). >> >> The second problem is that even "phase" can't work right now since the >> authors of this gtf file have decided to only associate the exon rank >> information only with CDS and never with exons features. So there is >> not any actual 'exon' position information in this file, only >> information for CDS positions. Now that I see people doing these files >> in this way, I plan to enhance the parser so that it can process files >> of this kind. >> >> >> Is there a reason why you wanted to use this file and not the data >> contained in this package here? >> >> http://www.bioconductor.org/packages/2.12/data/annotation/html/TxDb .Mmusculus. >> UCSC.mm9.knownGene.html >> >> >> Marc >> >> >> >> On 05/02/2013 02:59 AM, Ugo Borello wrote: >>> Dear Marc >>> Sorry I was not precise on the origin of the gtf annotation file; I got the >>> gtf file from here: >>> http://tophat.cbcb.umd.edu/igenomes.shtml >>> >>> And more precisely from the Mus musculus/UCSC/mm9 folder >>> Here the description of the content of the folder: >>> ftp://igenome:G3nom3s4u at ussd-ftp.illumina.com/README.txt >>> >>> I realized reading the README.txt file that actually the genes.gtf file I >>> used is the Ensembl annotation of the mm9 release. >>> So, I changed dataSource = "Ensembl" in the function call and I got the same >>> error message: >>> Error in data.frame(..., check.names = FALSE) : >>> arguments imply differing number of rows: 541775, 0 >>> >>> At the end of my previous email you have the result of calling: >>> annFile<- import.gff('genes.gtf', format='gtf', asRangedData=FALSE) >>> >>> >>> Thank you >>> >>> Ugo >>> >>>> From: Marc Carlson <mcarlson at="" fhcrc.org=""> >>>> Date: Wed, 01 May 2013 15:33:02 -0700 >>>> To: <bioconductor at="" r-project.org=""> >>>> Subject: Re: [BioC] MakeTranscriptDbFromGFF >>>> >>>> Hi Ugo, >>>> >>>> Which UCSC file was it that you were trying to process? >>>> >>>> >>>> Marc >>>> >>>> >>>> >>>> On 05/01/2013 02:21 AM, Ugo Borello wrote: >>>>> Good morning, >>>>> >>>>> I have a little problem creating a TranscriptDb object using the function >>>>> makeTranscriptDbFromGFF. I want to use this annotation to count the >>>>> overlaps >>>>> of my genomic alignments with genes. >>>>> >>>>> >>>>> I ran: >>>>> >>>>>> txdb <-makeTranscriptDbFromGFF(file = annFile, format = "gtf", >>>>> + exonRankAttributeName = "exon_number", >>>>> + chrominfo = chrominfo, >>>>> + dataSource = "UCSC", >>>>> + species = "Mus musculus") >>>>> >>>>> And I got this error message: >>>>> Error in data.frame(..., check.names = FALSE) : >>>>> arguments imply differing number of rows: 541775, 0 >>>>> >>>>> >>>>> "chrominfo" was (info retrieved from the fasta genome file): >>>>> >>>>>> chrominfo >>>>> chrom length is_circular >>>>> 1 chr10 129993255 FALSE >>>>> 2 chr11 121843856 FALSE >>>>> 3 chr12 121257530 FALSE >>>>> 4 chr13 120284312 FALSE >>>>> 5 chr14 125194864 FALSE >>>>> 6 chr15 103494974 FALSE >>>>> 7 chr16 98319150 FALSE >>>>> 8 chr17 95272651 FALSE >>>>> 9 chr18 90772031 FALSE >>>>> 10 chr19 61342430 FALSE >>>>> 11 chr1 197195432 FALSE >>>>> 12 chr2 181748087 FALSE >>>>> 13 chr3 159599783 FALSE >>>>> 14 chr4 155630120 FALSE >>>>> 15 chr5 152537259 FALSE >>>>> 16 chr6 149517037 FALSE >>>>> 17 chr7 152524553 FALSE >>>>> 18 chr8 131738871 FALSE >>>>> 19 chr9 124076172 FALSE >>>>> 20 chrM 16299 TRUE >>>>> 21 chrX 166650296 FALSE >>>>> 22 chrY 15902555 FALSE >>>>> >>>>> >>>>> I ran it again without the "exonRankAttributeName" argument and I got: >>>>> >>>>>> txdb <-makeTranscriptDbFromGFF(file = annFile, format = "gtf", >>>>> + chrominfo = chrominfo, >>>>> + dataSource = "UCSC", >>>>> + species = "Mus musculus") >>>>> extracting transcript information >>>>> Estimating transcript ranges. >>>>> Extracting gene IDs >>>>> Processing splicing information for gtf file. >>>>> Deducing exon rank from relative coordinates provided >>>>> Prepare the 'metadata' data frame ... metadata: OK >>>>> Error in .checkForeignKey(transcripts_tx_chrom, NA, "transcripts$tx_chrom", >>>>> : >>>>> all the values in 'transcripts$tx_chrom' must be present in >>>>> 'chrominfo$chrom' >>>>> In addition: Warning message: >>>>> In .deduceExonRankings(exs, format = "gtf") : >>>>> Infering Exon Rankings. If this is not what you expected, then please >>>>> be >>>>> sure that you have provided a valid attribute for exonRankAttributeName >>>>> >>>>> >>>>> Without the "chrominfo" argument I got the same error message as the first >>>>> time: >>>>> >>>>>> txdb <-makeTranscriptDbFromGFF(file = annFile, format = "gtf", >>>>> + exonRankAttributeName = "exon_number", >>>>> + dataSource = "UCSC", >>>>> + species = "Mus musculus") >>>>> >>>>> Error in data.frame(..., check.names = FALSE) : >>>>> arguments imply differing number of rows: 541775, 0 >>>>> >>>>> >>>>> Finally when I eliminated both the "exonRankAttributeName" and the >>>>> "chrominfo" arguments it worked but the warning reminded me of the >>>>> "exonRankAttributeName" argument and the chromosome names are now different >>>>> from the ones in the genome file and there is no info on their length >>>>> >>>>>> txdb <-makeTranscriptDbFromGFF(file = annFile, format = "gtf", >>>>> + dataSource = "UCSC", >>>>> + species = "Mus musculus") >>>>> extracting transcript information >>>>> Estimating transcript ranges. >>>>> Extracting gene IDs >>>>> Processing splicing information for gtf file. >>>>> Deducing exon rank from relative coordinates provided >>>>> Prepare the 'metadata' data frame ... metadata: OK >>>>> Now generating chrominfo from available sequence names. No chromosome >>>>> length >>>>> information is available. >>>>> Warning messages: >>>>> 1: In .deduceExonRankings(exs, format = "gtf") : >>>>> Infering Exon Rankings. If this is not what you expected, then please >>>>> be >>>>> sure that you have provided a valid attribute for exonRankAttributeName >>>>> 2: In matchCircularity(chroms, circ_seqs) : >>>>> None of the strings in your circ_seqs argument match your seqnames. >>>>> >>>>>> seqinfo(txdb) >>>>> Seqinfo of length 32 >>>>> seqnames seqlengths isCircular genome >>>>> chr13 <na> FALSE <na> >>>>> chr9 <na> FALSE <na> >>>>> chr6 <na> FALSE <na> >>>>> chrX <na> FALSE <na> >>>>> chr17 <na> FALSE <na> >>>>> chr2 <na> FALSE <na> >>>>> chr7 <na> FALSE <na> >>>>> chr18 <na> FALSE <na> >>>>> chr8 <na> FALSE <na> >>>>> ... ... ... ... >>>>> chrY_random <na> FALSE <na> >>>>> chrX_random <na> FALSE <na> >>>>> chr5_random <na> FALSE <na> >>>>> chr4_random <na> FALSE <na> >>>>> chrY <na> FALSE <na> >>>>> chr7_random <na> FALSE <na> >>>>> chr17_random <na> FALSE <na> >>>>> chr13_random <na> FALSE <na> >>>>> chr1_random <na> FALSE <na> >>>>> >>>>> >>>>> >>>>> >>>>> What am I doing wrong in my original call to makeTranscriptDbFromGFF? >>>>> >>>>> txdb <-makeTranscriptDbFromGFF(file = annFile, format = "gtf", >>>>> exonRankAttributeName = "exon_number", >>>>> chrominfo = chrominfo, >>>>> dataSource = "UCSC", >>>>> species = "Mus musculus") >>>>> >>>>> Why am I getting this unfair error message? >>>>> Thank you for your help >>>>> Ugo >>>>> >>>>> >>>>> >>>>> >>>>> >>>>> >>>>> >>>>> >>>>> FYI, this is my annFile (My gtf annotation file was downloaded together >>>>> with a fasta file containing the mouse genome from UCSC): >>>>> >>>>>> annFile >>>>> GRanges with 595632 ranges and 9 metadata columns: >>>>> seqnames ranges strand | source >>>>> type >>>>> score phase >>>>> <rle> <iranges> <rle> | <factor> >>>>> <factor> >>>>> <numeric> <integer> >>>>> [1] chr1 [3204563, 3207049] - | unknown >>>>> exon >>>>> <na> <na> >>>>> [2] chr1 [3206103, 3206105] - | unknown >>>>> stop_codon >>>>> <na> <na> >>>>> [3] chr1 [3206106, 3207049] - | unknown >>>>> CDS >>>>> <na> 2 >>>>> [4] chr1 [3411783, 3411982] - | unknown >>>>> CDS >>>>> <na> 1 >>>>> [5] chr1 [3411783, 3411982] - | unknown >>>>> exon >>>>> <na> <na> >>>>> ... ... ... ... ... ... >>>>> ... >>>>> ... ... >>>>> [595628] chrY_random [54422360, 54422362] + | unknown >>>>> stop_codon >>>>> <na> <na> >>>>> [595629] chrY_random [58501955, 58502946] + | unknown >>>>> exon >>>>> <na> <na> >>>>> [595630] chrY_random [58502132, 58502812] + | unknown >>>>> CDS >>>>> <na> 0 >>>>> [595631] chrY_random [58502132, 58502134] + | unknown >>>>> start_codon >>>>> <na> <na> >>>>> [595632] chrY_random [58502813, 58502815] + | unknown >>>>> stop_codon >>>>> <na> <na> >>>>> gene_id transcript_id gene_name p_id >>>>> tss_id >>>>> <character> <character> <character> <character> >>>>> <character> >>>>> [1] Xkr4 NM_001011874 Xkr4 P2739 >>>>> TSS1881 >>>>> [2] Xkr4 NM_001011874 Xkr4 P2739 >>>>> TSS1881 >>>>> [3] Xkr4 NM_001011874 Xkr4 P2739 >>>>> TSS1881 >>>>> [4] Xkr4 NM_001011874 Xkr4 P2739 >>>>> TSS1881 >>>>> [5] Xkr4 NM_001011874 Xkr4 P2739 >>>>> TSS1881 >>>>> ... ... ... ... ... >>>>> ... >>>>> [595628] LOC100039753 NM_001017394 LOC100039753 P10196 >>>>> TSS19491 >>>>> [595629] LOC100039614 NM_001160137_4 LOC100039614 P22060 >>>>> TSS4342 >>>>> [595630] LOC100039614 NM_001160137_4 LOC100039614 P22060 >>>>> TSS4342 >>>>> [595631] LOC100039614 NM_001160137_4 LOC100039614 P22060 >>>>> TSS4342 >>>>> [595632] LOC100039614 NM_001160137_4 LOC100039614 P22060 >>>>> TSS4342 >>>>> --- >>>>> seqlengths: >>>>> chr1 chr10 chr11 chr12 ... chrX_random >>>>> chrY chrY_random >>>>> NA NA NA NA ... NA >>>>> NA NA >>>>> >>>>> >>>>>> sessionInfo() >>>>> R version 3.0.0 (2013-04-03) >>>>> Platform: x86_64-apple-darwin10.8.0 (64-bit) >>>>> >>>>> locale: >>>>> [1] en_US.UTF-8/en_US.UTF-8/en_US.UTF-8/C/en_US.UTF-8/en_US.UTF-8 >>>>> >>>>> attached base packages: >>>>> [1] parallel stats graphics grDevices utils datasets methods >>>>> base >>>>> >>>>> other attached packages: >>>>> [1] rtracklayer_1.20.1 Rbowtie_1.0.2 Rsamtools_1.12.2 >>>>> Biostrings_2.28.0 >>>>> [5] GenomicFeatures_1.12.0 AnnotationDbi_1.22.1 Biobase_2.20.0 >>>>> GenomicRanges_1.12.2 >>>>> [9] IRanges_1.18.0 BiocGenerics_0.6.0 >>>>> >>>>> loaded via a namespace (and not attached): >>>>> [1] BiocInstaller_1.10.0 biomaRt_2.16.0 bitops_1.0-5 >>>>> BSgenome_1.28.0 >>>>> [5] DBI_0.2-5 grid_3.0.0 hwriter_1.3 >>>>> lattice_0.20-15 >>>>> [9] RCurl_1.95-4.1 RSQLite_0.11.3 ShortRead_1.18.0 >>>>> stats4_3.0.0 >>>>> [13] tools_3.0.0 XML_3.95-0.2 zlibbioc_1.6.0 >>>>> >>>>> _______________________________________________ >>>>> Bioconductor mailing list >>>>> Bioconductor at r-project.org >>>>> https://stat.ethz.ch/mailman/listinfo/bioconductor >>>>> Search the archives: >>>>> http://news.gmane.org/gmane.science.biology.informatics.conductor >>>> _______________________________________________ >>>> Bioconductor mailing list >>>> Bioconductor at r-project.org >>>> https://stat.ethz.ch/mailman/listinfo/bioconductor >>>> Search the archives: >>>> http://news.gmane.org/gmane.science.biology.informatics.conductor >
ADD REPLYlink written 4.6 years ago by Marc Carlson7.2k
Thank you very much, Marc. I will work on it. Ugo Quoting Marc Carlson <mcarlson at="" fhcrc.org="">: > Hi Ugo, > > > On 05/03/2013 04:00 AM, Ugo Borello wrote: >> Dear Carl, >> Thank you very much; it makes sense now. >> >> To quantify gene expression of my RNASeq samples I use your >> TxDb.Mmusculus.UCSC.mm9.knownGene annotation together with the alignments to >> the BSgenome.Mmusculus.UCSC.mm9 genome. >> Now that I used the UCSC mm9 mouse genome from the Illumina iGenomes I >> wanted to use their annotation. >> >> >> Anyway, I have now a general question. >> For the mouse mm9 genome (genome.fa) obtained from Illumina iGenomes >>> genome<- scanFaIndex('genome.fa') >>> seqlengths(genome) >> chr10 chr11 chr12 chr13 chr14 chr15 chr16 >> chr17 chr18 chr19 >> 129993255 121843856 121257530 120284312 125194864 103494974 98319150 >> 95272651 90772031 61342430 >> chr1 chr2 chr3 chr4 chr5 chr6 chr7 >> chr8 chr9 chrM >> 197195432 181748087 159599783 155630120 152537259 149517037 152524553 >> 131738871 124076172 16299 >> chrX chrY >> 166650296 15902555 >> >> For your TxDb.Mmusculus.UCSC.mm9.knownGene >>> seqlengths( TxDb.Mmusculus.UCSC.mm9.knownGene) >> chr1 chr2 chr3 chr4 chr5 >> chr6 chr7 chr8 chr9 >> 197195432 181748087 159599783 155630120 152537259 >> 149517037 152524553 131738871 124076172 >> chr10 chr11 chr12 chr13 chr14 >> chr15 chr16 chr17 chr18 >> 129993255 121843856 121257530 120284312 125194864 >> 103494974 98319150 95272651 90772031 >> chr19 chrX chrY chrM chr1_random >> chr3_random chr4_random chr5_random chr7_random >> 61342430 166650296 15902555 16299 1231697 >> 41899 160594 357350 362490 >> chr8_random chr9_random chr13_random chr16_random chr17_random >> chrX_random chrY_random chrUn_random >> 849593 449403 400311 3994 628739 >> 1785075 58682461 5900358 >> >> >> So. I want to filter out the '_random' stuff; is this the only and right way >> to do it? > > It depends on whether you want to use the _random stuff. My impression > of how people use this is that most people don't. So they would use > isActiveSeq() to toggle those off. > >>> ann<- TxDb.Mmusculus.UCSC.mm9.knownGene >>> isActiveSeq(ann)[seqlevels(ann)] <- FALSE >>> isActiveSeq(ann) <- c("chr10"=TRUE, "chr11"=TRUE, >> + "chr12"=TRUE,"chr13"=TRUE, >> + "chr14"=TRUE,"chr15"=TRUE, >> + "chr16"=TRUE, "chr17"=TRUE, >> + "chr18"=TRUE, "chr19"=TRUE, >> + "chr1"=TRUE, "chr2"=TRUE, >> + "chr3"=TRUE, "chr4"=TRUE, >> + "chr5"=TRUE, "chr6"=TRUE, >> + "chr7"=TRUE, "chr8"=TRUE, >> + "chr9"=TRUE, "chrM"=TRUE, >> + "chrX"=TRUE, "chrY"=TRUE) >> Then get my gene info this way? >>> genesInfo<- exons(ann, columns='gene_id') >> How can I add also gene names or symbols? > Again, it depends on what you want to do. If you are dealing with just > a TranscriptDb like this, then there are not gene symbols attached > already. So for that object yes, you would need to do it like that. > > > Now if you wanted gene symbols they are pretty easy to get. You can go > and get gene symbols by loading an org package like this: > > library(org.Mm.eg.db) > cols(org.Mm.eg.db) > > And then you could use select() to retrieve those (by using the gene > IDs as keys from your TranscriptDb object). But some people find this > extra step to be inconvenient, so I have created another route. And > that is to use a OrganismDb object. There is a package for this > already that I will use to demo here: > > library(Mus.musculus) > cols(Mus.musculus) > > You will notice that this kind of object has all the stuff you want in > one place. I suspect that you will find that to be a lot more > convenient: > > This basically means that you can do the thing you were interested in > like this: > > genesInfo <- exons(Mus.musculus, columns=c("GENEID","SYMBOL")) > > *But* in your case, you are using mm9, so you will want to make a > custom object (the Mus.musculus object is based on mm10). This is not > hard to do, but it is an extra step. You can read how to do it in > section 2 of the following vignette: > > http://www.bioconductor.org/packages/2.12/bioc/vignettes/OrganismDbi /inst/doc/OrganismDbi.pdf > > > >> >> >> >> Thank you again for your help. >> >> Ugo >> >> >> >> >>> From: Marc Carlson <mcarlson at="" fhcrc.org=""> >>> Date: Thu, 02 May 2013 15:52:38 -0700 >>> To: Ugo Borello <ugo.borello at="" inserm.fr=""> >>> Cc: <bioconductor at="" r-project.org=""> >>> Subject: Re: [BioC] MakeTranscriptDbFromGFF >>> >>> Hi Ugo, >>> >>> The 15 GB tarball you sent me to contains several different GTF files >>> for genes. I grabbed this one as it seemed to be the most recent: >>> >>> Mus_musculus/UCSC/mm9/Annotation/Archives/archive-2013-03-06-15-01 -24/Genes/ge >>> nes.gtf >>> >>> So looking at this file I can reproduce the problem you mentioned. And >>> it shows me two problems. The 1st problem is that the only field that >>> seems to contain any information about exon positions is called phase >>> (and not "exon_number" as was in the code I see from before). There is >>> not actually any field called "exon_number" in this file. Either way, >>> one thing you can check is to make sure that the string you give here is >>> the same as the appropriate field name that is used by the file. There >>> is no way to know this information in advance since GTF does not specify >>> how to encode this information (and in fact the information is entirely >>> optional). >>> >>> The second problem is that even "phase" can't work right now since the >>> authors of this gtf file have decided to only associate the exon rank >>> information only with CDS and never with exons features. So there is >>> not any actual 'exon' position information in this file, only >>> information for CDS positions. Now that I see people doing these files >>> in this way, I plan to enhance the parser so that it can process files >>> of this kind. >>> >>> >>> Is there a reason why you wanted to use this file and not the data >>> contained in this package here? >>> >>> http://www.bioconductor.org/packages/2.12/data/annotation/html/TxD b.Mmusculus. >>> UCSC.mm9.knownGene.html >>> >>> >>> Marc >>> >>> >>> >>> On 05/02/2013 02:59 AM, Ugo Borello wrote: >>>> Dear Marc >>>> Sorry I was not precise on the origin of the gtf annotation file; >>>> I got the >>>> gtf file from here: >>>> http://tophat.cbcb.umd.edu/igenomes.shtml >>>> >>>> And more precisely from the Mus musculus/UCSC/mm9 folder >>>> Here the description of the content of the folder: >>>> ftp://igenome:G3nom3s4u at ussd-ftp.illumina.com/README.txt >>>> >>>> I realized reading the README.txt file that actually the genes.gtf file I >>>> used is the Ensembl annotation of the mm9 release. >>>> So, I changed dataSource = "Ensembl" in the function call and I >>>> got the same >>>> error message: >>>> Error in data.frame(..., check.names = FALSE) : >>>> arguments imply differing number of rows: 541775, 0 >>>> >>>> At the end of my previous email you have the result of calling: >>>> annFile<- import.gff('genes.gtf', format='gtf', asRangedData=FALSE) >>>> >>>> >>>> Thank you >>>> >>>> Ugo >>>> >>>>> From: Marc Carlson <mcarlson at="" fhcrc.org=""> >>>>> Date: Wed, 01 May 2013 15:33:02 -0700 >>>>> To: <bioconductor at="" r-project.org=""> >>>>> Subject: Re: [BioC] MakeTranscriptDbFromGFF >>>>> >>>>> Hi Ugo, >>>>> >>>>> Which UCSC file was it that you were trying to process? >>>>> >>>>> >>>>> Marc >>>>> >>>>> >>>>> >>>>> On 05/01/2013 02:21 AM, Ugo Borello wrote: >>>>>> Good morning, >>>>>> >>>>>> I have a little problem creating a TranscriptDb object using >>>>>> the function >>>>>> makeTranscriptDbFromGFF. I want to use this annotation to count the >>>>>> overlaps >>>>>> of my genomic alignments with genes. >>>>>> >>>>>> >>>>>> I ran: >>>>>> >>>>>>> txdb <-makeTranscriptDbFromGFF(file = annFile, format = "gtf", >>>>>> + exonRankAttributeName = "exon_number", >>>>>> + chrominfo = chrominfo, >>>>>> + dataSource = "UCSC", >>>>>> + species = "Mus musculus") >>>>>> >>>>>> And I got this error message: >>>>>> Error in data.frame(..., check.names = FALSE) : >>>>>> arguments imply differing number of rows: 541775, 0 >>>>>> >>>>>> >>>>>> "chrominfo" was (info retrieved from the fasta genome file): >>>>>> >>>>>>> chrominfo >>>>>> chrom length is_circular >>>>>> 1 chr10 129993255 FALSE >>>>>> 2 chr11 121843856 FALSE >>>>>> 3 chr12 121257530 FALSE >>>>>> 4 chr13 120284312 FALSE >>>>>> 5 chr14 125194864 FALSE >>>>>> 6 chr15 103494974 FALSE >>>>>> 7 chr16 98319150 FALSE >>>>>> 8 chr17 95272651 FALSE >>>>>> 9 chr18 90772031 FALSE >>>>>> 10 chr19 61342430 FALSE >>>>>> 11 chr1 197195432 FALSE >>>>>> 12 chr2 181748087 FALSE >>>>>> 13 chr3 159599783 FALSE >>>>>> 14 chr4 155630120 FALSE >>>>>> 15 chr5 152537259 FALSE >>>>>> 16 chr6 149517037 FALSE >>>>>> 17 chr7 152524553 FALSE >>>>>> 18 chr8 131738871 FALSE >>>>>> 19 chr9 124076172 FALSE >>>>>> 20 chrM 16299 TRUE >>>>>> 21 chrX 166650296 FALSE >>>>>> 22 chrY 15902555 FALSE >>>>>> >>>>>> >>>>>> I ran it again without the "exonRankAttributeName" argument and I got: >>>>>> >>>>>>> txdb <-makeTranscriptDbFromGFF(file = annFile, format = "gtf", >>>>>> + chrominfo = chrominfo, >>>>>> + dataSource = "UCSC", >>>>>> + species = "Mus musculus") >>>>>> extracting transcript information >>>>>> Estimating transcript ranges. >>>>>> Extracting gene IDs >>>>>> Processing splicing information for gtf file. >>>>>> Deducing exon rank from relative coordinates provided >>>>>> Prepare the 'metadata' data frame ... metadata: OK >>>>>> Error in .checkForeignKey(transcripts_tx_chrom, NA, >>>>>> "transcripts$tx_chrom", >>>>>> : >>>>>> all the values in 'transcripts$tx_chrom' must be present in >>>>>> 'chrominfo$chrom' >>>>>> In addition: Warning message: >>>>>> In .deduceExonRankings(exs, format = "gtf") : >>>>>> Infering Exon Rankings. If this is not what you expected, >>>>>> then please >>>>>> be >>>>>> sure that you have provided a valid attribute for exonRankAttributeName >>>>>> >>>>>> >>>>>> Without the "chrominfo" argument I got the same error message >>>>>> as the first >>>>>> time: >>>>>> >>>>>>> txdb <-makeTranscriptDbFromGFF(file = annFile, format = "gtf", >>>>>> + exonRankAttributeName = "exon_number", >>>>>> + dataSource = "UCSC", >>>>>> + species = "Mus musculus") >>>>>> >>>>>> Error in data.frame(..., check.names = FALSE) : >>>>>> arguments imply differing number of rows: 541775, 0 >>>>>> >>>>>> >>>>>> Finally when I eliminated both the "exonRankAttributeName" and the >>>>>> "chrominfo" arguments it worked but the warning reminded me of the >>>>>> "exonRankAttributeName" argument and the chromosome names are >>>>>> now different >>>>>> from the ones in the genome file and there is no info on their length >>>>>> >>>>>>> txdb <-makeTranscriptDbFromGFF(file = annFile, format = "gtf", >>>>>> + dataSource = "UCSC", >>>>>> + species = "Mus musculus") >>>>>> extracting transcript information >>>>>> Estimating transcript ranges. >>>>>> Extracting gene IDs >>>>>> Processing splicing information for gtf file. >>>>>> Deducing exon rank from relative coordinates provided >>>>>> Prepare the 'metadata' data frame ... metadata: OK >>>>>> Now generating chrominfo from available sequence names. No chromosome >>>>>> length >>>>>> information is available. >>>>>> Warning messages: >>>>>> 1: In .deduceExonRankings(exs, format = "gtf") : >>>>>> Infering Exon Rankings. If this is not what you expected, >>>>>> then please >>>>>> be >>>>>> sure that you have provided a valid attribute for exonRankAttributeName >>>>>> 2: In matchCircularity(chroms, circ_seqs) : >>>>>> None of the strings in your circ_seqs argument match your seqnames. >>>>>> >>>>>>> seqinfo(txdb) >>>>>> Seqinfo of length 32 >>>>>> seqnames seqlengths isCircular genome >>>>>> chr13 <na> FALSE <na> >>>>>> chr9 <na> FALSE <na> >>>>>> chr6 <na> FALSE <na> >>>>>> chrX <na> FALSE <na> >>>>>> chr17 <na> FALSE <na> >>>>>> chr2 <na> FALSE <na> >>>>>> chr7 <na> FALSE <na> >>>>>> chr18 <na> FALSE <na> >>>>>> chr8 <na> FALSE <na> >>>>>> ... ... ... ... >>>>>> chrY_random <na> FALSE <na> >>>>>> chrX_random <na> FALSE <na> >>>>>> chr5_random <na> FALSE <na> >>>>>> chr4_random <na> FALSE <na> >>>>>> chrY <na> FALSE <na> >>>>>> chr7_random <na> FALSE <na> >>>>>> chr17_random <na> FALSE <na> >>>>>> chr13_random <na> FALSE <na> >>>>>> chr1_random <na> FALSE <na> >>>>>> >>>>>> >>>>>> >>>>>> >>>>>> What am I doing wrong in my original call to makeTranscriptDbFromGFF? >>>>>> >>>>>> txdb <-makeTranscriptDbFromGFF(file = annFile, format = "gtf", >>>>>> exonRankAttributeName = "exon_number", >>>>>> chrominfo = chrominfo, >>>>>> dataSource = "UCSC", >>>>>> species = "Mus musculus") >>>>>> >>>>>> Why am I getting this unfair error message? >>>>>> Thank you for your help >>>>>> Ugo >>>>>> >>>>>> >>>>>> >>>>>> >>>>>> >>>>>> >>>>>> >>>>>> >>>>>> FYI, this is my annFile (My gtf annotation file was downloaded together >>>>>> with a fasta file containing the mouse genome from UCSC): >>>>>> >>>>>>> annFile >>>>>> GRanges with 595632 ranges and 9 metadata columns: >>>>>> seqnames ranges strand | source >>>>>> type >>>>>> score phase >>>>>> <rle> <iranges> <rle> | <factor> >>>>>> <factor> >>>>>> <numeric> <integer> >>>>>> [1] chr1 [3204563, 3207049] - | unknown >>>>>> exon >>>>>> <na> <na> >>>>>> [2] chr1 [3206103, 3206105] - | unknown >>>>>> stop_codon >>>>>> <na> <na> >>>>>> [3] chr1 [3206106, 3207049] - | unknown >>>>>> CDS >>>>>> <na> 2 >>>>>> [4] chr1 [3411783, 3411982] - | unknown >>>>>> CDS >>>>>> <na> 1 >>>>>> [5] chr1 [3411783, 3411982] - | unknown >>>>>> exon >>>>>> <na> <na> >>>>>> ... ... ... ... ... ... >>>>>> ... >>>>>> ... ... >>>>>> [595628] chrY_random [54422360, 54422362] + | unknown >>>>>> stop_codon >>>>>> <na> <na> >>>>>> [595629] chrY_random [58501955, 58502946] + | unknown >>>>>> exon >>>>>> <na> <na> >>>>>> [595630] chrY_random [58502132, 58502812] + | unknown >>>>>> CDS >>>>>> <na> 0 >>>>>> [595631] chrY_random [58502132, 58502134] + | unknown >>>>>> start_codon >>>>>> <na> <na> >>>>>> [595632] chrY_random [58502813, 58502815] + | unknown >>>>>> stop_codon >>>>>> <na> <na> >>>>>> gene_id transcript_id gene_name p_id >>>>>> tss_id >>>>>> <character> <character> <character> <character> >>>>>> <character> >>>>>> [1] Xkr4 NM_001011874 Xkr4 P2739 >>>>>> TSS1881 >>>>>> [2] Xkr4 NM_001011874 Xkr4 P2739 >>>>>> TSS1881 >>>>>> [3] Xkr4 NM_001011874 Xkr4 P2739 >>>>>> TSS1881 >>>>>> [4] Xkr4 NM_001011874 Xkr4 P2739 >>>>>> TSS1881 >>>>>> [5] Xkr4 NM_001011874 Xkr4 P2739 >>>>>> TSS1881 >>>>>> ... ... ... ... ... >>>>>> ... >>>>>> [595628] LOC100039753 NM_001017394 LOC100039753 P10196 >>>>>> TSS19491 >>>>>> [595629] LOC100039614 NM_001160137_4 LOC100039614 P22060 >>>>>> TSS4342 >>>>>> [595630] LOC100039614 NM_001160137_4 LOC100039614 P22060 >>>>>> TSS4342 >>>>>> [595631] LOC100039614 NM_001160137_4 LOC100039614 P22060 >>>>>> TSS4342 >>>>>> [595632] LOC100039614 NM_001160137_4 LOC100039614 P22060 >>>>>> TSS4342 >>>>>> --- >>>>>> seqlengths: >>>>>> chr1 chr10 chr11 chr12 ... >>>>>> chrX_random >>>>>> chrY chrY_random >>>>>> NA NA NA NA ... NA >>>>>> NA NA >>>>>> >>>>>> >>>>>>> sessionInfo() >>>>>> R version 3.0.0 (2013-04-03) >>>>>> Platform: x86_64-apple-darwin10.8.0 (64-bit) >>>>>> >>>>>> locale: >>>>>> [1] en_US.UTF-8/en_US.UTF-8/en_US.UTF-8/C/en_US.UTF-8/en_US.UTF-8 >>>>>> >>>>>> attached base packages: >>>>>> [1] parallel stats graphics grDevices utils datasets methods >>>>>> base >>>>>> >>>>>> other attached packages: >>>>>> [1] rtracklayer_1.20.1 Rbowtie_1.0.2 Rsamtools_1.12.2 >>>>>> Biostrings_2.28.0 >>>>>> [5] GenomicFeatures_1.12.0 AnnotationDbi_1.22.1 Biobase_2.20.0 >>>>>> GenomicRanges_1.12.2 >>>>>> [9] IRanges_1.18.0 BiocGenerics_0.6.0 >>>>>> >>>>>> loaded via a namespace (and not attached): >>>>>> [1] BiocInstaller_1.10.0 biomaRt_2.16.0 bitops_1.0-5 >>>>>> BSgenome_1.28.0 >>>>>> [5] DBI_0.2-5 grid_3.0.0 hwriter_1.3 >>>>>> lattice_0.20-15 >>>>>> [9] RCurl_1.95-4.1 RSQLite_0.11.3 ShortRead_1.18.0 >>>>>> stats4_3.0.0 >>>>>> [13] tools_3.0.0 XML_3.95-0.2 zlibbioc_1.6.0 >>>>>> >>>>>> _______________________________________________ >>>>>> Bioconductor mailing list >>>>>> Bioconductor at r-project.org >>>>>> https://stat.ethz.ch/mailman/listinfo/bioconductor >>>>>> Search the archives: >>>>>> http://news.gmane.org/gmane.science.biology.informatics.conductor >>>>> _______________________________________________ >>>>> Bioconductor mailing list >>>>> Bioconductor at r-project.org >>>>> https://stat.ethz.ch/mailman/listinfo/bioconductor >>>>> Search the archives: >>>>> http://news.gmane.org/gmane.science.biology.informatics.conductor >>
ADD REPLYlink written 4.6 years ago by Ugo Borello340
Dear Marc, I apologize for bothering you again, but I was intrigued by the custom Mus.musculus.mm9 object. It sounds very convenient indeed. so I tried, as you suggested: gd<-c(org.Mm.eg.db='SYMBOL', TxDb.Mmusculus.UCSC.mm9.knownGene="GENEID") ## I also tried to set org.Mm.eg.db='ENTREZID' destination <- tempfile() dir.create(destination) makeOrganismPackage(pkgname = "Mus.musculus.mm9", graphData = gd, organism = "Mus musculus", version = "1.0.0", maintainer = "Package Maintainer<maintainer at="" somewhere.com="">", author = "SomeBody", destDir = destination, license = "Artistic-2.0") and I got: Error in cbind(pkgs, keys) : number of rows of matrices must match (see arg 2) I am surely missing something here. Any suggestions? And, coming back to makeTranscriptDbFromGFF. I obtained my gtf annotation file from UCSC using genePredToGtf, as the UCSC people suggest (http://genomewiki.ucsc.edu/index.php/Genes_in_gtf_or_gff_format) txdb <-makeTranscriptDbFromGFF(file = 'mm9.gtf', format = "gtf", + exonRankAttributeName = "exon_number", + chrominfo = chrominfo, + dataSource = "UCSC", + species = "Mus musculus") extracting transcript information Estimating transcript ranges. Extracting gene IDs Processing splicing information for gtf file. Prepare the 'metadata' data frame ... metadata: OK Error in .checkForeignKey(transcripts_tx_chrom, NA, "transcripts$tx_chrom", : all the values in 'transcripts$tx_chrom' must be present in 'chrominfo$chrom' Why I get this error now? > chrominfo chrom length is_circular 1 chr10 129993255 FALSE 2 chr11 121843856 FALSE 3 chr12 121257530 FALSE 4 chr13 120284312 FALSE 5 chr14 125194864 FALSE 6 chr15 103494974 FALSE 7 chr16 98319150 FALSE 8 chr17 95272651 FALSE 9 chr18 90772031 FALSE 10 chr19 61342430 FALSE 11 chr1 197195432 FALSE 12 chr2 181748087 FALSE 13 chr3 159599783 FALSE 14 chr4 155630120 FALSE 15 chr5 152537259 FALSE 16 chr6 149517037 FALSE 17 chr7 152524553 FALSE 18 chr8 131738871 FALSE 19 chr9 124076172 FALSE 20 chrM 16299 FALSE 21 chrX 166650296 FALSE 22 chrY 15902555 FALSE Thank you very much for your time and patience. Ugo P.S. This is the annotation > annFile3<- import.gff('mm9.gtf', format='gtf', asRangedData=FALSE) > annFile3 GRanges with 962651 ranges and 9 metadata columns: seqnames ranges strand | source type score phase gene_id transcript_id <rle> <iranges> <rle> | <factor> <factor> <numeric> <integer> <character> <character> [1] chr1 [3195985, 3197398] - | knownGene exon <na> <na> uc007aet.1 uc007aet.1 [2] chr1 [3203520, 3205713] - | knownGene exon <na> <na> uc007aet.1 uc007aet.1 [3] chr1 [3204563, 3207049] - | knownGene exon <na> <na> Q5GH67 uc007aeu.1 [4] chr1 [3206106, 3207049] - | knownGene CDS <na> 2 Q5GH67 uc007aeu.1 [5] chr1 [3411783, 3411982] - | knownGene exon <na> <na> Q5GH67 uc007aeu.1 ... ... ... ... ... ... ... ... ... ... ... [962647] chrY_random [58502132, 58502365] + | knownGene CDS <na> 0 Q62458 uc012htl.1 [962648] chrY_random [58502369, 58502946] + | knownGene exon <na> <na> Q62458 uc012htl.1 [962649] chrY_random [58502369, 58502812] + | knownGene CDS <na> 0 Q62458 uc012htl.1 [962650] chrY_random [58502132, 58502134] + | knownGene start_codon <na> 0 Q62458 uc012htl.1 [962651] chrY_random [58502813, 58502815] + | knownGene stop_codon <na> 0 Q62458 uc012htl.1 exon_number exon_id gene_name <numeric> <character> <character> [1] 1 uc007aet.1.1 <na> [2] 2 uc007aet.1.2 <na> [3] 1 uc007aeu.1.1 Q5GH67 [4] 1 uc007aeu.1.1 Q5GH67 [5] 2 uc007aeu.1.2 Q5GH67 ... ... ... ... [962647] 1 uc012htl.1.1 Q62458 [962648] 2 uc012htl.1.2 Q62458 [962649] 2 uc012htl.1.2 Q62458 [962650] 1 uc012htl.1.1 Q62458 [962651] 1 uc012htl.1.1 Q62458 --- seqlengths: chr1 chr10 chr11 chr12 chr13 ... chrX chrX_random chrY chrY_random NA NA NA NA NA ... NA Quoting Marc Carlson <mcarlson at="" fhcrc.org="">: > Hi Ugo, > > > On 05/03/2013 04:00 AM, Ugo Borello wrote: >> Dear Carl, >> Thank you very much; it makes sense now. >> >> To quantify gene expression of my RNASeq samples I use your >> TxDb.Mmusculus.UCSC.mm9.knownGene annotation together with the alignments to >> the BSgenome.Mmusculus.UCSC.mm9 genome. >> Now that I used the UCSC mm9 mouse genome from the Illumina iGenomes I >> wanted to use their annotation. >> >> >> Anyway, I have now a general question. >> For the mouse mm9 genome (genome.fa) obtained from Illumina iGenomes >>> genome<- scanFaIndex('genome.fa') >>> seqlengths(genome) >> chr10 chr11 chr12 chr13 chr14 chr15 chr16 >> chr17 chr18 chr19 >> 129993255 121843856 121257530 120284312 125194864 103494974 98319150 >> 95272651 90772031 61342430 >> chr1 chr2 chr3 chr4 chr5 chr6 chr7 >> chr8 chr9 chrM >> 197195432 181748087 159599783 155630120 152537259 149517037 152524553 >> 131738871 124076172 16299 >> chrX chrY >> 166650296 15902555 >> >> For your TxDb.Mmusculus.UCSC.mm9.knownGene >>> seqlengths( TxDb.Mmusculus.UCSC.mm9.knownGene) >> chr1 chr2 chr3 chr4 chr5 >> chr6 chr7 chr8 chr9 >> 197195432 181748087 159599783 155630120 152537259 >> 149517037 152524553 131738871 124076172 >> chr10 chr11 chr12 chr13 chr14 >> chr15 chr16 chr17 chr18 >> 129993255 121843856 121257530 120284312 125194864 >> 103494974 98319150 95272651 90772031 >> chr19 chrX chrY chrM chr1_random >> chr3_random chr4_random chr5_random chr7_random >> 61342430 166650296 15902555 16299 1231697 >> 41899 160594 357350 362490 >> chr8_random chr9_random chr13_random chr16_random chr17_random >> chrX_random chrY_random chrUn_random >> 849593 449403 400311 3994 628739 >> 1785075 58682461 5900358 >> >> >> So. I want to filter out the '_random' stuff; is this the only and right way >> to do it? > > It depends on whether you want to use the _random stuff. My impression > of how people use this is that most people don't. So they would use > isActiveSeq() to toggle those off. > >>> ann<- TxDb.Mmusculus.UCSC.mm9.knownGene >>> isActiveSeq(ann)[seqlevels(ann)] <- FALSE >>> isActiveSeq(ann) <- c("chr10"=TRUE, "chr11"=TRUE, >> + "chr12"=TRUE,"chr13"=TRUE, >> + "chr14"=TRUE,"chr15"=TRUE, >> + "chr16"=TRUE, "chr17"=TRUE, >> + "chr18"=TRUE, "chr19"=TRUE, >> + "chr1"=TRUE, "chr2"=TRUE, >> + "chr3"=TRUE, "chr4"=TRUE, >> + "chr5"=TRUE, "chr6"=TRUE, >> + "chr7"=TRUE, "chr8"=TRUE, >> + "chr9"=TRUE, "chrM"=TRUE, >> + "chrX"=TRUE, "chrY"=TRUE) >> Then get my gene info this way? >>> genesInfo<- exons(ann, columns='gene_id') >> How can I add also gene names or symbols? > Again, it depends on what you want to do. If you are dealing with just > a TranscriptDb like this, then there are not gene symbols attached > already. So for that object yes, you would need to do it like that. > > > Now if you wanted gene symbols they are pretty easy to get. You can go > and get gene symbols by loading an org package like this: > > library(org.Mm.eg.db) > cols(org.Mm.eg.db) > > And then you could use select() to retrieve those (by using the gene > IDs as keys from your TranscriptDb object). But some people find this > extra step to be inconvenient, so I have created another route. And > that is to use a OrganismDb object. There is a package for this > already that I will use to demo here: > > library(Mus.musculus) > cols(Mus.musculus) > > You will notice that this kind of object has all the stuff you want in > one place. I suspect that you will find that to be a lot more > convenient: > > This basically means that you can do the thing you were interested in > like this: > > genesInfo <- exons(Mus.musculus, columns=c("GENEID","SYMBOL")) > > *But* in your case, you are using mm9, so you will want to make a > custom object (the Mus.musculus object is based on mm10). This is not > hard to do, but it is an extra step. You can read how to do it in > section 2 of the following vignette: > > http://www.bioconductor.org/packages/2.12/bioc/vignettes/OrganismDbi /inst/doc/OrganismDbi.pdf > > > >> >> >> >> Thank you again for your help. >> >> Ugo >> >> >> >> >>> From: Marc Carlson <mcarlson at="" fhcrc.org=""> >>> Date: Thu, 02 May 2013 15:52:38 -0700 >>> To: Ugo Borello <ugo.borello at="" inserm.fr=""> >>> Cc: <bioconductor at="" r-project.org=""> >>> Subject: Re: [BioC] MakeTranscriptDbFromGFF >>> >>> Hi Ugo, >>> >>> The 15 GB tarball you sent me to contains several different GTF files >>> for genes. I grabbed this one as it seemed to be the most recent: >>> >>> Mus_musculus/UCSC/mm9/Annotation/Archives/archive-2013-03-06-15-01 -24/Genes/ge >>> nes.gtf >>> >>> So looking at this file I can reproduce the problem you mentioned. And >>> it shows me two problems. The 1st problem is that the only field that >>> seems to contain any information about exon positions is called phase >>> (and not "exon_number" as was in the code I see from before). There is >>> not actually any field called "exon_number" in this file. Either way, >>> one thing you can check is to make sure that the string you give here is >>> the same as the appropriate field name that is used by the file. There >>> is no way to know this information in advance since GTF does not specify >>> how to encode this information (and in fact the information is entirely >>> optional). >>> >>> The second problem is that even "phase" can't work right now since the >>> authors of this gtf file have decided to only associate the exon rank >>> information only with CDS and never with exons features. So there is >>> not any actual 'exon' position information in this file, only >>> information for CDS positions. Now that I see people doing these files >>> in this way, I plan to enhance the parser so that it can process files >>> of this kind. >>> >>> >>> Is there a reason why you wanted to use this file and not the data >>> contained in this package here? >>> >>> http://www.bioconductor.org/packages/2.12/data/annotation/html/TxD b.Mmusculus. >>> UCSC.mm9.knownGene.html >>> >>> >>> Marc >>> >>> >>> >>> On 05/02/2013 02:59 AM, Ugo Borello wrote: >>>> Dear Marc >>>> Sorry I was not precise on the origin of the gtf annotation file; >>>> I got the >>>> gtf file from here: >>>> http://tophat.cbcb.umd.edu/igenomes.shtml >>>> >>>> And more precisely from the Mus musculus/UCSC/mm9 folder >>>> Here the description of the content of the folder: >>>> ftp://igenome:G3nom3s4u at ussd-ftp.illumina.com/README.txt >>>> >>>> I realized reading the README.txt file that actually the genes.gtf file I >>>> used is the Ensembl annotation of the mm9 release. >>>> So, I changed dataSource = "Ensembl" in the function call and I >>>> got the same >>>> error message: >>>> Error in data.frame(..., check.names = FALSE) : >>>> arguments imply differing number of rows: 541775, 0 >>>> >>>> At the end of my previous email you have the result of calling: >>>> annFile<- import.gff('genes.gtf', format='gtf', asRangedData=FALSE) >>>> >>>> >>>> Thank you >>>> >>>> Ugo >>>> >>>>> From: Marc Carlson <mcarlson at="" fhcrc.org=""> >>>>> Date: Wed, 01 May 2013 15:33:02 -0700 >>>>> To: <bioconductor at="" r-project.org=""> >>>>> Subject: Re: [BioC] MakeTranscriptDbFromGFF >>>>> >>>>> Hi Ugo, >>>>> >>>>> Which UCSC file was it that you were trying to process? >>>>> >>>>> >>>>> Marc >>>>> >>>>> >>>>> >>>>> On 05/01/2013 02:21 AM, Ugo Borello wrote: >>>>>> Good morning, >>>>>> >>>>>> I have a little problem creating a TranscriptDb object using >>>>>> the function >>>>>> makeTranscriptDbFromGFF. I want to use this annotation to count the >>>>>> overlaps >>>>>> of my genomic alignments with genes. >>>>>> >>>>>> >>>>>> I ran: >>>>>> >>>>>>> txdb <-makeTranscriptDbFromGFF(file = annFile, format = "gtf", >>>>>> + exonRankAttributeName = "exon_number", >>>>>> + chrominfo = chrominfo, >>>>>> + dataSource = "UCSC", >>>>>> + species = "Mus musculus") >>>>>> >>>>>> And I got this error message: >>>>>> Error in data.frame(..., check.names = FALSE) : >>>>>> arguments imply differing number of rows: 541775, 0 >>>>>> >>>>>> >>>>>> "chrominfo" was (info retrieved from the fasta genome file): >>>>>> >>>>>>> chrominfo >>>>>> chrom length is_circular >>>>>> 1 chr10 129993255 FALSE >>>>>> 2 chr11 121843856 FALSE >>>>>> 3 chr12 121257530 FALSE >>>>>> 4 chr13 120284312 FALSE >>>>>> 5 chr14 125194864 FALSE >>>>>> 6 chr15 103494974 FALSE >>>>>> 7 chr16 98319150 FALSE >>>>>> 8 chr17 95272651 FALSE >>>>>> 9 chr18 90772031 FALSE >>>>>> 10 chr19 61342430 FALSE >>>>>> 11 chr1 197195432 FALSE >>>>>> 12 chr2 181748087 FALSE >>>>>> 13 chr3 159599783 FALSE >>>>>> 14 chr4 155630120 FALSE >>>>>> 15 chr5 152537259 FALSE >>>>>> 16 chr6 149517037 FALSE >>>>>> 17 chr7 152524553 FALSE >>>>>> 18 chr8 131738871 FALSE >>>>>> 19 chr9 124076172 FALSE >>>>>> 20 chrM 16299 TRUE >>>>>> 21 chrX 166650296 FALSE >>>>>> 22 chrY 15902555 FALSE >>>>>> >>>>>> >>>>>> I ran it again without the "exonRankAttributeName" argument and I got: >>>>>> >>>>>>> txdb <-makeTranscriptDbFromGFF(file = annFile, format = "gtf", >>>>>> + chrominfo = chrominfo, >>>>>> + dataSource = "UCSC", >>>>>> + species = "Mus musculus") >>>>>> extracting transcript information >>>>>> Estimating transcript ranges. >>>>>> Extracting gene IDs >>>>>> Processing splicing information for gtf file. >>>>>> Deducing exon rank from relative coordinates provided >>>>>> Prepare the 'metadata' data frame ... metadata: OK >>>>>> Error in .checkForeignKey(transcripts_tx_chrom, NA, >>>>>> "transcripts$tx_chrom", >>>>>> : >>>>>> all the values in 'transcripts$tx_chrom' must be present in >>>>>> 'chrominfo$chrom' >>>>>> In addition: Warning message: >>>>>> In .deduceExonRankings(exs, format = "gtf") : >>>>>> Infering Exon Rankings. If this is not what you expected, >>>>>> then please >>>>>> be >>>>>> sure that you have provided a valid attribute for exonRankAttributeName >>>>>> >>>>>> >>>>>> Without the "chrominfo" argument I got the same error message >>>>>> as the first >>>>>> time: >>>>>> >>>>>>> txdb <-makeTranscriptDbFromGFF(file = annFile, format = "gtf", >>>>>> + exonRankAttributeName = "exon_number", >>>>>> + dataSource = "UCSC", >>>>>> + species = "Mus musculus") >>>>>> >>>>>> Error in data.frame(..., check.names = FALSE) : >>>>>> arguments imply differing number of rows: 541775, 0 >>>>>> >>>>>> >>>>>> Finally when I eliminated both the "exonRankAttributeName" and the >>>>>> "chrominfo" arguments it worked but the warning reminded me of the >>>>>> "exonRankAttributeName" argument and the chromosome names are >>>>>> now different >>>>>> from the ones in the genome file and there is no info on their length >>>>>> >>>>>>> txdb <-makeTranscriptDbFromGFF(file = annFile, format = "gtf", >>>>>> + dataSource = "UCSC", >>>>>> + species = "Mus musculus") >>>>>> extracting transcript information >>>>>> Estimating transcript ranges. >>>>>> Extracting gene IDs >>>>>> Processing splicing information for gtf file. >>>>>> Deducing exon rank from relative coordinates provided >>>>>> Prepare the 'metadata' data frame ... metadata: OK >>>>>> Now generating chrominfo from available sequence names. No chromosome >>>>>> length >>>>>> information is available. >>>>>> Warning messages: >>>>>> 1: In .deduceExonRankings(exs, format = "gtf") : >>>>>> Infering Exon Rankings. If this is not what you expected, >>>>>> then please >>>>>> be >>>>>> sure that you have provided a valid attribute for exonRankAttributeName >>>>>> 2: In matchCircularity(chroms, circ_seqs) : >>>>>> None of the strings in your circ_seqs argument match your seqnames. >>>>>> >>>>>>> seqinfo(txdb) >>>>>> Seqinfo of length 32 >>>>>> seqnames seqlengths isCircular genome >>>>>> chr13 <na> FALSE <na> >>>>>> chr9 <na> FALSE <na> >>>>>> chr6 <na> FALSE <na> >>>>>> chrX <na> FALSE <na> >>>>>> chr17 <na> FALSE <na> >>>>>> chr2 <na> FALSE <na> >>>>>> chr7 <na> FALSE <na> >>>>>> chr18 <na> FALSE <na> >>>>>> chr8 <na> FALSE <na> >>>>>> ... ... ... ... >>>>>> chrY_random <na> FALSE <na> >>>>>> chrX_random <na> FALSE <na> >>>>>> chr5_random <na> FALSE <na> >>>>>> chr4_random <na> FALSE <na> >>>>>> chrY <na> FALSE <na> >>>>>> chr7_random <na> FALSE <na> >>>>>> chr17_random <na> FALSE <na> >>>>>> chr13_random <na> FALSE <na> >>>>>> chr1_random <na> FALSE <na> >>>>>> >>>>>> >>>>>> >>>>>> >>>>>> What am I doing wrong in my original call to makeTranscriptDbFromGFF? >>>>>> >>>>>> txdb <-makeTranscriptDbFromGFF(file = annFile, format = "gtf", >>>>>> exonRankAttributeName = "exon_number", >>>>>> chrominfo = chrominfo, >>>>>> dataSource = "UCSC", >>>>>> species = "Mus musculus") >>>>>> >>>>>> Why am I getting this unfair error message? >>>>>> Thank you for your help >>>>>> Ugo >>>>>> >>>>>> >>>>>> >>>>>> >>>>>> >>>>>> >>>>>> >>>>>> >>>>>> FYI, this is my annFile (My gtf annotation file was downloaded together >>>>>> with a fasta file containing the mouse genome from UCSC): >>>>>> >>>>>>> annFile >>>>>> GRanges with 595632 ranges and 9 metadata columns: >>>>>> seqnames ranges strand | source >>>>>> type >>>>>> score phase >>>>>> <rle> <iranges> <rle> | <factor> >>>>>> <factor> >>>>>> <numeric> <integer> >>>>>> [1] chr1 [3204563, 3207049] - | unknown >>>>>> exon >>>>>> <na> <na> >>>>>> [2] chr1 [3206103, 3206105] - | unknown >>>>>> stop_codon >>>>>> <na> <na> >>>>>> [3] chr1 [3206106, 3207049] - | unknown >>>>>> CDS >>>>>> <na> 2 >>>>>> [4] chr1 [3411783, 3411982] - | unknown >>>>>> CDS >>>>>> <na> 1 >>>>>> [5] chr1 [3411783, 3411982] - | unknown >>>>>> exon >>>>>> <na> <na> >>>>>> ... ... ... ... ... ... >>>>>> ... >>>>>> ... ... >>>>>> [595628] chrY_random [54422360, 54422362] + | unknown >>>>>> stop_codon >>>>>> <na> <na> >>>>>> [595629] chrY_random [58501955, 58502946] + | unknown >>>>>> exon >>>>>> <na> <na> >>>>>> [595630] chrY_random [58502132, 58502812] + | unknown >>>>>> CDS >>>>>> <na> 0 >>>>>> [595631] chrY_random [58502132, 58502134] + | unknown >>>>>> start_codon >>>>>> <na> <na> >>>>>> [595632] chrY_random [58502813, 58502815] + | unknown >>>>>> stop_codon >>>>>> <na> <na> >>>>>> gene_id transcript_id gene_name p_id >>>>>> tss_id >>>>>> <character> <character> <character> <character> >>>>>> <character> >>>>>> [1] Xkr4 NM_001011874 Xkr4 P2739 >>>>>> TSS1881 >>>>>> [2] Xkr4 NM_001011874 Xkr4 P2739 >>>>>> TSS1881 >>>>>> [3] Xkr4 NM_001011874 Xkr4 P2739 >>>>>> TSS1881 >>>>>> [4] Xkr4 NM_001011874 Xkr4 P2739 >>>>>> TSS1881 >>>>>> [5] Xkr4 NM_001011874 Xkr4 P2739 >>>>>> TSS1881 >>>>>> ... ... ... ... ... >>>>>> ... >>>>>> [595628] LOC100039753 NM_001017394 LOC100039753 P10196 >>>>>> TSS19491 >>>>>> [595629] LOC100039614 NM_001160137_4 LOC100039614 P22060 >>>>>> TSS4342 >>>>>> [595630] LOC100039614 NM_001160137_4 LOC100039614 P22060 >>>>>> TSS4342 >>>>>> [595631] LOC100039614 NM_001160137_4 LOC100039614 P22060 >>>>>> TSS4342 >>>>>> [595632] LOC100039614 NM_001160137_4 LOC100039614 P22060 >>>>>> TSS4342 >>>>>> --- >>>>>> seqlengths: >>>>>> chr1 chr10 chr11 chr12 ... >>>>>> chrX_random >>>>>> chrY chrY_random >>>>>> NA NA NA NA ... NA >>>>>> NA NA >>>>>> >>>>>> >>>>>>> sessionInfo() >>>>>> R version 3.0.0 (2013-04-03) >>>>>> Platform: x86_64-apple-darwin10.8.0 (64-bit) >>>>>> >>>>>> locale: >>>>>> [1] en_US.UTF-8/en_US.UTF-8/en_US.UTF-8/C/en_US.UTF-8/en_US.UTF-8 >>>>>> >>>>>> attached base packages: >>>>>> [1] parallel stats graphics grDevices utils datasets methods >>>>>> base >>>>>> >>>>>> other attached packages: >>>>>> [1] rtracklayer_1.20.1 Rbowtie_1.0.2 Rsamtools_1.12.2 >>>>>> Biostrings_2.28.0 >>>>>> [5] GenomicFeatures_1.12.0 AnnotationDbi_1.22.1 Biobase_2.20.0 >>>>>> GenomicRanges_1.12.2 >>>>>> [9] IRanges_1.18.0 BiocGenerics_0.6.0 >>>>>> >>>>>> loaded via a namespace (and not attached): >>>>>> [1] BiocInstaller_1.10.0 biomaRt_2.16.0 bitops_1.0-5 >>>>>> BSgenome_1.28.0 >>>>>> [5] DBI_0.2-5 grid_3.0.0 hwriter_1.3 >>>>>> lattice_0.20-15 >>>>>> [9] RCurl_1.95-4.1 RSQLite_0.11.3 ShortRead_1.18.0 >>>>>> stats4_3.0.0 >>>>>> [13] tools_3.0.0 XML_3.95-0.2 zlibbioc_1.6.0 >>>>>> >>>>>> _______________________________________________ >>>>>> Bioconductor mailing list >>>>>> Bioconductor at r-project.org >>>>>> https://stat.ethz.ch/mailman/listinfo/bioconductor >>>>>> Search the archives: >>>>>> http://news.gmane.org/gmane.science.biology.informatics.conductor >>>>> _______________________________________________ >>>>> Bioconductor mailing list >>>>> Bioconductor at r-project.org >>>>> https://stat.ethz.ch/mailman/listinfo/bioconductor >>>>> Search the archives: >>>>> http://news.gmane.org/gmane.science.biology.informatics.conductor >>
ADD REPLYlink written 4.6 years ago by Ugo Borello340
Hi Ugo, This worked for me: library(OrganismDbi) gd <- list(join1 = c(GO.db="GOID", org.Mm.eg.db="GO"), join2 = c(org.Mm.eg.db="ENTREZID", TxDb.Mmusculus.UCSC.mm10.knownGene="GENEID")) makeOrganismPackage(pkgname = "Mus.musculus.mm9", graphData = gd, organism = "Mus musculus", version = "1.0", maintainer ="You <maintainer at="" someplace.org="">", author = "You", destDir = ".", license = "Artistic-2.0") Then I can do stuff like this (after I run R CMD INSTALL Mus.musculus.mm9) : library(Mus.musculus.mm9) select(Mus.musculus.mm9, head(keys(Mus.musculus.mm9,keytype="ENTREZID")),c("SYMBOL","TXSTRAND") ,keytype="ENTREZID") txs <- transcripts(Mus.musculus.mm9, columns="SYMBOL") Etc. As for your other question, that error means that the chromosome names in your input file are not all present in the chrominfo that you have used. Usually this means that there is a row that says something like chrom8_random (or something like that) as it's chromosome name in the input file. And you don't have that in your chrominfo table, so the database is stopping, because it doesn't have any way to know what chrom8_random is? To fix it, you can either remove unwanted rows from your input file or you can add new rows to your chrominfo object. Marc On 05/04/2013 07:18 AM, Ugo Borello wrote: > Dear Marc, > I apologize for bothering you again, but I was intrigued by the custom > Mus.musculus.mm9 object. It sounds very convenient indeed. > so I tried, as you suggested: > > gd<-c(org.Mm.eg.db='SYMBOL', > TxDb.Mmusculus.UCSC.mm9.knownGene="GENEID") > ## I also tried to set org.Mm.eg.db='ENTREZID' > > destination <- tempfile() > dir.create(destination) > makeOrganismPackage(pkgname = "Mus.musculus.mm9", > graphData = gd, > organism = "Mus musculus", > version = "1.0.0", > maintainer = "Package > Maintainer<maintainer at="" somewhere.com="">", > author = "SomeBody", > destDir = destination, > license = "Artistic-2.0") > > and I got: > Error in cbind(pkgs, keys) : > number of rows of matrices must match (see arg 2) > > I am surely missing something here. Any suggestions? > > > And, coming back to makeTranscriptDbFromGFF. > I obtained my gtf annotation file from UCSC using genePredToGtf, as > the UCSC people suggest > (http://genomewiki.ucsc.edu/index.php/Genes_in_gtf_or_gff_format) > > txdb <-makeTranscriptDbFromGFF(file = 'mm9.gtf', format = "gtf", > + exonRankAttributeName = "exon_number", > + chrominfo = chrominfo, > + dataSource = "UCSC", > + species = "Mus musculus") > extracting transcript information > Estimating transcript ranges. > Extracting gene IDs > Processing splicing information for gtf file. > Prepare the 'metadata' data frame ... metadata: OK > Error in .checkForeignKey(transcripts_tx_chrom, NA, > "transcripts$tx_chrom", : > all the values in 'transcripts$tx_chrom' must be present in > 'chrominfo$chrom' > > Why I get this error now? > >> chrominfo > chrom length is_circular > 1 chr10 129993255 FALSE > 2 chr11 121843856 FALSE > 3 chr12 121257530 FALSE > 4 chr13 120284312 FALSE > 5 chr14 125194864 FALSE > 6 chr15 103494974 FALSE > 7 chr16 98319150 FALSE > 8 chr17 95272651 FALSE > 9 chr18 90772031 FALSE > 10 chr19 61342430 FALSE > 11 chr1 197195432 FALSE > 12 chr2 181748087 FALSE > 13 chr3 159599783 FALSE > 14 chr4 155630120 FALSE > 15 chr5 152537259 FALSE > 16 chr6 149517037 FALSE > 17 chr7 152524553 FALSE > 18 chr8 131738871 FALSE > 19 chr9 124076172 FALSE > 20 chrM 16299 FALSE > 21 chrX 166650296 FALSE > 22 chrY 15902555 FALSE > > Thank you very much for your time and patience. > > Ugo > > > > > P.S. > This is the annotation >> annFile3<- import.gff('mm9.gtf', format='gtf', asRangedData=FALSE) >> annFile3 > GRanges with 962651 ranges and 9 metadata columns: > seqnames ranges strand | source > type score phase gene_id transcript_id > <rle> <iranges> <rle> | <factor> > <factor> <numeric> <integer> <character> <character> > [1] chr1 [3195985, 3197398] - | knownGene > exon <na> <na> uc007aet.1 uc007aet.1 > [2] chr1 [3203520, 3205713] - | knownGene > exon <na> <na> uc007aet.1 uc007aet.1 > [3] chr1 [3204563, 3207049] - | knownGene > exon <na> <na> Q5GH67 uc007aeu.1 > [4] chr1 [3206106, 3207049] - | knownGene > CDS <na> 2 Q5GH67 uc007aeu.1 > [5] chr1 [3411783, 3411982] - | knownGene > exon <na> <na> Q5GH67 uc007aeu.1 > ... ... ... ... ... ... > ... ... ... ... ... > [962647] chrY_random [58502132, 58502365] + | knownGene > CDS <na> 0 Q62458 uc012htl.1 > [962648] chrY_random [58502369, 58502946] + | knownGene > exon <na> <na> Q62458 uc012htl.1 > [962649] chrY_random [58502369, 58502812] + | knownGene > CDS <na> 0 Q62458 uc012htl.1 > [962650] chrY_random [58502132, 58502134] + | knownGene > start_codon <na> 0 Q62458 uc012htl.1 > [962651] chrY_random [58502813, 58502815] + | knownGene > stop_codon <na> 0 Q62458 uc012htl.1 > exon_number exon_id gene_name > <numeric> <character> <character> > [1] 1 uc007aet.1.1 <na> > [2] 2 uc007aet.1.2 <na> > [3] 1 uc007aeu.1.1 Q5GH67 > [4] 1 uc007aeu.1.1 Q5GH67 > [5] 2 uc007aeu.1.2 Q5GH67 > ... ... ... ... > [962647] 1 uc012htl.1.1 Q62458 > [962648] 2 uc012htl.1.2 Q62458 > [962649] 2 uc012htl.1.2 Q62458 > [962650] 1 uc012htl.1.1 Q62458 > [962651] 1 uc012htl.1.1 Q62458 > --- > seqlengths: > chr1 chr10 chr11 chr12 chr13 > ... chrX chrX_random chrY chrY_random > NA NA NA NA NA > ... NA > > > Quoting Marc Carlson <mcarlson at="" fhcrc.org="">: > >> Hi Ugo, >> >> >> On 05/03/2013 04:00 AM, Ugo Borello wrote: >>> Dear Carl, >>> Thank you very much; it makes sense now. >>> >>> To quantify gene expression of my RNASeq samples I use your >>> TxDb.Mmusculus.UCSC.mm9.knownGene annotation together with the >>> alignments to >>> the BSgenome.Mmusculus.UCSC.mm9 genome. >>> Now that I used the UCSC mm9 mouse genome from the Illumina iGenomes I >>> wanted to use their annotation. >>> >>> >>> Anyway, I have now a general question. >>> For the mouse mm9 genome (genome.fa) obtained from Illumina iGenomes >>>> genome<- scanFaIndex('genome.fa') >>>> seqlengths(genome) >>> chr10 chr11 chr12 chr13 chr14 chr15 chr16 >>> chr17 chr18 chr19 >>> 129993255 121843856 121257530 120284312 125194864 103494974 98319150 >>> 95272651 90772031 61342430 >>> chr1 chr2 chr3 chr4 chr5 chr6 chr7 >>> chr8 chr9 chrM >>> 197195432 181748087 159599783 155630120 152537259 149517037 152524553 >>> 131738871 124076172 16299 >>> chrX chrY >>> 166650296 15902555 >>> >>> For your TxDb.Mmusculus.UCSC.mm9.knownGene >>>> seqlengths( TxDb.Mmusculus.UCSC.mm9.knownGene) >>> chr1 chr2 chr3 chr4 chr5 >>> chr6 chr7 chr8 chr9 >>> 197195432 181748087 159599783 155630120 152537259 >>> 149517037 152524553 131738871 124076172 >>> chr10 chr11 chr12 chr13 chr14 >>> chr15 chr16 chr17 chr18 >>> 129993255 121843856 121257530 120284312 125194864 >>> 103494974 98319150 95272651 90772031 >>> chr19 chrX chrY chrM chr1_random >>> chr3_random chr4_random chr5_random chr7_random >>> 61342430 166650296 15902555 16299 1231697 >>> 41899 160594 357350 362490 >>> chr8_random chr9_random chr13_random chr16_random chr17_random >>> chrX_random chrY_random chrUn_random >>> 849593 449403 400311 3994 628739 >>> 1785075 58682461 5900358 >>> >>> >>> So. I want to filter out the '_random' stuff; is this the only and >>> right way >>> to do it? >> >> It depends on whether you want to use the _random stuff. My impression >> of how people use this is that most people don't. So they would use >> isActiveSeq() to toggle those off. >> >>>> ann<- TxDb.Mmusculus.UCSC.mm9.knownGene >>>> isActiveSeq(ann)[seqlevels(ann)] <- FALSE >>>> isActiveSeq(ann) <- c("chr10"=TRUE, "chr11"=TRUE, >>> + "chr12"=TRUE,"chr13"=TRUE, >>> + "chr14"=TRUE,"chr15"=TRUE, >>> + "chr16"=TRUE, "chr17"=TRUE, >>> + "chr18"=TRUE, "chr19"=TRUE, >>> + "chr1"=TRUE, "chr2"=TRUE, >>> + "chr3"=TRUE, "chr4"=TRUE, >>> + "chr5"=TRUE, "chr6"=TRUE, >>> + "chr7"=TRUE, "chr8"=TRUE, >>> + "chr9"=TRUE, "chrM"=TRUE, >>> + "chrX"=TRUE, "chrY"=TRUE) >>> Then get my gene info this way? >>>> genesInfo<- exons(ann, columns='gene_id') >>> How can I add also gene names or symbols? >> Again, it depends on what you want to do. If you are dealing with just >> a TranscriptDb like this, then there are not gene symbols attached >> already. So for that object yes, you would need to do it like that. >> >> >> Now if you wanted gene symbols they are pretty easy to get. You can go >> and get gene symbols by loading an org package like this: >> >> library(org.Mm.eg.db) >> cols(org.Mm.eg.db) >> >> And then you could use select() to retrieve those (by using the gene >> IDs as keys from your TranscriptDb object). But some people find this >> extra step to be inconvenient, so I have created another route. And >> that is to use a OrganismDb object. There is a package for this >> already that I will use to demo here: >> >> library(Mus.musculus) >> cols(Mus.musculus) >> >> You will notice that this kind of object has all the stuff you want in >> one place. I suspect that you will find that to be a lot more >> convenient: >> >> This basically means that you can do the thing you were interested in >> like this: >> >> genesInfo <- exons(Mus.musculus, columns=c("GENEID","SYMBOL")) >> >> *But* in your case, you are using mm9, so you will want to make a >> custom object (the Mus.musculus object is based on mm10). This is not >> hard to do, but it is an extra step. You can read how to do it in >> section 2 of the following vignette: >> >> http://www.bioconductor.org/packages/2.12/bioc/vignettes/OrganismDb i/inst/doc/OrganismDbi.pdf >> >> >> >> >>> >>> >>> >>> Thank you again for your help. >>> >>> Ugo >>> >>> >>> >>> >>>> From: Marc Carlson <mcarlson at="" fhcrc.org=""> >>>> Date: Thu, 02 May 2013 15:52:38 -0700 >>>> To: Ugo Borello <ugo.borello at="" inserm.fr=""> >>>> Cc: <bioconductor at="" r-project.org=""> >>>> Subject: Re: [BioC] MakeTranscriptDbFromGFF >>>> >>>> Hi Ugo, >>>> >>>> The 15 GB tarball you sent me to contains several different GTF files >>>> for genes. I grabbed this one as it seemed to be the most recent: >>>> >>>> Mus_musculus/UCSC/mm9/Annotation/Archives/archive-2013-03-06-15-0 1-24/Genes/ge >>>> >>>> nes.gtf >>>> >>>> So looking at this file I can reproduce the problem you mentioned. >>>> And >>>> it shows me two problems. The 1st problem is that the only field that >>>> seems to contain any information about exon positions is called phase >>>> (and not "exon_number" as was in the code I see from before). >>>> There is >>>> not actually any field called "exon_number" in this file. Either way, >>>> one thing you can check is to make sure that the string you give >>>> here is >>>> the same as the appropriate field name that is used by the file. >>>> There >>>> is no way to know this information in advance since GTF does not >>>> specify >>>> how to encode this information (and in fact the information is >>>> entirely >>>> optional). >>>> >>>> The second problem is that even "phase" can't work right now since the >>>> authors of this gtf file have decided to only associate the exon rank >>>> information only with CDS and never with exons features. So there is >>>> not any actual 'exon' position information in this file, only >>>> information for CDS positions. Now that I see people doing these >>>> files >>>> in this way, I plan to enhance the parser so that it can process files >>>> of this kind. >>>> >>>> >>>> Is there a reason why you wanted to use this file and not the data >>>> contained in this package here? >>>> >>>> http://www.bioconductor.org/packages/2.12/data/annotation/html/Tx Db.Mmusculus. >>>> >>>> UCSC.mm9.knownGene.html >>>> >>>> >>>> Marc >>>> >>>> >>>> >>>> On 05/02/2013 02:59 AM, Ugo Borello wrote: >>>>> Dear Marc >>>>> Sorry I was not precise on the origin of the gtf annotation file; >>>>> I got the >>>>> gtf file from here: >>>>> http://tophat.cbcb.umd.edu/igenomes.shtml >>>>> >>>>> And more precisely from the Mus musculus/UCSC/mm9 folder >>>>> Here the description of the content of the folder: >>>>> ftp://igenome:G3nom3s4u at ussd-ftp.illumina.com/README.txt >>>>> >>>>> I realized reading the README.txt file that actually the genes.gtf >>>>> file I >>>>> used is the Ensembl annotation of the mm9 release. >>>>> So, I changed dataSource = "Ensembl" in the function call and I >>>>> got the same >>>>> error message: >>>>> Error in data.frame(..., check.names = FALSE) : >>>>> arguments imply differing number of rows: 541775, 0 >>>>> >>>>> At the end of my previous email you have the result of calling: >>>>> annFile<- import.gff('genes.gtf', format='gtf', asRangedData=FALSE) >>>>> >>>>> >>>>> Thank you >>>>> >>>>> Ugo >>>>> >>>>>> From: Marc Carlson <mcarlson at="" fhcrc.org=""> >>>>>> Date: Wed, 01 May 2013 15:33:02 -0700 >>>>>> To: <bioconductor at="" r-project.org=""> >>>>>> Subject: Re: [BioC] MakeTranscriptDbFromGFF >>>>>> >>>>>> Hi Ugo, >>>>>> >>>>>> Which UCSC file was it that you were trying to process? >>>>>> >>>>>> >>>>>> Marc >>>>>> >>>>>> >>>>>> >>>>>> On 05/01/2013 02:21 AM, Ugo Borello wrote: >>>>>>> Good morning, >>>>>>> >>>>>>> I have a little problem creating a TranscriptDb object using >>>>>>> the function >>>>>>> makeTranscriptDbFromGFF. I want to use this annotation to count the >>>>>>> overlaps >>>>>>> of my genomic alignments with genes. >>>>>>> >>>>>>> >>>>>>> I ran: >>>>>>> >>>>>>>> txdb <-makeTranscriptDbFromGFF(file = annFile, format = "gtf", >>>>>>> + exonRankAttributeName = "exon_number", >>>>>>> + chrominfo = chrominfo, >>>>>>> + dataSource = "UCSC", >>>>>>> + species = "Mus musculus") >>>>>>> >>>>>>> And I got this error message: >>>>>>> Error in data.frame(..., check.names = FALSE) : >>>>>>> arguments imply differing number of rows: 541775, 0 >>>>>>> >>>>>>> >>>>>>> "chrominfo" was (info retrieved from the fasta genome file): >>>>>>> >>>>>>>> chrominfo >>>>>>> chrom length is_circular >>>>>>> 1 chr10 129993255 FALSE >>>>>>> 2 chr11 121843856 FALSE >>>>>>> 3 chr12 121257530 FALSE >>>>>>> 4 chr13 120284312 FALSE >>>>>>> 5 chr14 125194864 FALSE >>>>>>> 6 chr15 103494974 FALSE >>>>>>> 7 chr16 98319150 FALSE >>>>>>> 8 chr17 95272651 FALSE >>>>>>> 9 chr18 90772031 FALSE >>>>>>> 10 chr19 61342430 FALSE >>>>>>> 11 chr1 197195432 FALSE >>>>>>> 12 chr2 181748087 FALSE >>>>>>> 13 chr3 159599783 FALSE >>>>>>> 14 chr4 155630120 FALSE >>>>>>> 15 chr5 152537259 FALSE >>>>>>> 16 chr6 149517037 FALSE >>>>>>> 17 chr7 152524553 FALSE >>>>>>> 18 chr8 131738871 FALSE >>>>>>> 19 chr9 124076172 FALSE >>>>>>> 20 chrM 16299 TRUE >>>>>>> 21 chrX 166650296 FALSE >>>>>>> 22 chrY 15902555 FALSE >>>>>>> >>>>>>> >>>>>>> I ran it again without the "exonRankAttributeName" argument and >>>>>>> I got: >>>>>>> >>>>>>>> txdb <-makeTranscriptDbFromGFF(file = annFile, format = "gtf", >>>>>>> + chrominfo = chrominfo, >>>>>>> + dataSource = "UCSC", >>>>>>> + species = "Mus musculus") >>>>>>> extracting transcript information >>>>>>> Estimating transcript ranges. >>>>>>> Extracting gene IDs >>>>>>> Processing splicing information for gtf file. >>>>>>> Deducing exon rank from relative coordinates provided >>>>>>> Prepare the 'metadata' data frame ... metadata: OK >>>>>>> Error in .checkForeignKey(transcripts_tx_chrom, NA, >>>>>>> "transcripts$tx_chrom", >>>>>>> : >>>>>>> all the values in 'transcripts$tx_chrom' must be present in >>>>>>> 'chrominfo$chrom' >>>>>>> In addition: Warning message: >>>>>>> In .deduceExonRankings(exs, format = "gtf") : >>>>>>> Infering Exon Rankings. If this is not what you expected, >>>>>>> then please >>>>>>> be >>>>>>> sure that you have provided a valid attribute for >>>>>>> exonRankAttributeName >>>>>>> >>>>>>> >>>>>>> Without the "chrominfo" argument I got the same error message >>>>>>> as the first >>>>>>> time: >>>>>>> >>>>>>>> txdb <-makeTranscriptDbFromGFF(file = annFile, format = "gtf", >>>>>>> + exonRankAttributeName = "exon_number", >>>>>>> + dataSource = "UCSC", >>>>>>> + species = "Mus musculus") >>>>>>> >>>>>>> Error in data.frame(..., check.names = FALSE) : >>>>>>> arguments imply differing number of rows: 541775, 0 >>>>>>> >>>>>>> >>>>>>> Finally when I eliminated both the "exonRankAttributeName" and the >>>>>>> "chrominfo" arguments it worked but the warning reminded me of the >>>>>>> "exonRankAttributeName" argument and the chromosome names are >>>>>>> now different >>>>>>> from the ones in the genome file and there is no info on their >>>>>>> length >>>>>>> >>>>>>>> txdb <-makeTranscriptDbFromGFF(file = annFile, format = "gtf", >>>>>>> + dataSource = "UCSC", >>>>>>> + species = "Mus musculus") >>>>>>> extracting transcript information >>>>>>> Estimating transcript ranges. >>>>>>> Extracting gene IDs >>>>>>> Processing splicing information for gtf file. >>>>>>> Deducing exon rank from relative coordinates provided >>>>>>> Prepare the 'metadata' data frame ... metadata: OK >>>>>>> Now generating chrominfo from available sequence names. No >>>>>>> chromosome >>>>>>> length >>>>>>> information is available. >>>>>>> Warning messages: >>>>>>> 1: In .deduceExonRankings(exs, format = "gtf") : >>>>>>> Infering Exon Rankings. If this is not what you expected, >>>>>>> then please >>>>>>> be >>>>>>> sure that you have provided a valid attribute for >>>>>>> exonRankAttributeName >>>>>>> 2: In matchCircularity(chroms, circ_seqs) : >>>>>>> None of the strings in your circ_seqs argument match your >>>>>>> seqnames. >>>>>>> >>>>>>>> seqinfo(txdb) >>>>>>> Seqinfo of length 32 >>>>>>> seqnames seqlengths isCircular genome >>>>>>> chr13 <na> FALSE <na> >>>>>>> chr9 <na> FALSE <na> >>>>>>> chr6 <na> FALSE <na> >>>>>>> chrX <na> FALSE <na> >>>>>>> chr17 <na> FALSE <na> >>>>>>> chr2 <na> FALSE <na> >>>>>>> chr7 <na> FALSE <na> >>>>>>> chr18 <na> FALSE <na> >>>>>>> chr8 <na> FALSE <na> >>>>>>> ... ... ... ... >>>>>>> chrY_random <na> FALSE <na> >>>>>>> chrX_random <na> FALSE <na> >>>>>>> chr5_random <na> FALSE <na> >>>>>>> chr4_random <na> FALSE <na> >>>>>>> chrY <na> FALSE <na> >>>>>>> chr7_random <na> FALSE <na> >>>>>>> chr17_random <na> FALSE <na> >>>>>>> chr13_random <na> FALSE <na> >>>>>>> chr1_random <na> FALSE <na> >>>>>>> >>>>>>> >>>>>>> >>>>>>> >>>>>>> What am I doing wrong in my original call to >>>>>>> makeTranscriptDbFromGFF? >>>>>>> >>>>>>> txdb <-makeTranscriptDbFromGFF(file = annFile, format = "gtf", >>>>>>> exonRankAttributeName = >>>>>>> "exon_number", >>>>>>> chrominfo = chrominfo, >>>>>>> dataSource = "UCSC", >>>>>>> species = "Mus musculus") >>>>>>> >>>>>>> Why am I getting this unfair error message? >>>>>>> Thank you for your help >>>>>>> Ugo >>>>>>> >>>>>>> >>>>>>> >>>>>>> >>>>>>> >>>>>>> >>>>>>> >>>>>>> >>>>>>> FYI, this is my annFile (My gtf annotation file was downloaded >>>>>>> together >>>>>>> with a fasta file containing the mouse genome from UCSC): >>>>>>> >>>>>>>> annFile >>>>>>> GRanges with 595632 ranges and 9 metadata columns: >>>>>>> seqnames ranges strand | source >>>>>>> type >>>>>>> score phase >>>>>>> <rle> <iranges> <rle> | <factor> >>>>>>> <factor> >>>>>>> <numeric> <integer> >>>>>>> [1] chr1 [3204563, 3207049] - | unknown >>>>>>> exon >>>>>>> <na> <na> >>>>>>> [2] chr1 [3206103, 3206105] - | unknown >>>>>>> stop_codon >>>>>>> <na> <na> >>>>>>> [3] chr1 [3206106, 3207049] - | unknown >>>>>>> CDS >>>>>>> <na> 2 >>>>>>> [4] chr1 [3411783, 3411982] - | unknown >>>>>>> CDS >>>>>>> <na> 1 >>>>>>> [5] chr1 [3411783, 3411982] - | unknown >>>>>>> exon >>>>>>> <na> <na> >>>>>>> ... ... ... ... ... ... >>>>>>> ... >>>>>>> ... ... >>>>>>> [595628] chrY_random [54422360, 54422362] + | unknown >>>>>>> stop_codon >>>>>>> <na> <na> >>>>>>> [595629] chrY_random [58501955, 58502946] + | unknown >>>>>>> exon >>>>>>> <na> <na> >>>>>>> [595630] chrY_random [58502132, 58502812] + | unknown >>>>>>> CDS >>>>>>> <na> 0 >>>>>>> [595631] chrY_random [58502132, 58502134] + | unknown >>>>>>> start_codon >>>>>>> <na> <na> >>>>>>> [595632] chrY_random [58502813, 58502815] + | unknown >>>>>>> stop_codon >>>>>>> <na> <na> >>>>>>> gene_id transcript_id gene_name p_id >>>>>>> tss_id >>>>>>> <character> <character> <character> <character> >>>>>>> <character> >>>>>>> [1] Xkr4 NM_001011874 Xkr4 P2739 >>>>>>> TSS1881 >>>>>>> [2] Xkr4 NM_001011874 Xkr4 P2739 >>>>>>> TSS1881 >>>>>>> [3] Xkr4 NM_001011874 Xkr4 P2739 >>>>>>> TSS1881 >>>>>>> [4] Xkr4 NM_001011874 Xkr4 P2739 >>>>>>> TSS1881 >>>>>>> [5] Xkr4 NM_001011874 Xkr4 P2739 >>>>>>> TSS1881 >>>>>>> ... ... ... ... ... >>>>>>> ... >>>>>>> [595628] LOC100039753 NM_001017394 LOC100039753 P10196 >>>>>>> TSS19491 >>>>>>> [595629] LOC100039614 NM_001160137_4 LOC100039614 P22060 >>>>>>> TSS4342 >>>>>>> [595630] LOC100039614 NM_001160137_4 LOC100039614 P22060 >>>>>>> TSS4342 >>>>>>> [595631] LOC100039614 NM_001160137_4 LOC100039614 P22060 >>>>>>> TSS4342 >>>>>>> [595632] LOC100039614 NM_001160137_4 LOC100039614 P22060 >>>>>>> TSS4342 >>>>>>> --- >>>>>>> seqlengths: >>>>>>> chr1 chr10 chr11 chr12 ... chrX_random >>>>>>> chrY chrY_random >>>>>>> NA NA NA NA ... NA >>>>>>> NA NA >>>>>>> >>>>>>> >>>>>>>> sessionInfo() >>>>>>> R version 3.0.0 (2013-04-03) >>>>>>> Platform: x86_64-apple-darwin10.8.0 (64-bit) >>>>>>> >>>>>>> locale: >>>>>>> [1] en_US.UTF-8/en_US.UTF-8/en_US.UTF-8/C/en_US.UTF-8/en_US.UTF-8 >>>>>>> >>>>>>> attached base packages: >>>>>>> [1] parallel stats graphics grDevices utils datasets methods >>>>>>> base >>>>>>> >>>>>>> other attached packages: >>>>>>> [1] rtracklayer_1.20.1 Rbowtie_1.0.2 Rsamtools_1.12.2 >>>>>>> Biostrings_2.28.0 >>>>>>> [5] GenomicFeatures_1.12.0 AnnotationDbi_1.22.1 Biobase_2.20.0 >>>>>>> GenomicRanges_1.12.2 >>>>>>> [9] IRanges_1.18.0 BiocGenerics_0.6.0 >>>>>>> >>>>>>> loaded via a namespace (and not attached): >>>>>>> [1] BiocInstaller_1.10.0 biomaRt_2.16.0 bitops_1.0-5 >>>>>>> BSgenome_1.28.0 >>>>>>> [5] DBI_0.2-5 grid_3.0.0 hwriter_1.3 >>>>>>> lattice_0.20-15 >>>>>>> [9] RCurl_1.95-4.1 RSQLite_0.11.3 ShortRead_1.18.0 >>>>>>> stats4_3.0.0 >>>>>>> [13] tools_3.0.0 XML_3.95-0.2 zlibbioc_1.6.0 >>>>>>> >>>>>>> _______________________________________________ >>>>>>> Bioconductor mailing list >>>>>>> Bioconductor at r-project.org >>>>>>> https://stat.ethz.ch/mailman/listinfo/bioconductor >>>>>>> Search the archives: >>>>>>> http://news.gmane.org/gmane.science.biology.informatics.conductor >>>>>> _______________________________________________ >>>>>> Bioconductor mailing list >>>>>> Bioconductor at r-project.org >>>>>> https://stat.ethz.ch/mailman/listinfo/bioconductor >>>>>> Search the archives: >>>>>> http://news.gmane.org/gmane.science.biology.informatics.conductor >>> > > > ---------------------------------------------------------------- > This message was sent using IMP, the Internet Messaging Program. > >
ADD REPLYlink written 4.6 years ago by Marc Carlson7.2k
0
gravatar for Marc Carlson
4.6 years ago by
Marc Carlson7.2k
United States
Marc Carlson7.2k wrote:
Hi Ugo, So I have good news and bad news regarding that data file you wanted to use from: http://tophat.cbcb.umd.edu/igenomes.shtml The good news is that I have modified makeTranscriptDbFromGFF() to notice whenever a file only has exon rank information for CDS and to do the right thing (and make use of that information) while issuing a warning that it is doing so. The bad news is that after looking more carefully at the data from the site listed above, there is no way to make that data into a transcriptome without inferring the exon ranks. At least not in it's current state. The files on offer simply don't contain exon rank information. What I initially thought might be exon ranks is (after more careful inspection) something else, and so can't be used for that. So you probably should go with the advice from my last message and make a custom mm9 package that makes use of the UCSC transcriptome (or maybe one from BiomaRt depending on what you have aligned your data to). But I would not recommend using the data at that web site to make transcriptomes unless they can give you a file that has exon rank information. Marc On 05/06/2013 01:32 PM, Marc Carlson wrote: > Hi Ugo, > > This worked for me: > > library(OrganismDbi) > > gd <- list(join1 = c(GO.db="GOID", org.Mm.eg.db="GO"), > join2 = c(org.Mm.eg.db="ENTREZID", > TxDb.Mmusculus.UCSC.mm10.knownGene="GENEID")) > > makeOrganismPackage(pkgname = "Mus.musculus.mm9", > graphData = gd, > organism = "Mus musculus", > version = "1.0", > maintainer ="You <maintainer at="" someplace.org="">", > author = "You", > destDir = ".", > license = "Artistic-2.0") > > > Then I can do stuff like this (after I run R CMD INSTALL > Mus.musculus.mm9) : > > library(Mus.musculus.mm9) > > select(Mus.musculus.mm9, > head(keys(Mus.musculus.mm9,keytype="ENTREZID")),c("SYMBOL","TXSTRAND "),keytype="ENTREZID") > > txs <- transcripts(Mus.musculus.mm9, columns="SYMBOL") > > > Etc. > > > As for your other question, that error means that the chromosome names > in your input file are not all present in the chrominfo that you have > used. Usually this means that there is a row that says something like > chrom8_random (or something like that) as it's chromosome name in the > input file. And you don't have that in your chrominfo table, so the > database is stopping, because it doesn't have any way to know what > chrom8_random is? > > To fix it, you can either remove unwanted rows from your input file or > you can add new rows to your chrominfo object. > > > Marc > > > > On 05/04/2013 07:18 AM, Ugo Borello wrote: >> Dear Marc, >> I apologize for bothering you again, but I was intrigued by the >> custom Mus.musculus.mm9 object. It sounds very convenient indeed. >> so I tried, as you suggested: >> >> gd<-c(org.Mm.eg.db='SYMBOL', >> TxDb.Mmusculus.UCSC.mm9.knownGene="GENEID") >> ## I also tried to set org.Mm.eg.db='ENTREZID' >> >> destination <- tempfile() >> dir.create(destination) >> makeOrganismPackage(pkgname = "Mus.musculus.mm9", >> graphData = gd, >> organism = "Mus musculus", >> version = "1.0.0", >> maintainer = "Package >> Maintainer<maintainer at="" somewhere.com="">", >> author = "SomeBody", >> destDir = destination, >> license = "Artistic-2.0") >> >> and I got: >> Error in cbind(pkgs, keys) : >> number of rows of matrices must match (see arg 2) >> >> I am surely missing something here. Any suggestions? >> >> >> And, coming back to makeTranscriptDbFromGFF. >> I obtained my gtf annotation file from UCSC using genePredToGtf, as >> the UCSC people suggest >> (http://genomewiki.ucsc.edu/index.php/Genes_in_gtf_or_gff_format) >> >> txdb <-makeTranscriptDbFromGFF(file = 'mm9.gtf', format = "gtf", >> + exonRankAttributeName = "exon_number", >> + chrominfo = chrominfo, >> + dataSource = "UCSC", >> + species = "Mus musculus") >> extracting transcript information >> Estimating transcript ranges. >> Extracting gene IDs >> Processing splicing information for gtf file. >> Prepare the 'metadata' data frame ... metadata: OK >> Error in .checkForeignKey(transcripts_tx_chrom, NA, >> "transcripts$tx_chrom", : >> all the values in 'transcripts$tx_chrom' must be present in >> 'chrominfo$chrom' >> >> Why I get this error now? >> >>> chrominfo >> chrom length is_circular >> 1 chr10 129993255 FALSE >> 2 chr11 121843856 FALSE >> 3 chr12 121257530 FALSE >> 4 chr13 120284312 FALSE >> 5 chr14 125194864 FALSE >> 6 chr15 103494974 FALSE >> 7 chr16 98319150 FALSE >> 8 chr17 95272651 FALSE >> 9 chr18 90772031 FALSE >> 10 chr19 61342430 FALSE >> 11 chr1 197195432 FALSE >> 12 chr2 181748087 FALSE >> 13 chr3 159599783 FALSE >> 14 chr4 155630120 FALSE >> 15 chr5 152537259 FALSE >> 16 chr6 149517037 FALSE >> 17 chr7 152524553 FALSE >> 18 chr8 131738871 FALSE >> 19 chr9 124076172 FALSE >> 20 chrM 16299 FALSE >> 21 chrX 166650296 FALSE >> 22 chrY 15902555 FALSE >> >> Thank you very much for your time and patience. >> >> Ugo >> >> >> >> >> P.S. >> This is the annotation >>> annFile3<- import.gff('mm9.gtf', format='gtf', asRangedData=FALSE) >>> annFile3 >> GRanges with 962651 ranges and 9 metadata columns: >> seqnames ranges strand | source >> type score phase gene_id transcript_id >> <rle> <iranges> <rle> | <factor> >> <factor> <numeric> <integer> <character> <character> >> [1] chr1 [3195985, 3197398] - | knownGene >> exon <na> <na> uc007aet.1 uc007aet.1 >> [2] chr1 [3203520, 3205713] - | knownGene >> exon <na> <na> uc007aet.1 uc007aet.1 >> [3] chr1 [3204563, 3207049] - | knownGene >> exon <na> <na> Q5GH67 uc007aeu.1 >> [4] chr1 [3206106, 3207049] - | knownGene >> CDS <na> 2 Q5GH67 uc007aeu.1 >> [5] chr1 [3411783, 3411982] - | knownGene >> exon <na> <na> Q5GH67 uc007aeu.1 >> ... ... ... ... ... ... >> ... ... ... ... ... >> [962647] chrY_random [58502132, 58502365] + | knownGene >> CDS <na> 0 Q62458 uc012htl.1 >> [962648] chrY_random [58502369, 58502946] + | knownGene >> exon <na> <na> Q62458 uc012htl.1 >> [962649] chrY_random [58502369, 58502812] + | knownGene >> CDS <na> 0 Q62458 uc012htl.1 >> [962650] chrY_random [58502132, 58502134] + | knownGene >> start_codon <na> 0 Q62458 uc012htl.1 >> [962651] chrY_random [58502813, 58502815] + | knownGene >> stop_codon <na> 0 Q62458 uc012htl.1 >> exon_number exon_id gene_name >> <numeric> <character> <character> >> [1] 1 uc007aet.1.1 <na> >> [2] 2 uc007aet.1.2 <na> >> [3] 1 uc007aeu.1.1 Q5GH67 >> [4] 1 uc007aeu.1.1 Q5GH67 >> [5] 2 uc007aeu.1.2 Q5GH67 >> ... ... ... ... >> [962647] 1 uc012htl.1.1 Q62458 >> [962648] 2 uc012htl.1.2 Q62458 >> [962649] 2 uc012htl.1.2 Q62458 >> [962650] 1 uc012htl.1.1 Q62458 >> [962651] 1 uc012htl.1.1 Q62458 >> --- >> seqlengths: >> chr1 chr10 chr11 chr12 chr13 >> ... chrX chrX_random chrY chrY_random >> NA NA NA NA NA >> ... NA >> >> >> Quoting Marc Carlson <mcarlson at="" fhcrc.org="">: >> >>> Hi Ugo, >>> >>> >>> On 05/03/2013 04:00 AM, Ugo Borello wrote: >>>> Dear Carl, >>>> Thank you very much; it makes sense now. >>>> >>>> To quantify gene expression of my RNASeq samples I use your >>>> TxDb.Mmusculus.UCSC.mm9.knownGene annotation together with the >>>> alignments to >>>> the BSgenome.Mmusculus.UCSC.mm9 genome. >>>> Now that I used the UCSC mm9 mouse genome from the Illumina iGenomes I >>>> wanted to use their annotation. >>>> >>>> >>>> Anyway, I have now a general question. >>>> For the mouse mm9 genome (genome.fa) obtained from Illumina iGenomes >>>>> genome<- scanFaIndex('genome.fa') >>>>> seqlengths(genome) >>>> chr10 chr11 chr12 chr13 chr14 chr15 chr16 >>>> chr17 chr18 chr19 >>>> 129993255 121843856 121257530 120284312 125194864 103494974 98319150 >>>> 95272651 90772031 61342430 >>>> chr1 chr2 chr3 chr4 chr5 chr6 chr7 >>>> chr8 chr9 chrM >>>> 197195432 181748087 159599783 155630120 152537259 149517037 152524553 >>>> 131738871 124076172 16299 >>>> chrX chrY >>>> 166650296 15902555 >>>> >>>> For your TxDb.Mmusculus.UCSC.mm9.knownGene >>>>> seqlengths( TxDb.Mmusculus.UCSC.mm9.knownGene) >>>> chr1 chr2 chr3 chr4 chr5 >>>> chr6 chr7 chr8 chr9 >>>> 197195432 181748087 159599783 155630120 152537259 >>>> 149517037 152524553 131738871 124076172 >>>> chr10 chr11 chr12 chr13 chr14 >>>> chr15 chr16 chr17 chr18 >>>> 129993255 121843856 121257530 120284312 125194864 >>>> 103494974 98319150 95272651 90772031 >>>> chr19 chrX chrY chrM chr1_random >>>> chr3_random chr4_random chr5_random chr7_random >>>> 61342430 166650296 15902555 16299 1231697 >>>> 41899 160594 357350 362490 >>>> chr8_random chr9_random chr13_random chr16_random chr17_random >>>> chrX_random chrY_random chrUn_random >>>> 849593 449403 400311 3994 628739 >>>> 1785075 58682461 5900358 >>>> >>>> >>>> So. I want to filter out the '_random' stuff; is this the only and >>>> right way >>>> to do it? >>> >>> It depends on whether you want to use the _random stuff. My impression >>> of how people use this is that most people don't. So they would use >>> isActiveSeq() to toggle those off. >>> >>>>> ann<- TxDb.Mmusculus.UCSC.mm9.knownGene >>>>> isActiveSeq(ann)[seqlevels(ann)] <- FALSE >>>>> isActiveSeq(ann) <- c("chr10"=TRUE, "chr11"=TRUE, >>>> + "chr12"=TRUE,"chr13"=TRUE, >>>> + "chr14"=TRUE,"chr15"=TRUE, >>>> + "chr16"=TRUE, "chr17"=TRUE, >>>> + "chr18"=TRUE, "chr19"=TRUE, >>>> + "chr1"=TRUE, "chr2"=TRUE, >>>> + "chr3"=TRUE, "chr4"=TRUE, >>>> + "chr5"=TRUE, "chr6"=TRUE, >>>> + "chr7"=TRUE, "chr8"=TRUE, >>>> + "chr9"=TRUE, "chrM"=TRUE, >>>> + "chrX"=TRUE, "chrY"=TRUE) >>>> Then get my gene info this way? >>>>> genesInfo<- exons(ann, columns='gene_id') >>>> How can I add also gene names or symbols? >>> Again, it depends on what you want to do. If you are dealing with just >>> a TranscriptDb like this, then there are not gene symbols attached >>> already. So for that object yes, you would need to do it like that. >>> >>> >>> Now if you wanted gene symbols they are pretty easy to get. You can go >>> and get gene symbols by loading an org package like this: >>> >>> library(org.Mm.eg.db) >>> cols(org.Mm.eg.db) >>> >>> And then you could use select() to retrieve those (by using the gene >>> IDs as keys from your TranscriptDb object). But some people find this >>> extra step to be inconvenient, so I have created another route. And >>> that is to use a OrganismDb object. There is a package for this >>> already that I will use to demo here: >>> >>> library(Mus.musculus) >>> cols(Mus.musculus) >>> >>> You will notice that this kind of object has all the stuff you want in >>> one place. I suspect that you will find that to be a lot more >>> convenient: >>> >>> This basically means that you can do the thing you were interested in >>> like this: >>> >>> genesInfo <- exons(Mus.musculus, columns=c("GENEID","SYMBOL")) >>> >>> *But* in your case, you are using mm9, so you will want to make a >>> custom object (the Mus.musculus object is based on mm10). This is not >>> hard to do, but it is an extra step. You can read how to do it in >>> section 2 of the following vignette: >>> >>> http://www.bioconductor.org/packages/2.12/bioc/vignettes/OrganismD bi/inst/doc/OrganismDbi.pdf >>> >>> >>> >>> >>>> >>>> >>>> >>>> Thank you again for your help. >>>> >>>> Ugo >>>> >>>> >>>> >>>> >>>>> From: Marc Carlson <mcarlson at="" fhcrc.org=""> >>>>> Date: Thu, 02 May 2013 15:52:38 -0700 >>>>> To: Ugo Borello <ugo.borello at="" inserm.fr=""> >>>>> Cc: <bioconductor at="" r-project.org=""> >>>>> Subject: Re: [BioC] MakeTranscriptDbFromGFF >>>>> >>>>> Hi Ugo, >>>>> >>>>> The 15 GB tarball you sent me to contains several different GTF files >>>>> for genes. I grabbed this one as it seemed to be the most recent: >>>>> >>>>> Mus_musculus/UCSC/mm9/Annotation/Archives/archive-2013-03-06-15- 01-24/Genes/ge >>>>> >>>>> nes.gtf >>>>> >>>>> So looking at this file I can reproduce the problem you >>>>> mentioned. And >>>>> it shows me two problems. The 1st problem is that the only field >>>>> that >>>>> seems to contain any information about exon positions is called phase >>>>> (and not "exon_number" as was in the code I see from before). >>>>> There is >>>>> not actually any field called "exon_number" in this file. Either way, >>>>> one thing you can check is to make sure that the string you give >>>>> here is >>>>> the same as the appropriate field name that is used by the file. >>>>> There >>>>> is no way to know this information in advance since GTF does not >>>>> specify >>>>> how to encode this information (and in fact the information is >>>>> entirely >>>>> optional). >>>>> >>>>> The second problem is that even "phase" can't work right now since >>>>> the >>>>> authors of this gtf file have decided to only associate the exon rank >>>>> information only with CDS and never with exons features. So there is >>>>> not any actual 'exon' position information in this file, only >>>>> information for CDS positions. Now that I see people doing these >>>>> files >>>>> in this way, I plan to enhance the parser so that it can process >>>>> files >>>>> of this kind. >>>>> >>>>> >>>>> Is there a reason why you wanted to use this file and not the data >>>>> contained in this package here? >>>>> >>>>> http://www.bioconductor.org/packages/2.12/data/annotation/html/T xDb.Mmusculus. >>>>> >>>>> UCSC.mm9.knownGene.html >>>>> >>>>> >>>>> Marc >>>>> >>>>> >>>>> >>>>> On 05/02/2013 02:59 AM, Ugo Borello wrote: >>>>>> Dear Marc >>>>>> Sorry I was not precise on the origin of the gtf annotation file; >>>>>> I got the >>>>>> gtf file from here: >>>>>> http://tophat.cbcb.umd.edu/igenomes.shtml >>>>>> >>>>>> And more precisely from the Mus musculus/UCSC/mm9 folder >>>>>> Here the description of the content of the folder: >>>>>> ftp://igenome:G3nom3s4u at ussd-ftp.illumina.com/README.txt >>>>>> >>>>>> I realized reading the README.txt file that actually the >>>>>> genes.gtf file I >>>>>> used is the Ensembl annotation of the mm9 release. >>>>>> So, I changed dataSource = "Ensembl" in the function call and I >>>>>> got the same >>>>>> error message: >>>>>> Error in data.frame(..., check.names = FALSE) : >>>>>> arguments imply differing number of rows: 541775, 0 >>>>>> >>>>>> At the end of my previous email you have the result of calling: >>>>>> annFile<- import.gff('genes.gtf', format='gtf', asRangedData=FALSE) >>>>>> >>>>>> >>>>>> Thank you >>>>>> >>>>>> Ugo >>>>>> >>>>>>> From: Marc Carlson <mcarlson at="" fhcrc.org=""> >>>>>>> Date: Wed, 01 May 2013 15:33:02 -0700 >>>>>>> To: <bioconductor at="" r-project.org=""> >>>>>>> Subject: Re: [BioC] MakeTranscriptDbFromGFF >>>>>>> >>>>>>> Hi Ugo, >>>>>>> >>>>>>> Which UCSC file was it that you were trying to process? >>>>>>> >>>>>>> >>>>>>> Marc >>>>>>> >>>>>>> >>>>>>> >>>>>>> On 05/01/2013 02:21 AM, Ugo Borello wrote: >>>>>>>> Good morning, >>>>>>>> >>>>>>>> I have a little problem creating a TranscriptDb object using >>>>>>>> the function >>>>>>>> makeTranscriptDbFromGFF. I want to use this annotation to count >>>>>>>> the >>>>>>>> overlaps >>>>>>>> of my genomic alignments with genes. >>>>>>>> >>>>>>>> >>>>>>>> I ran: >>>>>>>> >>>>>>>>> txdb <-makeTranscriptDbFromGFF(file = annFile, format = "gtf", >>>>>>>> + exonRankAttributeName = "exon_number", >>>>>>>> + chrominfo = chrominfo, >>>>>>>> + dataSource = "UCSC", >>>>>>>> + species = "Mus musculus") >>>>>>>> >>>>>>>> And I got this error message: >>>>>>>> Error in data.frame(..., check.names = FALSE) : >>>>>>>> arguments imply differing number of rows: 541775, 0 >>>>>>>> >>>>>>>> >>>>>>>> "chrominfo" was (info retrieved from the fasta genome file): >>>>>>>> >>>>>>>>> chrominfo >>>>>>>> chrom length is_circular >>>>>>>> 1 chr10 129993255 FALSE >>>>>>>> 2 chr11 121843856 FALSE >>>>>>>> 3 chr12 121257530 FALSE >>>>>>>> 4 chr13 120284312 FALSE >>>>>>>> 5 chr14 125194864 FALSE >>>>>>>> 6 chr15 103494974 FALSE >>>>>>>> 7 chr16 98319150 FALSE >>>>>>>> 8 chr17 95272651 FALSE >>>>>>>> 9 chr18 90772031 FALSE >>>>>>>> 10 chr19 61342430 FALSE >>>>>>>> 11 chr1 197195432 FALSE >>>>>>>> 12 chr2 181748087 FALSE >>>>>>>> 13 chr3 159599783 FALSE >>>>>>>> 14 chr4 155630120 FALSE >>>>>>>> 15 chr5 152537259 FALSE >>>>>>>> 16 chr6 149517037 FALSE >>>>>>>> 17 chr7 152524553 FALSE >>>>>>>> 18 chr8 131738871 FALSE >>>>>>>> 19 chr9 124076172 FALSE >>>>>>>> 20 chrM 16299 TRUE >>>>>>>> 21 chrX 166650296 FALSE >>>>>>>> 22 chrY 15902555 FALSE >>>>>>>> >>>>>>>> >>>>>>>> I ran it again without the "exonRankAttributeName" argument and >>>>>>>> I got: >>>>>>>> >>>>>>>>> txdb <-makeTranscriptDbFromGFF(file = annFile, format = "gtf", >>>>>>>> + chrominfo = chrominfo, >>>>>>>> + dataSource = "UCSC", >>>>>>>> + species = "Mus musculus") >>>>>>>> extracting transcript information >>>>>>>> Estimating transcript ranges. >>>>>>>> Extracting gene IDs >>>>>>>> Processing splicing information for gtf file. >>>>>>>> Deducing exon rank from relative coordinates provided >>>>>>>> Prepare the 'metadata' data frame ... metadata: OK >>>>>>>> Error in .checkForeignKey(transcripts_tx_chrom, NA, >>>>>>>> "transcripts$tx_chrom", >>>>>>>> : >>>>>>>> all the values in 'transcripts$tx_chrom' must be present in >>>>>>>> 'chrominfo$chrom' >>>>>>>> In addition: Warning message: >>>>>>>> In .deduceExonRankings(exs, format = "gtf") : >>>>>>>> Infering Exon Rankings. If this is not what you expected, >>>>>>>> then please >>>>>>>> be >>>>>>>> sure that you have provided a valid attribute for >>>>>>>> exonRankAttributeName >>>>>>>> >>>>>>>> >>>>>>>> Without the "chrominfo" argument I got the same error message >>>>>>>> as the first >>>>>>>> time: >>>>>>>> >>>>>>>>> txdb <-makeTranscriptDbFromGFF(file = annFile, format = "gtf", >>>>>>>> + exonRankAttributeName = "exon_number", >>>>>>>> + dataSource = "UCSC", >>>>>>>> + species = "Mus musculus") >>>>>>>> >>>>>>>> Error in data.frame(..., check.names = FALSE) : >>>>>>>> arguments imply differing number of rows: 541775, 0 >>>>>>>> >>>>>>>> >>>>>>>> Finally when I eliminated both the "exonRankAttributeName" and the >>>>>>>> "chrominfo" arguments it worked but the warning reminded me of the >>>>>>>> "exonRankAttributeName" argument and the chromosome names are >>>>>>>> now different >>>>>>>> from the ones in the genome file and there is no info on their >>>>>>>> length >>>>>>>> >>>>>>>>> txdb <-makeTranscriptDbFromGFF(file = annFile, format = "gtf", >>>>>>>> + dataSource = "UCSC", >>>>>>>> + species = "Mus musculus") >>>>>>>> extracting transcript information >>>>>>>> Estimating transcript ranges. >>>>>>>> Extracting gene IDs >>>>>>>> Processing splicing information for gtf file. >>>>>>>> Deducing exon rank from relative coordinates provided >>>>>>>> Prepare the 'metadata' data frame ... metadata: OK >>>>>>>> Now generating chrominfo from available sequence names. No >>>>>>>> chromosome >>>>>>>> length >>>>>>>> information is available. >>>>>>>> Warning messages: >>>>>>>> 1: In .deduceExonRankings(exs, format = "gtf") : >>>>>>>> Infering Exon Rankings. If this is not what you expected, >>>>>>>> then please >>>>>>>> be >>>>>>>> sure that you have provided a valid attribute for >>>>>>>> exonRankAttributeName >>>>>>>> 2: In matchCircularity(chroms, circ_seqs) : >>>>>>>> None of the strings in your circ_seqs argument match your >>>>>>>> seqnames. >>>>>>>> >>>>>>>>> seqinfo(txdb) >>>>>>>> Seqinfo of length 32 >>>>>>>> seqnames seqlengths isCircular genome >>>>>>>> chr13 <na> FALSE <na> >>>>>>>> chr9 <na> FALSE <na> >>>>>>>> chr6 <na> FALSE <na> >>>>>>>> chrX <na> FALSE <na> >>>>>>>> chr17 <na> FALSE <na> >>>>>>>> chr2 <na> FALSE <na> >>>>>>>> chr7 <na> FALSE <na> >>>>>>>> chr18 <na> FALSE <na> >>>>>>>> chr8 <na> FALSE <na> >>>>>>>> ... ... ... ... >>>>>>>> chrY_random <na> FALSE <na> >>>>>>>> chrX_random <na> FALSE <na> >>>>>>>> chr5_random <na> FALSE <na> >>>>>>>> chr4_random <na> FALSE <na> >>>>>>>> chrY <na> FALSE <na> >>>>>>>> chr7_random <na> FALSE <na> >>>>>>>> chr17_random <na> FALSE <na> >>>>>>>> chr13_random <na> FALSE <na> >>>>>>>> chr1_random <na> FALSE <na> >>>>>>>> >>>>>>>> >>>>>>>> >>>>>>>> >>>>>>>> What am I doing wrong in my original call to >>>>>>>> makeTranscriptDbFromGFF? >>>>>>>> >>>>>>>> txdb <-makeTranscriptDbFromGFF(file = annFile, format = "gtf", >>>>>>>> exonRankAttributeName = "exon_number", >>>>>>>> chrominfo = chrominfo, >>>>>>>> dataSource = "UCSC", >>>>>>>> species = "Mus musculus") >>>>>>>> >>>>>>>> Why am I getting this unfair error message? >>>>>>>> Thank you for your help >>>>>>>> Ugo >>>>>>>> >>>>>>>> >>>>>>>> >>>>>>>> >>>>>>>> >>>>>>>> >>>>>>>> >>>>>>>> >>>>>>>> FYI, this is my annFile (My gtf annotation file was downloaded >>>>>>>> together >>>>>>>> with a fasta file containing the mouse genome from UCSC): >>>>>>>> >>>>>>>>> annFile >>>>>>>> GRanges with 595632 ranges and 9 metadata columns: >>>>>>>> seqnames ranges strand | source >>>>>>>> type >>>>>>>> score phase >>>>>>>> <rle> <iranges> <rle> | <factor> >>>>>>>> <factor> >>>>>>>> <numeric> <integer> >>>>>>>> [1] chr1 [3204563, 3207049] - | unknown >>>>>>>> exon >>>>>>>> <na> <na> >>>>>>>> [2] chr1 [3206103, 3206105] - | unknown >>>>>>>> stop_codon >>>>>>>> <na> <na> >>>>>>>> [3] chr1 [3206106, 3207049] - | unknown >>>>>>>> CDS >>>>>>>> <na> 2 >>>>>>>> [4] chr1 [3411783, 3411982] - | unknown >>>>>>>> CDS >>>>>>>> <na> 1 >>>>>>>> [5] chr1 [3411783, 3411982] - | unknown >>>>>>>> exon >>>>>>>> <na> <na> >>>>>>>> ... ... ... ... ... ... >>>>>>>> ... >>>>>>>> ... ... >>>>>>>> [595628] chrY_random [54422360, 54422362] + | unknown >>>>>>>> stop_codon >>>>>>>> <na> <na> >>>>>>>> [595629] chrY_random [58501955, 58502946] + | unknown >>>>>>>> exon >>>>>>>> <na> <na> >>>>>>>> [595630] chrY_random [58502132, 58502812] + | unknown >>>>>>>> CDS >>>>>>>> <na> 0 >>>>>>>> [595631] chrY_random [58502132, 58502134] + | unknown >>>>>>>> start_codon >>>>>>>> <na> <na> >>>>>>>> [595632] chrY_random [58502813, 58502815] + | unknown >>>>>>>> stop_codon >>>>>>>> <na> <na> >>>>>>>> gene_id transcript_id gene_name p_id >>>>>>>> tss_id >>>>>>>> <character> <character> <character> <character> >>>>>>>> <character> >>>>>>>> [1] Xkr4 NM_001011874 Xkr4 P2739 >>>>>>>> TSS1881 >>>>>>>> [2] Xkr4 NM_001011874 Xkr4 P2739 >>>>>>>> TSS1881 >>>>>>>> [3] Xkr4 NM_001011874 Xkr4 P2739 >>>>>>>> TSS1881 >>>>>>>> [4] Xkr4 NM_001011874 Xkr4 P2739 >>>>>>>> TSS1881 >>>>>>>> [5] Xkr4 NM_001011874 Xkr4 P2739 >>>>>>>> TSS1881 >>>>>>>> ... ... ... ... ... >>>>>>>> ... >>>>>>>> [595628] LOC100039753 NM_001017394 LOC100039753 P10196 >>>>>>>> TSS19491 >>>>>>>> [595629] LOC100039614 NM_001160137_4 LOC100039614 P22060 >>>>>>>> TSS4342 >>>>>>>> [595630] LOC100039614 NM_001160137_4 LOC100039614 P22060 >>>>>>>> TSS4342 >>>>>>>> [595631] LOC100039614 NM_001160137_4 LOC100039614 P22060 >>>>>>>> TSS4342 >>>>>>>> [595632] LOC100039614 NM_001160137_4 LOC100039614 P22060 >>>>>>>> TSS4342 >>>>>>>> --- >>>>>>>> seqlengths: >>>>>>>> chr1 chr10 chr11 chr12 ... >>>>>>>> chrX_random >>>>>>>> chrY chrY_random >>>>>>>> NA NA NA NA ... NA >>>>>>>> NA NA >>>>>>>> >>>>>>>> >>>>>>>>> sessionInfo() >>>>>>>> R version 3.0.0 (2013-04-03) >>>>>>>> Platform: x86_64-apple-darwin10.8.0 (64-bit) >>>>>>>> >>>>>>>> locale: >>>>>>>> [1] en_US.UTF-8/en_US.UTF-8/en_US.UTF-8/C/en_US.UTF-8/en_US.UTF-8 >>>>>>>> >>>>>>>> attached base packages: >>>>>>>> [1] parallel stats graphics grDevices utils datasets >>>>>>>> methods >>>>>>>> base >>>>>>>> >>>>>>>> other attached packages: >>>>>>>> [1] rtracklayer_1.20.1 Rbowtie_1.0.2 Rsamtools_1.12.2 >>>>>>>> Biostrings_2.28.0 >>>>>>>> [5] GenomicFeatures_1.12.0 AnnotationDbi_1.22.1 Biobase_2.20.0 >>>>>>>> GenomicRanges_1.12.2 >>>>>>>> [9] IRanges_1.18.0 BiocGenerics_0.6.0 >>>>>>>> >>>>>>>> loaded via a namespace (and not attached): >>>>>>>> [1] BiocInstaller_1.10.0 biomaRt_2.16.0 bitops_1.0-5 >>>>>>>> BSgenome_1.28.0 >>>>>>>> [5] DBI_0.2-5 grid_3.0.0 hwriter_1.3 >>>>>>>> lattice_0.20-15 >>>>>>>> [9] RCurl_1.95-4.1 RSQLite_0.11.3 ShortRead_1.18.0 >>>>>>>> stats4_3.0.0 >>>>>>>> [13] tools_3.0.0 XML_3.95-0.2 zlibbioc_1.6.0 >>>>>>>> >>>>>>>> _______________________________________________ >>>>>>>> Bioconductor mailing list >>>>>>>> Bioconductor at r-project.org >>>>>>>> https://stat.ethz.ch/mailman/listinfo/bioconductor >>>>>>>> Search the archives: >>>>>>>> http://news.gmane.org/gmane.science.biology.informatics.conductor >>>>>>> _______________________________________________ >>>>>>> Bioconductor mailing list >>>>>>> Bioconductor at r-project.org >>>>>>> https://stat.ethz.ch/mailman/listinfo/bioconductor >>>>>>> Search the archives: >>>>>>> http://news.gmane.org/gmane.science.biology.informatics.conductor >>>> >> >> >> ---------------------------------------------------------------- >> This message was sent using IMP, the Internet Messaging Program. >> >> > > _______________________________________________ > Bioconductor mailing list > Bioconductor at r-project.org > https://stat.ethz.ch/mailman/listinfo/bioconductor > Search the archives: > http://news.gmane.org/gmane.science.biology.informatics.conductor
ADD COMMENTlink written 4.6 years ago by Marc Carlson7.2k
Thank you very very much, Marc. Ugo > From: Marc Carlson <mcarlson at="" fhcrc.org=""> > Date: Mon, 06 May 2013 15:00:09 -0700 > To: <bioconductor at="" r-project.org=""> > Subject: Re: [BioC] MakeTranscriptDbFromGFF > > Hi Ugo, > > So I have good news and bad news regarding that data file you wanted to > use from: > > http://tophat.cbcb.umd.edu/igenomes.shtml > > The good news is that I have modified makeTranscriptDbFromGFF() to > notice whenever a file only has exon rank information for CDS and to do > the right thing (and make use of that information) while issuing a > warning that it is doing so. The bad news is that after looking more > carefully at the data from the site listed above, there is no way to > make that data into a transcriptome without inferring the exon ranks. > At least not in it's current state. The files on offer simply don't > contain exon rank information. What I initially thought might be exon > ranks is (after more careful inspection) something else, and so can't be > used for that. > > So you probably should go with the advice from my last message and make > a custom mm9 package that makes use of the UCSC transcriptome (or maybe > one from BiomaRt depending on what you have aligned your data to). But > I would not recommend using the data at that web site to make > transcriptomes unless they can give you a file that has exon rank > information. > > > Marc > > > > On 05/06/2013 01:32 PM, Marc Carlson wrote: >> Hi Ugo, >> >> This worked for me: >> >> library(OrganismDbi) >> >> gd <- list(join1 = c(GO.db="GOID", org.Mm.eg.db="GO"), >> join2 = c(org.Mm.eg.db="ENTREZID", >> TxDb.Mmusculus.UCSC.mm10.knownGene="GENEID")) >> >> makeOrganismPackage(pkgname = "Mus.musculus.mm9", >> graphData = gd, >> organism = "Mus musculus", >> version = "1.0", >> maintainer ="You <maintainer at="" someplace.org="">", >> author = "You", >> destDir = ".", >> license = "Artistic-2.0") >> >> >> Then I can do stuff like this (after I run R CMD INSTALL >> Mus.musculus.mm9) : >> >> library(Mus.musculus.mm9) >> >> select(Mus.musculus.mm9, >> head(keys(Mus.musculus.mm9,keytype="ENTREZID")),c("SYMBOL","TXSTRAN D"),keytyp >> e="ENTREZID") >> >> txs <- transcripts(Mus.musculus.mm9, columns="SYMBOL") >> >> >> Etc. >> >> >> As for your other question, that error means that the chromosome names >> in your input file are not all present in the chrominfo that you have >> used. Usually this means that there is a row that says something like >> chrom8_random (or something like that) as it's chromosome name in the >> input file. And you don't have that in your chrominfo table, so the >> database is stopping, because it doesn't have any way to know what >> chrom8_random is? >> >> To fix it, you can either remove unwanted rows from your input file or >> you can add new rows to your chrominfo object. >> >> >> Marc >> >> >> >> On 05/04/2013 07:18 AM, Ugo Borello wrote: >>> Dear Marc, >>> I apologize for bothering you again, but I was intrigued by the >>> custom Mus.musculus.mm9 object. It sounds very convenient indeed. >>> so I tried, as you suggested: >>> >>> gd<-c(org.Mm.eg.db='SYMBOL', >>> TxDb.Mmusculus.UCSC.mm9.knownGene="GENEID") >>> ## I also tried to set org.Mm.eg.db='ENTREZID' >>> >>> destination <- tempfile() >>> dir.create(destination) >>> makeOrganismPackage(pkgname = "Mus.musculus.mm9", >>> graphData = gd, >>> organism = "Mus musculus", >>> version = "1.0.0", >>> maintainer = "Package >>> Maintainer<maintainer at="" somewhere.com="">", >>> author = "SomeBody", >>> destDir = destination, >>> license = "Artistic-2.0") >>> >>> and I got: >>> Error in cbind(pkgs, keys) : >>> number of rows of matrices must match (see arg 2) >>> >>> I am surely missing something here. Any suggestions? >>> >>> >>> And, coming back to makeTranscriptDbFromGFF. >>> I obtained my gtf annotation file from UCSC using genePredToGtf, as >>> the UCSC people suggest >>> (http://genomewiki.ucsc.edu/index.php/Genes_in_gtf_or_gff_format) >>> >>> txdb <-makeTranscriptDbFromGFF(file = 'mm9.gtf', format = "gtf", >>> + exonRankAttributeName = "exon_number", >>> + chrominfo = chrominfo, >>> + dataSource = "UCSC", >>> + species = "Mus musculus") >>> extracting transcript information >>> Estimating transcript ranges. >>> Extracting gene IDs >>> Processing splicing information for gtf file. >>> Prepare the 'metadata' data frame ... metadata: OK >>> Error in .checkForeignKey(transcripts_tx_chrom, NA, >>> "transcripts$tx_chrom", : >>> all the values in 'transcripts$tx_chrom' must be present in >>> 'chrominfo$chrom' >>> >>> Why I get this error now? >>> >>>> chrominfo >>> chrom length is_circular >>> 1 chr10 129993255 FALSE >>> 2 chr11 121843856 FALSE >>> 3 chr12 121257530 FALSE >>> 4 chr13 120284312 FALSE >>> 5 chr14 125194864 FALSE >>> 6 chr15 103494974 FALSE >>> 7 chr16 98319150 FALSE >>> 8 chr17 95272651 FALSE >>> 9 chr18 90772031 FALSE >>> 10 chr19 61342430 FALSE >>> 11 chr1 197195432 FALSE >>> 12 chr2 181748087 FALSE >>> 13 chr3 159599783 FALSE >>> 14 chr4 155630120 FALSE >>> 15 chr5 152537259 FALSE >>> 16 chr6 149517037 FALSE >>> 17 chr7 152524553 FALSE >>> 18 chr8 131738871 FALSE >>> 19 chr9 124076172 FALSE >>> 20 chrM 16299 FALSE >>> 21 chrX 166650296 FALSE >>> 22 chrY 15902555 FALSE >>> >>> Thank you very much for your time and patience. >>> >>> Ugo >>> >>> >>> >>> >>> P.S. >>> This is the annotation >>>> annFile3<- import.gff('mm9.gtf', format='gtf', asRangedData=FALSE) >>>> annFile3 >>> GRanges with 962651 ranges and 9 metadata columns: >>> seqnames ranges strand | source >>> type score phase gene_id transcript_id >>> <rle> <iranges> <rle> | <factor> >>> <factor> <numeric> <integer> <character> <character> >>> [1] chr1 [3195985, 3197398] - | knownGene >>> exon <na> <na> uc007aet.1 uc007aet.1 >>> [2] chr1 [3203520, 3205713] - | knownGene >>> exon <na> <na> uc007aet.1 uc007aet.1 >>> [3] chr1 [3204563, 3207049] - | knownGene >>> exon <na> <na> Q5GH67 uc007aeu.1 >>> [4] chr1 [3206106, 3207049] - | knownGene >>> CDS <na> 2 Q5GH67 uc007aeu.1 >>> [5] chr1 [3411783, 3411982] - | knownGene >>> exon <na> <na> Q5GH67 uc007aeu.1 >>> ... ... ... ... ... ... >>> ... ... ... ... ... >>> [962647] chrY_random [58502132, 58502365] + | knownGene >>> CDS <na> 0 Q62458 uc012htl.1 >>> [962648] chrY_random [58502369, 58502946] + | knownGene >>> exon <na> <na> Q62458 uc012htl.1 >>> [962649] chrY_random [58502369, 58502812] + | knownGene >>> CDS <na> 0 Q62458 uc012htl.1 >>> [962650] chrY_random [58502132, 58502134] + | knownGene >>> start_codon <na> 0 Q62458 uc012htl.1 >>> [962651] chrY_random [58502813, 58502815] + | knownGene >>> stop_codon <na> 0 Q62458 uc012htl.1 >>> exon_number exon_id gene_name >>> <numeric> <character> <character> >>> [1] 1 uc007aet.1.1 <na> >>> [2] 2 uc007aet.1.2 <na> >>> [3] 1 uc007aeu.1.1 Q5GH67 >>> [4] 1 uc007aeu.1.1 Q5GH67 >>> [5] 2 uc007aeu.1.2 Q5GH67 >>> ... ... ... ... >>> [962647] 1 uc012htl.1.1 Q62458 >>> [962648] 2 uc012htl.1.2 Q62458 >>> [962649] 2 uc012htl.1.2 Q62458 >>> [962650] 1 uc012htl.1.1 Q62458 >>> [962651] 1 uc012htl.1.1 Q62458 >>> --- >>> seqlengths: >>> chr1 chr10 chr11 chr12 chr13 >>> ... chrX chrX_random chrY chrY_random >>> NA NA NA NA NA >>> ... NA >>> >>> >>> Quoting Marc Carlson <mcarlson at="" fhcrc.org="">: >>> >>>> Hi Ugo, >>>> >>>> >>>> On 05/03/2013 04:00 AM, Ugo Borello wrote: >>>>> Dear Carl, >>>>> Thank you very much; it makes sense now. >>>>> >>>>> To quantify gene expression of my RNASeq samples I use your >>>>> TxDb.Mmusculus.UCSC.mm9.knownGene annotation together with the >>>>> alignments to >>>>> the BSgenome.Mmusculus.UCSC.mm9 genome. >>>>> Now that I used the UCSC mm9 mouse genome from the Illumina iGenomes I >>>>> wanted to use their annotation. >>>>> >>>>> >>>>> Anyway, I have now a general question. >>>>> For the mouse mm9 genome (genome.fa) obtained from Illumina iGenomes >>>>>> genome<- scanFaIndex('genome.fa') >>>>>> seqlengths(genome) >>>>> chr10 chr11 chr12 chr13 chr14 chr15 chr16 >>>>> chr17 chr18 chr19 >>>>> 129993255 121843856 121257530 120284312 125194864 103494974 98319150 >>>>> 95272651 90772031 61342430 >>>>> chr1 chr2 chr3 chr4 chr5 chr6 chr7 >>>>> chr8 chr9 chrM >>>>> 197195432 181748087 159599783 155630120 152537259 149517037 152524553 >>>>> 131738871 124076172 16299 >>>>> chrX chrY >>>>> 166650296 15902555 >>>>> >>>>> For your TxDb.Mmusculus.UCSC.mm9.knownGene >>>>>> seqlengths( TxDb.Mmusculus.UCSC.mm9.knownGene) >>>>> chr1 chr2 chr3 chr4 chr5 >>>>> chr6 chr7 chr8 chr9 >>>>> 197195432 181748087 159599783 155630120 152537259 >>>>> 149517037 152524553 131738871 124076172 >>>>> chr10 chr11 chr12 chr13 chr14 >>>>> chr15 chr16 chr17 chr18 >>>>> 129993255 121843856 121257530 120284312 125194864 >>>>> 103494974 98319150 95272651 90772031 >>>>> chr19 chrX chrY chrM chr1_random >>>>> chr3_random chr4_random chr5_random chr7_random >>>>> 61342430 166650296 15902555 16299 1231697 >>>>> 41899 160594 357350 362490 >>>>> chr8_random chr9_random chr13_random chr16_random chr17_random >>>>> chrX_random chrY_random chrUn_random >>>>> 849593 449403 400311 3994 628739 >>>>> 1785075 58682461 5900358 >>>>> >>>>> >>>>> So. I want to filter out the '_random' stuff; is this the only and >>>>> right way >>>>> to do it? >>>> >>>> It depends on whether you want to use the _random stuff. My impression >>>> of how people use this is that most people don't. So they would use >>>> isActiveSeq() to toggle those off. >>>> >>>>>> ann<- TxDb.Mmusculus.UCSC.mm9.knownGene >>>>>> isActiveSeq(ann)[seqlevels(ann)] <- FALSE >>>>>> isActiveSeq(ann) <- c("chr10"=TRUE, "chr11"=TRUE, >>>>> + "chr12"=TRUE,"chr13"=TRUE, >>>>> + "chr14"=TRUE,"chr15"=TRUE, >>>>> + "chr16"=TRUE, "chr17"=TRUE, >>>>> + "chr18"=TRUE, "chr19"=TRUE, >>>>> + "chr1"=TRUE, "chr2"=TRUE, >>>>> + "chr3"=TRUE, "chr4"=TRUE, >>>>> + "chr5"=TRUE, "chr6"=TRUE, >>>>> + "chr7"=TRUE, "chr8"=TRUE, >>>>> + "chr9"=TRUE, "chrM"=TRUE, >>>>> + "chrX"=TRUE, "chrY"=TRUE) >>>>> Then get my gene info this way? >>>>>> genesInfo<- exons(ann, columns='gene_id') >>>>> How can I add also gene names or symbols? >>>> Again, it depends on what you want to do. If you are dealing with just >>>> a TranscriptDb like this, then there are not gene symbols attached >>>> already. So for that object yes, you would need to do it like that. >>>> >>>> >>>> Now if you wanted gene symbols they are pretty easy to get. You can go >>>> and get gene symbols by loading an org package like this: >>>> >>>> library(org.Mm.eg.db) >>>> cols(org.Mm.eg.db) >>>> >>>> And then you could use select() to retrieve those (by using the gene >>>> IDs as keys from your TranscriptDb object). But some people find this >>>> extra step to be inconvenient, so I have created another route. And >>>> that is to use a OrganismDb object. There is a package for this >>>> already that I will use to demo here: >>>> >>>> library(Mus.musculus) >>>> cols(Mus.musculus) >>>> >>>> You will notice that this kind of object has all the stuff you want in >>>> one place. I suspect that you will find that to be a lot more >>>> convenient: >>>> >>>> This basically means that you can do the thing you were interested in >>>> like this: >>>> >>>> genesInfo <- exons(Mus.musculus, columns=c("GENEID","SYMBOL")) >>>> >>>> *But* in your case, you are using mm9, so you will want to make a >>>> custom object (the Mus.musculus object is based on mm10). This is not >>>> hard to do, but it is an extra step. You can read how to do it in >>>> section 2 of the following vignette: >>>> >>>> http://www.bioconductor.org/packages/2.12/bioc/vignettes/Organism Dbi/inst/d >>>> oc/OrganismDbi.pdf >>>> >>>> >>>> >>>> >>>>> >>>>> >>>>> >>>>> Thank you again for your help. >>>>> >>>>> Ugo >>>>> >>>>> >>>>> >>>>> >>>>>> From: Marc Carlson <mcarlson at="" fhcrc.org=""> >>>>>> Date: Thu, 02 May 2013 15:52:38 -0700 >>>>>> To: Ugo Borello <ugo.borello at="" inserm.fr=""> >>>>>> Cc: <bioconductor at="" r-project.org=""> >>>>>> Subject: Re: [BioC] MakeTranscriptDbFromGFF >>>>>> >>>>>> Hi Ugo, >>>>>> >>>>>> The 15 GB tarball you sent me to contains several different GTF files >>>>>> for genes. I grabbed this one as it seemed to be the most recent: >>>>>> >>>>>> Mus_musculus/UCSC/mm9/Annotation/Archives/archive-2013-03-06-15 -01-24/Gen >>>>>> es/ge >>>>>> >>>>>> nes.gtf >>>>>> >>>>>> So looking at this file I can reproduce the problem you >>>>>> mentioned. And >>>>>> it shows me two problems. The 1st problem is that the only field >>>>>> that >>>>>> seems to contain any information about exon positions is called phase >>>>>> (and not "exon_number" as was in the code I see from before). >>>>>> There is >>>>>> not actually any field called "exon_number" in this file. Either way, >>>>>> one thing you can check is to make sure that the string you give >>>>>> here is >>>>>> the same as the appropriate field name that is used by the file. >>>>>> There >>>>>> is no way to know this information in advance since GTF does not >>>>>> specify >>>>>> how to encode this information (and in fact the information is >>>>>> entirely >>>>>> optional). >>>>>> >>>>>> The second problem is that even "phase" can't work right now since >>>>>> the >>>>>> authors of this gtf file have decided to only associate the exon rank >>>>>> information only with CDS and never with exons features. So there is >>>>>> not any actual 'exon' position information in this file, only >>>>>> information for CDS positions. Now that I see people doing these >>>>>> files >>>>>> in this way, I plan to enhance the parser so that it can process >>>>>> files >>>>>> of this kind. >>>>>> >>>>>> >>>>>> Is there a reason why you wanted to use this file and not the data >>>>>> contained in this package here? >>>>>> >>>>>> http://www.bioconductor.org/packages/2.12/data/annotation/html/ TxDb.Mmusc >>>>>> ulus. >>>>>> >>>>>> UCSC.mm9.knownGene.html >>>>>> >>>>>> >>>>>> Marc >>>>>> >>>>>> >>>>>> >>>>>> On 05/02/2013 02:59 AM, Ugo Borello wrote: >>>>>>> Dear Marc >>>>>>> Sorry I was not precise on the origin of the gtf annotation file; >>>>>>> I got the >>>>>>> gtf file from here: >>>>>>> http://tophat.cbcb.umd.edu/igenomes.shtml >>>>>>> >>>>>>> And more precisely from the Mus musculus/UCSC/mm9 folder >>>>>>> Here the description of the content of the folder: >>>>>>> ftp://igenome:G3nom3s4u at ussd-ftp.illumina.com/README.txt >>>>>>> >>>>>>> I realized reading the README.txt file that actually the >>>>>>> genes.gtf file I >>>>>>> used is the Ensembl annotation of the mm9 release. >>>>>>> So, I changed dataSource = "Ensembl" in the function call and I >>>>>>> got the same >>>>>>> error message: >>>>>>> Error in data.frame(..., check.names = FALSE) : >>>>>>> arguments imply differing number of rows: 541775, 0 >>>>>>> >>>>>>> At the end of my previous email you have the result of calling: >>>>>>> annFile<- import.gff('genes.gtf', format='gtf', asRangedData=FALSE) >>>>>>> >>>>>>> >>>>>>> Thank you >>>>>>> >>>>>>> Ugo >>>>>>> >>>>>>>> From: Marc Carlson <mcarlson at="" fhcrc.org=""> >>>>>>>> Date: Wed, 01 May 2013 15:33:02 -0700 >>>>>>>> To: <bioconductor at="" r-project.org=""> >>>>>>>> Subject: Re: [BioC] MakeTranscriptDbFromGFF >>>>>>>> >>>>>>>> Hi Ugo, >>>>>>>> >>>>>>>> Which UCSC file was it that you were trying to process? >>>>>>>> >>>>>>>> >>>>>>>> Marc >>>>>>>> >>>>>>>> >>>>>>>> >>>>>>>> On 05/01/2013 02:21 AM, Ugo Borello wrote: >>>>>>>>> Good morning, >>>>>>>>> >>>>>>>>> I have a little problem creating a TranscriptDb object using >>>>>>>>> the function >>>>>>>>> makeTranscriptDbFromGFF. I want to use this annotation to count >>>>>>>>> the >>>>>>>>> overlaps >>>>>>>>> of my genomic alignments with genes. >>>>>>>>> >>>>>>>>> >>>>>>>>> I ran: >>>>>>>>> >>>>>>>>>> txdb <-makeTranscriptDbFromGFF(file = annFile, format = "gtf", >>>>>>>>> + exonRankAttributeName = "exon_number", >>>>>>>>> + chrominfo = chrominfo, >>>>>>>>> + dataSource = "UCSC", >>>>>>>>> + species = "Mus musculus") >>>>>>>>> >>>>>>>>> And I got this error message: >>>>>>>>> Error in data.frame(..., check.names = FALSE) : >>>>>>>>> arguments imply differing number of rows: 541775, 0 >>>>>>>>> >>>>>>>>> >>>>>>>>> "chrominfo" was (info retrieved from the fasta genome file): >>>>>>>>> >>>>>>>>>> chrominfo >>>>>>>>> chrom length is_circular >>>>>>>>> 1 chr10 129993255 FALSE >>>>>>>>> 2 chr11 121843856 FALSE >>>>>>>>> 3 chr12 121257530 FALSE >>>>>>>>> 4 chr13 120284312 FALSE >>>>>>>>> 5 chr14 125194864 FALSE >>>>>>>>> 6 chr15 103494974 FALSE >>>>>>>>> 7 chr16 98319150 FALSE >>>>>>>>> 8 chr17 95272651 FALSE >>>>>>>>> 9 chr18 90772031 FALSE >>>>>>>>> 10 chr19 61342430 FALSE >>>>>>>>> 11 chr1 197195432 FALSE >>>>>>>>> 12 chr2 181748087 FALSE >>>>>>>>> 13 chr3 159599783 FALSE >>>>>>>>> 14 chr4 155630120 FALSE >>>>>>>>> 15 chr5 152537259 FALSE >>>>>>>>> 16 chr6 149517037 FALSE >>>>>>>>> 17 chr7 152524553 FALSE >>>>>>>>> 18 chr8 131738871 FALSE >>>>>>>>> 19 chr9 124076172 FALSE >>>>>>>>> 20 chrM 16299 TRUE >>>>>>>>> 21 chrX 166650296 FALSE >>>>>>>>> 22 chrY 15902555 FALSE >>>>>>>>> >>>>>>>>> >>>>>>>>> I ran it again without the "exonRankAttributeName" argument and >>>>>>>>> I got: >>>>>>>>> >>>>>>>>>> txdb <-makeTranscriptDbFromGFF(file = annFile, format = "gtf", >>>>>>>>> + chrominfo = chrominfo, >>>>>>>>> + dataSource = "UCSC", >>>>>>>>> + species = "Mus musculus") >>>>>>>>> extracting transcript information >>>>>>>>> Estimating transcript ranges. >>>>>>>>> Extracting gene IDs >>>>>>>>> Processing splicing information for gtf file. >>>>>>>>> Deducing exon rank from relative coordinates provided >>>>>>>>> Prepare the 'metadata' data frame ... metadata: OK >>>>>>>>> Error in .checkForeignKey(transcripts_tx_chrom, NA, >>>>>>>>> "transcripts$tx_chrom", >>>>>>>>> : >>>>>>>>> all the values in 'transcripts$tx_chrom' must be present in >>>>>>>>> 'chrominfo$chrom' >>>>>>>>> In addition: Warning message: >>>>>>>>> In .deduceExonRankings(exs, format = "gtf") : >>>>>>>>> Infering Exon Rankings. If this is not what you expected, >>>>>>>>> then please >>>>>>>>> be >>>>>>>>> sure that you have provided a valid attribute for >>>>>>>>> exonRankAttributeName >>>>>>>>> >>>>>>>>> >>>>>>>>> Without the "chrominfo" argument I got the same error message >>>>>>>>> as the first >>>>>>>>> time: >>>>>>>>> >>>>>>>>>> txdb <-makeTranscriptDbFromGFF(file = annFile, format = "gtf", >>>>>>>>> + exonRankAttributeName = "exon_number", >>>>>>>>> + dataSource = "UCSC", >>>>>>>>> + species = "Mus musculus") >>>>>>>>> >>>>>>>>> Error in data.frame(..., check.names = FALSE) : >>>>>>>>> arguments imply differing number of rows: 541775, 0 >>>>>>>>> >>>>>>>>> >>>>>>>>> Finally when I eliminated both the "exonRankAttributeName" and the >>>>>>>>> "chrominfo" arguments it worked but the warning reminded me of the >>>>>>>>> "exonRankAttributeName" argument and the chromosome names are >>>>>>>>> now different >>>>>>>>> from the ones in the genome file and there is no info on their >>>>>>>>> length >>>>>>>>> >>>>>>>>>> txdb <-makeTranscriptDbFromGFF(file = annFile, format = "gtf", >>>>>>>>> + dataSource = "UCSC", >>>>>>>>> + species = "Mus musculus") >>>>>>>>> extracting transcript information >>>>>>>>> Estimating transcript ranges. >>>>>>>>> Extracting gene IDs >>>>>>>>> Processing splicing information for gtf file. >>>>>>>>> Deducing exon rank from relative coordinates provided >>>>>>>>> Prepare the 'metadata' data frame ... metadata: OK >>>>>>>>> Now generating chrominfo from available sequence names. No >>>>>>>>> chromosome >>>>>>>>> length >>>>>>>>> information is available. >>>>>>>>> Warning messages: >>>>>>>>> 1: In .deduceExonRankings(exs, format = "gtf") : >>>>>>>>> Infering Exon Rankings. If this is not what you expected, >>>>>>>>> then please >>>>>>>>> be >>>>>>>>> sure that you have provided a valid attribute for >>>>>>>>> exonRankAttributeName >>>>>>>>> 2: In matchCircularity(chroms, circ_seqs) : >>>>>>>>> None of the strings in your circ_seqs argument match your >>>>>>>>> seqnames. >>>>>>>>> >>>>>>>>>> seqinfo(txdb) >>>>>>>>> Seqinfo of length 32 >>>>>>>>> seqnames seqlengths isCircular genome >>>>>>>>> chr13 <na> FALSE <na> >>>>>>>>> chr9 <na> FALSE <na> >>>>>>>>> chr6 <na> FALSE <na> >>>>>>>>> chrX <na> FALSE <na> >>>>>>>>> chr17 <na> FALSE <na> >>>>>>>>> chr2 <na> FALSE <na> >>>>>>>>> chr7 <na> FALSE <na> >>>>>>>>> chr18 <na> FALSE <na> >>>>>>>>> chr8 <na> FALSE <na> >>>>>>>>> ... ... ... ... >>>>>>>>> chrY_random <na> FALSE <na> >>>>>>>>> chrX_random <na> FALSE <na> >>>>>>>>> chr5_random <na> FALSE <na> >>>>>>>>> chr4_random <na> FALSE <na> >>>>>>>>> chrY <na> FALSE <na> >>>>>>>>> chr7_random <na> FALSE <na> >>>>>>>>> chr17_random <na> FALSE <na> >>>>>>>>> chr13_random <na> FALSE <na> >>>>>>>>> chr1_random <na> FALSE <na> >>>>>>>>> >>>>>>>>> >>>>>>>>> >>>>>>>>> >>>>>>>>> What am I doing wrong in my original call to >>>>>>>>> makeTranscriptDbFromGFF? >>>>>>>>> >>>>>>>>> txdb <-makeTranscriptDbFromGFF(file = annFile, format = "gtf", >>>>>>>>> exonRankAttributeName = "exon_number", >>>>>>>>> chrominfo = chrominfo, >>>>>>>>> dataSource = "UCSC", >>>>>>>>> species = "Mus musculus") >>>>>>>>> >>>>>>>>> Why am I getting this unfair error message? >>>>>>>>> Thank you for your help >>>>>>>>> Ugo >>>>>>>>> >>>>>>>>> >>>>>>>>> >>>>>>>>> >>>>>>>>> >>>>>>>>> >>>>>>>>> >>>>>>>>> >>>>>>>>> FYI, this is my annFile (My gtf annotation file was downloaded >>>>>>>>> together >>>>>>>>> with a fasta file containing the mouse genome from UCSC): >>>>>>>>> >>>>>>>>>> annFile >>>>>>>>> GRanges with 595632 ranges and 9 metadata columns: >>>>>>>>> seqnames ranges strand | source >>>>>>>>> type >>>>>>>>> score phase >>>>>>>>> <rle> <iranges> <rle> | <factor> >>>>>>>>> <factor> >>>>>>>>> <numeric> <integer> >>>>>>>>> [1] chr1 [3204563, 3207049] - | unknown >>>>>>>>> exon >>>>>>>>> <na> <na> >>>>>>>>> [2] chr1 [3206103, 3206105] - | unknown >>>>>>>>> stop_codon >>>>>>>>> <na> <na> >>>>>>>>> [3] chr1 [3206106, 3207049] - | unknown >>>>>>>>> CDS >>>>>>>>> <na> 2 >>>>>>>>> [4] chr1 [3411783, 3411982] - | unknown >>>>>>>>> CDS >>>>>>>>> <na> 1 >>>>>>>>> [5] chr1 [3411783, 3411982] - | unknown >>>>>>>>> exon >>>>>>>>> <na> <na> >>>>>>>>> ... ... ... ... ... ... >>>>>>>>> ... >>>>>>>>> ... ... >>>>>>>>> [595628] chrY_random [54422360, 54422362] + | unknown >>>>>>>>> stop_codon >>>>>>>>> <na> <na> >>>>>>>>> [595629] chrY_random [58501955, 58502946] + | unknown >>>>>>>>> exon >>>>>>>>> <na> <na> >>>>>>>>> [595630] chrY_random [58502132, 58502812] + | unknown >>>>>>>>> CDS >>>>>>>>> <na> 0 >>>>>>>>> [595631] chrY_random [58502132, 58502134] + | unknown >>>>>>>>> start_codon >>>>>>>>> <na> <na> >>>>>>>>> [595632] chrY_random [58502813, 58502815] + | unknown >>>>>>>>> stop_codon >>>>>>>>> <na> <na> >>>>>>>>> gene_id transcript_id gene_name p_id >>>>>>>>> tss_id >>>>>>>>> <character> <character> <character> <character> >>>>>>>>> <character> >>>>>>>>> [1] Xkr4 NM_001011874 Xkr4 P2739 >>>>>>>>> TSS1881 >>>>>>>>> [2] Xkr4 NM_001011874 Xkr4 P2739 >>>>>>>>> TSS1881 >>>>>>>>> [3] Xkr4 NM_001011874 Xkr4 P2739 >>>>>>>>> TSS1881 >>>>>>>>> [4] Xkr4 NM_001011874 Xkr4 P2739 >>>>>>>>> TSS1881 >>>>>>>>> [5] Xkr4 NM_001011874 Xkr4 P2739 >>>>>>>>> TSS1881 >>>>>>>>> ... ... ... ... ... >>>>>>>>> ... >>>>>>>>> [595628] LOC100039753 NM_001017394 LOC100039753 P10196 >>>>>>>>> TSS19491 >>>>>>>>> [595629] LOC100039614 NM_001160137_4 LOC100039614 P22060 >>>>>>>>> TSS4342 >>>>>>>>> [595630] LOC100039614 NM_001160137_4 LOC100039614 P22060 >>>>>>>>> TSS4342 >>>>>>>>> [595631] LOC100039614 NM_001160137_4 LOC100039614 P22060 >>>>>>>>> TSS4342 >>>>>>>>> [595632] LOC100039614 NM_001160137_4 LOC100039614 P22060 >>>>>>>>> TSS4342 >>>>>>>>> --- >>>>>>>>> seqlengths: >>>>>>>>> chr1 chr10 chr11 chr12 ... >>>>>>>>> chrX_random >>>>>>>>> chrY chrY_random >>>>>>>>> NA NA NA NA ... NA >>>>>>>>> NA NA >>>>>>>>> >>>>>>>>> >>>>>>>>>> sessionInfo() >>>>>>>>> R version 3.0.0 (2013-04-03) >>>>>>>>> Platform: x86_64-apple-darwin10.8.0 (64-bit) >>>>>>>>> >>>>>>>>> locale: >>>>>>>>> [1] en_US.UTF-8/en_US.UTF-8/en_US.UTF-8/C/en_US.UTF-8/en_US.UTF-8 >>>>>>>>> >>>>>>>>> attached base packages: >>>>>>>>> [1] parallel stats graphics grDevices utils datasets >>>>>>>>> methods >>>>>>>>> base >>>>>>>>> >>>>>>>>> other attached packages: >>>>>>>>> [1] rtracklayer_1.20.1 Rbowtie_1.0.2 Rsamtools_1.12.2 >>>>>>>>> Biostrings_2.28.0 >>>>>>>>> [5] GenomicFeatures_1.12.0 AnnotationDbi_1.22.1 Biobase_2.20.0 >>>>>>>>> GenomicRanges_1.12.2 >>>>>>>>> [9] IRanges_1.18.0 BiocGenerics_0.6.0 >>>>>>>>> >>>>>>>>> loaded via a namespace (and not attached): >>>>>>>>> [1] BiocInstaller_1.10.0 biomaRt_2.16.0 bitops_1.0-5 >>>>>>>>> BSgenome_1.28.0 >>>>>>>>> [5] DBI_0.2-5 grid_3.0.0 hwriter_1.3 >>>>>>>>> lattice_0.20-15 >>>>>>>>> [9] RCurl_1.95-4.1 RSQLite_0.11.3 ShortRead_1.18.0 >>>>>>>>> stats4_3.0.0 >>>>>>>>> [13] tools_3.0.0 XML_3.95-0.2 zlibbioc_1.6.0 >>>>>>>>> >>>>>>>>> _______________________________________________ >>>>>>>>> Bioconductor mailing list >>>>>>>>> Bioconductor at r-project.org >>>>>>>>> https://stat.ethz.ch/mailman/listinfo/bioconductor >>>>>>>>> Search the archives: >>>>>>>>> http://news.gmane.org/gmane.science.biology.informatics.conductor >>>>>>>> _______________________________________________ >>>>>>>> Bioconductor mailing list >>>>>>>> Bioconductor at r-project.org >>>>>>>> https://stat.ethz.ch/mailman/listinfo/bioconductor >>>>>>>> Search the archives: >>>>>>>> http://news.gmane.org/gmane.science.biology.informatics.conductor >>>>> >>> >>> >>> ---------------------------------------------------------------- >>> This message was sent using IMP, the Internet Messaging Program. >>> >>> >> >> _______________________________________________ >> Bioconductor mailing list >> Bioconductor at r-project.org >> https://stat.ethz.ch/mailman/listinfo/bioconductor >> Search the archives: >> http://news.gmane.org/gmane.science.biology.informatics.conductor > > _______________________________________________ > Bioconductor mailing list > Bioconductor at r-project.org > https://stat.ethz.ch/mailman/listinfo/bioconductor > Search the archives: > http://news.gmane.org/gmane.science.biology.informatics.conductor
ADD REPLYlink written 4.6 years ago by Ugo Borello340
Please log in to add an answer.

Help
Access

Use of this site constitutes acceptance of our User Agreement and Privacy Policy.
Powered by Biostar version 2.2.0
Traffic: 341 users visited in the last hour