Question

Working with ensembl 73 & BSGenome

0

Entering edit mode

Guest User ★ 13k

@guest-user-4897

Last seen 10.8 years ago

Hi, I'm working with the latest annotation set from Ensembl (ens73) which is based on the patched GRCh37.p12 assembly. I have retrieved the transcript set from Ensembl biomart using GenomicFeatures:makeTranscriptDbFromBiomart(). One of the things I'd like to do is create a DNAStringSet of sequences for all the transcripts in my transcriptDB using the GenomicFeatures:extractTranscriptsFromGenome() function. This takes a TDB and a BSGenomes object as input. However, the latest BSGenomes available for the human is UCSC.hg19, which is unpatched. When I run the command, I get the error: Error in .getOneSeqFromBSgenomeMultipleSequences(x, names[i], start[i], : sequence ^1$ not found I'm pretty sure this is because the transcriptDB contains sequences (patches/scaffolds) that are present in the patched assembly but not the base GRCh37 assembly. Additionally the nomenclature is different between UCSC and Ensembl (e.g. chr1 ; 1). I see a few options here. One obvious one would be to stick with UCSC hg19 and use the UCSC ensGene table, but others in my working group are using ens73 so this is a suboptimal solution. Is there an updated BSGenome available for GRCh37.p12, or an easy way to forge one? Have others encountered this issue? Thanks! Tim Johnstone -- output of sessionInfo(): R version 3.0.2 (2013-09-25) Platform: x86_64-apple-darwin10.8.0 (64-bit) locale: [1] en_US.UTF-8/en_US.UTF-8/en_US.UTF-8/C/en_US.UTF-8/en_US.UTF-8 attached base packages: [1] grDevices datasets splines tcltk utils parallel stats graphics methods [10] base other attached packages: [1] BSgenome.Hsapiens.UCSC.hg19_1.3.19 BiocInstaller_1.12.0 [3] data.table_1.8.10 Hmisc_3.12-2 [5] Formula_1.1-1 survival_2.37-4 [7] plyr_1.8 gdata_2.13.2 [9] ShortRead_1.20.0 lattice_0.20-24 [11] rtracklayer_1.22.0 Rsamtools_1.14.1 [13] BSgenome.Drerio.UCSC.danRer7_1.3.17 BSgenome_1.30.0 [15] Biostrings_2.30.0 lessR_2.9.7 [17] GenomicFeatures_1.14.0 AnnotationDbi_1.24.0 [19] Biobase_2.22.0 GenomicRanges_1.14.3 [21] XVector_0.2.0 IRanges_1.20.4 [23] BiocGenerics_0.8.0 loaded via a namespace (and not attached): [1] biomaRt_2.18.0 bitops_1.0-6 car_2.0-19 cluster_1.14.4 [5] DBI_0.2-7 foreign_0.8-57 grid_3.0.2 gtools_3.1.0 [9] hwriter_1.3 latticeExtra_0.6-26 leaps_2.9 MASS_7.3-29 [13] MBESS_3.3.3 nnet_7.3-7 RColorBrewer_1.0-5 RCurl_1.95-4.1 [17] rpart_4.1-3 RSQLite_0.11.4 stats4_3.0.2 tools_3.0.2 [21] XML_3.95-0.2 zlibbioc_1.8.0 -- Sent via the guest posting facility at bioconductor.org.

Annotation BSgenome TranscriptDb biomaRt BSgenome Annotation BSgenome TranscriptDb biomaRt • 3.8k views

ADD COMMENT • link updated 11.7 years ago by Hervé Pagès 16k • written 11.7 years ago by Guest User ★ 13k

score 0 · Answer 1 · 2013-10-30

0

Entering edit mode

Hervé Pagès 16k

@herve-pages-1542

Last seen 2 days ago

Seattle, WA, United States

Hi Tim, On 10/30/2013 09:39 PM, Timothy Johnstone [guest] wrote: > Hi, > I'm working with the latest annotation set from Ensembl (ens73) which is based on the patched GRCh37.p12 assembly. I have retrieved the transcript set from Ensembl biomart using GenomicFeatures:makeTranscriptDbFromBiomart(). > > One of the things I'd like to do is create a DNAStringSet of sequences for all the transcripts in my transcriptDB using the GenomicFeatures:extractTranscriptsFromGenome() function. This takes a TDB and a BSGenomes object as input. However, the latest BSGenomes available for the human is UCSC.hg19, which is unpatched. When I run the command, I get the error: > Error in .getOneSeqFromBSgenomeMultipleSequences(x, names[i], start[i], : > sequence ^1$ not found > > I'm pretty sure this is because the transcriptDB contains sequences (patches/scaffolds) that are present in the patched assembly but not the base GRCh37 assembly. Additionally the nomenclature is different between UCSC and Ensembl (e.g. chr1 ; 1). Based on the GenBank ID (or RefSeq ID) of the 24 chromosomes listed here: http://www.ncbi.nlm.nih.gov/assembly/GCA_000001405.1/ http://www.ncbi.nlm.nih.gov/assembly/GCA_000001405.13/ the GRCh37 primary assembly (i.e. the exact sequence of the 24 chromosomes) has not changed between GRCh37 and GRCh37.p12. Also according to the note here: http://genome.ucsc.edu/cgi- bin/hgGateway?clade=mammal&org=Human&db=hg19 chrM is not the same in GRCh37 as in the UCSC hg19 assembly. So if you want to use Ensembl 73 (based on GRCh37.p12) to extract the transcript sequences from a BSgenome package, and if restricting this extraction to the 24 chromosomes is OK with you, then you can use the BSgenome.Hsapiens.UCSC.hg19 package. But since Ensembl and UCSC use different chromosome naming conventions, you will have to adjust the chromosome names. Here is how to proceed: library(GenomicFeatures) txdb <- makeTranscriptDbFromBiomart(biomart="ensembl", dataset="hsapiens_gene_ensembl") ex_by_tx <- exonsBy(txdb, by="tx") seqlevels(ex_by_tx, force=TRUE) <- c(1:22, "X", "Y") # restrict seqlevels(ex_by_tx) <- paste0("chr", seqlevels(ex_by_tx)) # rename library(BSgenome.Hsapiens.UCSC.hg19) tx_seqs <- extractTranscriptsFromGenome(Hsapiens, ex_by_tx) # extract If you want to extract all the transcripts reported in Ensembl 73 (i.e. not only the 195381 transcripts located on the 24 chromosomes but also the 18913 transcripts located on the alt loci and unplaced contigs), then you'll need to forge a BSGenome data package for GRCh37.p12. Look at the BSgenomeForge vignette in the BSgenome software package for how to do this. It looks like one extra difficulty with GRCh37.p12 (and previous GRCh37 patches) is that there doesn't seem to be an easy way to download the complete assembly from NCBI as FASTA file(s). Last time I checked (was around GRCh37.p7), it seemed that only the diff information was available in some kind of specialized file format so my impression was that one had to download the diff and then use it to "patch" the GRCh37 sequences in order to produce the GRCh37.p12 FASTA sequences. All this didn't seem completely straightforward to me but I didn't really spend enough time to know. Ideally, we should improve the BSgenome infrastructure so that we can also follow that diff model i.e. it would be great if we were able to forge light-weight BSgenome data packages that only contain the diff against another BSgenome data package. For example BSgenome.Hsapiens.GRCh37.p12 could be forged as a light-weight package that only contains the diff against (and depends on) BSgenome.Hsapiens.GRCh37, which would be a heavy-weight package (i.e. it contains the full sequences). From a user point of view there would be no difference between light-weight and heavy-weight, that is, both would contain a BSgenome object that looks like it contains a full assembly. But from a data management point of view that would make a big difference as we would then be able to keep up with the pace of GRCh37 patches by providing one BSgenome data package for each of them at a low cost. Cheers, H. > > I see a few options here. One obvious one would be to stick with UCSC hg19 and use the UCSC ensGene table, but others in my working group are using ens73 so this is a suboptimal solution. Is there an updated BSGenome available for GRCh37.p12, or an easy way to forge one? Have others encountered this issue? > > Thanks! > Tim Johnstone > > -- output of sessionInfo(): > > R version 3.0.2 (2013-09-25) > Platform: x86_64-apple-darwin10.8.0 (64-bit) > > locale: > [1] en_US.UTF-8/en_US.UTF-8/en_US.UTF-8/C/en_US.UTF-8/en_US.UTF-8 > > attached base packages: > [1] grDevices datasets splines tcltk utils parallel stats graphics methods > [10] base > > other attached packages: > [1] BSgenome.Hsapiens.UCSC.hg19_1.3.19 BiocInstaller_1.12.0 > [3] data.table_1.8.10 Hmisc_3.12-2 > [5] Formula_1.1-1 survival_2.37-4 > [7] plyr_1.8 gdata_2.13.2 > [9] ShortRead_1.20.0 lattice_0.20-24 > [11] rtracklayer_1.22.0 Rsamtools_1.14.1 > [13] BSgenome.Drerio.UCSC.danRer7_1.3.17 BSgenome_1.30.0 > [15] Biostrings_2.30.0 lessR_2.9.7 > [17] GenomicFeatures_1.14.0 AnnotationDbi_1.24.0 > [19] Biobase_2.22.0 GenomicRanges_1.14.3 > [21] XVector_0.2.0 IRanges_1.20.4 > [23] BiocGenerics_0.8.0 > > loaded via a namespace (and not attached): > [1] biomaRt_2.18.0 bitops_1.0-6 car_2.0-19 cluster_1.14.4 > [5] DBI_0.2-7 foreign_0.8-57 grid_3.0.2 gtools_3.1.0 > [9] hwriter_1.3 latticeExtra_0.6-26 leaps_2.9 MASS_7.3-29 > [13] MBESS_3.3.3 nnet_7.3-7 RColorBrewer_1.0-5 RCurl_1.95-4.1 > [17] rpart_4.1-3 RSQLite_0.11.4 stats4_3.0.2 tools_3.0.2 > [21] XML_3.95-0.2 zlibbioc_1.8.0 > > -- > Sent via the guest posting facility at bioconductor.org. > -- Hervé Pagès Program in Computational Biology Division of Public Health Sciences Fred Hutchinson Cancer Research Center 1100 Fairview Ave. N, M1-B514 P.O. Box 19024 Seattle, WA 98109-1024 E-mail: hpages at fhcrc.org Phone: (206) 667-5791 Fax: (206) 667-1319

ADD COMMENT • link 11.7 years ago Hervé Pagès 16k

0

Entering edit mode

On 10/30/2013 11:54 PM, Hervé Pagès wrote: > Hi Tim, > > On 10/30/2013 09:39 PM, Timothy Johnstone [guest] wrote: >> Hi, >> I'm working with the latest annotation set from Ensembl (ens73) which is based >> on the patched GRCh37.p12 assembly. I have retrieved the transcript set from >> Ensembl biomart using GenomicFeatures:makeTranscriptDbFromBiomart(). >> >> One of the things I'd like to do is create a DNAStringSet of sequences for all >> the transcripts in my transcriptDB using the >> GenomicFeatures:extractTranscriptsFromGenome() function. This takes a TDB and >> a BSGenomes object as input. However, the latest BSGenomes available for the >> human is UCSC.hg19, which is unpatched. When I run the command, I get the error: >> Error in .getOneSeqFromBSgenomeMultipleSequences(x, names[i], start[i], : >> sequence ^1$ not found >> >> I'm pretty sure this is because the transcriptDB contains sequences >> (patches/scaffolds) that are present in the patched assembly but not the base >> GRCh37 assembly. Additionally the nomenclature is different between UCSC and >> Ensembl (e.g. chr1 ; 1). > > Based on the GenBank ID (or RefSeq ID) of the 24 chromosomes listed > here: > > http://www.ncbi.nlm.nih.gov/assembly/GCA_000001405.1/ > http://www.ncbi.nlm.nih.gov/assembly/GCA_000001405.13/ > > the GRCh37 primary assembly (i.e. the exact sequence of the 24 > chromosomes) has not changed between GRCh37 and GRCh37.p12. > > Also according to the note here: > > http://genome.ucsc.edu/cgi- bin/hgGateway?clade=mammal&org=Human&db=hg19 > > chrM is not the same in GRCh37 as in the UCSC hg19 assembly. > > So if you want to use Ensembl 73 (based on GRCh37.p12) to extract the > transcript sequences from a BSgenome package, and if restricting this > extraction to the 24 chromosomes is OK with you, then you can use the > BSgenome.Hsapiens.UCSC.hg19 package. But since Ensembl and UCSC use > different chromosome naming conventions, you will have to adjust the > chromosome names. Here is how to proceed: > > library(GenomicFeatures) > txdb <- makeTranscriptDbFromBiomart(biomart="ensembl", > dataset="hsapiens_gene_ensembl") > ex_by_tx <- exonsBy(txdb, by="tx") > seqlevels(ex_by_tx, force=TRUE) <- c(1:22, "X", "Y") # restrict > seqlevels(ex_by_tx) <- paste0("chr", seqlevels(ex_by_tx)) # rename > > library(BSgenome.Hsapiens.UCSC.hg19) > tx_seqs <- extractTranscriptsFromGenome(Hsapiens, ex_by_tx) # extract Herve -- is there a way to use the fasta file from Ensembl? Download, compress and index it using Rsamtools::razip and Rsamtools::indexFa. The data source is http://www.ensembl.org/info/data/ftp/index.html Then use FaFile to reference the file and getSeq,FaFile-method to get sequences. library(Rsamtools) fa = FaFile("/path/to/GRCh37.72.dna.toplevel.fa.rz") and then I think library(GenomicFeatures) dnaList = lapply(seqlevels(ex_by_tx), function(lvl) { ## getSeq on the FaFile ## extractTranscripts with seq and ex_by_tx } ## munge to dnaList to DNAStringSet > > If you want to extract all the transcripts reported in Ensembl 73 (i.e. > not only the 195381 transcripts located on the 24 chromosomes but also > the 18913 transcripts located on the alt loci and unplaced contigs), > then you'll need to forge a BSGenome data package for GRCh37.p12. Look > at the BSgenomeForge vignette in the BSgenome software package for how > to do this. It looks like one extra difficulty with GRCh37.p12 (and > previous GRCh37 patches) is that there doesn't seem to be an easy way > to download the complete assembly from NCBI as FASTA file(s). Last > time I checked (was around GRCh37.p7), it seemed that only the diff > information was available in some kind of specialized file format so > my impression was that one had to download the diff and then use it > to "patch" the GRCh37 sequences in order to produce the GRCh37.p12 > FASTA sequences. All this didn't seem completely straightforward to me > but I didn't really spend enough time to know. > > Ideally, we should improve the BSgenome infrastructure so that > we can also follow that diff model i.e. it would be great if we were Or use AnnotationHub to retrieve the fasta and gtf files and smooth out the approach above? AnnotationHub is lagging a little as we try to improve the behind-the-scenes infrastructure, but the indexed FaFile (and gtf) for release 72 is available with library(AnnotationHub) hub = AnnotationHub() fa = hub$ensembl.release.72.fasta.homo_sapiens.dna.Homo_sapiens.GRCh37.72.d na.toplevel.fa.rz gtf = hub$ensembl.release.72.gtf.homo_sapiens.Homo_sapiens.GRCh37.72.gtf_0.0 .1.RData Martin > able to forge light-weight BSgenome data packages that only contain > the diff against another BSgenome data package. For example > BSgenome.Hsapiens.GRCh37.p12 could be forged as a light-weight > package that only contains the diff against (and depends on) > BSgenome.Hsapiens.GRCh37, which would be a heavy-weight package > (i.e. it contains the full sequences). From a user point of view > there would be no difference between light-weight and heavy-weight, > that is, both would contain a BSgenome object that looks like it > contains a full assembly. But from a data management point of view > that would make a big difference as we would then be able to keep > up with the pace of GRCh37 patches by providing one BSgenome data > package for each of them at a low cost. > > Cheers, > H. > > >> >> I see a few options here. One obvious one would be to stick with UCSC hg19 and >> use the UCSC ensGene table, but others in my working group are using ens73 so >> this is a suboptimal solution. Is there an updated BSGenome available for >> GRCh37.p12, or an easy way to forge one? Have others encountered this issue? >> >> Thanks! >> Tim Johnstone >> >> -- output of sessionInfo(): >> >> R version 3.0.2 (2013-09-25) >> Platform: x86_64-apple-darwin10.8.0 (64-bit) >> >> locale: >> [1] en_US.UTF-8/en_US.UTF-8/en_US.UTF-8/C/en_US.UTF-8/en_US.UTF-8 >> >> attached base packages: >> [1] grDevices datasets splines tcltk utils parallel stats >> graphics methods >> [10] base >> >> other attached packages: >> [1] BSgenome.Hsapiens.UCSC.hg19_1.3.19 BiocInstaller_1.12.0 >> [3] data.table_1.8.10 Hmisc_3.12-2 >> [5] Formula_1.1-1 survival_2.37-4 >> [7] plyr_1.8 gdata_2.13.2 >> [9] ShortRead_1.20.0 lattice_0.20-24 >> [11] rtracklayer_1.22.0 Rsamtools_1.14.1 >> [13] BSgenome.Drerio.UCSC.danRer7_1.3.17 BSgenome_1.30.0 >> [15] Biostrings_2.30.0 lessR_2.9.7 >> [17] GenomicFeatures_1.14.0 AnnotationDbi_1.24.0 >> [19] Biobase_2.22.0 GenomicRanges_1.14.3 >> [21] XVector_0.2.0 IRanges_1.20.4 >> [23] BiocGenerics_0.8.0 >> >> loaded via a namespace (and not attached): >> [1] biomaRt_2.18.0 bitops_1.0-6 car_2.0-19 cluster_1.14.4 >> [5] DBI_0.2-7 foreign_0.8-57 grid_3.0.2 gtools_3.1.0 >> [9] hwriter_1.3 latticeExtra_0.6-26 leaps_2.9 MASS_7.3-29 >> [13] MBESS_3.3.3 nnet_7.3-7 RColorBrewer_1.0-5 RCurl_1.95-4.1 >> [17] rpart_4.1-3 RSQLite_0.11.4 stats4_3.0.2 tools_3.0.2 >> [21] XML_3.95-0.2 zlibbioc_1.8.0 >> >> -- >> Sent via the guest posting facility at bioconductor.org. >> > -- Computational Biology / Fred Hutchinson Cancer Research Center 1100 Fairview Ave. N. PO Box 19024 Seattle, WA 98109 Location: Arnold Building M1 B861 Phone: (206) 667-2793

ADD REPLY • link 11.7 years ago Martin Morgan 25k

0

Entering edit mode

On 10/31/2013 01:33 AM, Martin Morgan wrote: > On 10/30/2013 11:54 PM, Hervé Pagès wrote: >> Hi Tim, >> >> On 10/30/2013 09:39 PM, Timothy Johnstone [guest] wrote: >>> Hi, >>> I'm working with the latest annotation set from Ensembl (ens73) which >>> is based >>> on the patched GRCh37.p12 assembly. I have retrieved the transcript >>> set from >>> Ensembl biomart using GenomicFeatures:makeTranscriptDbFromBiomart(). >>> >>> One of the things I'd like to do is create a DNAStringSet of >>> sequences for all >>> the transcripts in my transcriptDB using the >>> GenomicFeatures:extractTranscriptsFromGenome() function. This takes a >>> TDB and >>> a BSGenomes object as input. However, the latest BSGenomes available >>> for the >>> human is UCSC.hg19, which is unpatched. When I run the command, I get >>> the error: >>> Error in .getOneSeqFromBSgenomeMultipleSequences(x, names[i], >>> start[i], : >>> sequence ^1$ not found >>> >>> I'm pretty sure this is because the transcriptDB contains sequences >>> (patches/scaffolds) that are present in the patched assembly but not >>> the base >>> GRCh37 assembly. Additionally the nomenclature is different between >>> UCSC and >>> Ensembl (e.g. chr1 ; 1). >> >> Based on the GenBank ID (or RefSeq ID) of the 24 chromosomes listed >> here: >> >> http://www.ncbi.nlm.nih.gov/assembly/GCA_000001405.1/ >> http://www.ncbi.nlm.nih.gov/assembly/GCA_000001405.13/ >> >> the GRCh37 primary assembly (i.e. the exact sequence of the 24 >> chromosomes) has not changed between GRCh37 and GRCh37.p12. >> >> Also according to the note here: >> >> >> http://genome.ucsc.edu/cgi- bin/hgGateway?clade=mammal&org=Human&db=hg19 >> >> chrM is not the same in GRCh37 as in the UCSC hg19 assembly. >> >> So if you want to use Ensembl 73 (based on GRCh37.p12) to extract the >> transcript sequences from a BSgenome package, and if restricting this >> extraction to the 24 chromosomes is OK with you, then you can use the >> BSgenome.Hsapiens.UCSC.hg19 package. But since Ensembl and UCSC use >> different chromosome naming conventions, you will have to adjust the >> chromosome names. Here is how to proceed: >> >> library(GenomicFeatures) >> txdb <- makeTranscriptDbFromBiomart(biomart="ensembl", >> dataset="hsapiens_gene_ensembl") >> ex_by_tx <- exonsBy(txdb, by="tx") >> seqlevels(ex_by_tx, force=TRUE) <- c(1:22, "X", "Y") # restrict >> seqlevels(ex_by_tx) <- paste0("chr", seqlevels(ex_by_tx)) # rename >> >> library(BSgenome.Hsapiens.UCSC.hg19) >> tx_seqs <- extractTranscriptsFromGenome(Hsapiens, ex_by_tx) # extract > > Herve -- is there a way to use the fasta file from Ensembl? Download, > compress and index it using Rsamtools::razip and Rsamtools::indexFa. The > data source is > > http://www.ensembl.org/info/data/ftp/index.html > > Then use FaFile to reference the file and getSeq,FaFile-method to get > sequences. > > library(Rsamtools) > fa = FaFile("/path/to/GRCh37.72.dna.toplevel.fa.rz") > > and then I think > > library(GenomicFeatures) > dnaList = lapply(seqlevels(ex_by_tx), function(lvl) { > ## getSeq on the FaFile > ## extractTranscripts with seq and ex_by_tx > } > ## munge to dnaList to DNAStringSet Thanks Martin for the pointer to the Ensembl FASTA files. Sure the "getSeq + extractTranscripts" road is worth exploring. Maybe a simpler road though is the "scanFa + extractTranscripts" road. With the latter, the main loop loads the full DNA sequences one at a time from the FaFile object and calls extractTranscripts() on it. The advanced capabilities of getSeq() are not needed for this and lower level scanFa() is enough. The "getSeq + extractTranscripts" road could try to minimize I/O cost and memory usage by loading only the transcript regions from each sequence. This is appealing but could make the code a little bit more complicated. Ideally, it sounds like this complexity should be handled by something like an "extractTranscripts" method for FaFile objects (and extractTranscriptsFromGenome() should become the "extractTranscripts" method for BSgenome objects) but we don't have this yet. So only exploring the simpler "scanFa + extractTranscripts" road for now. Some bumps I ran into while exploring this road: First it's hard to know which Ensembl FASTA file is most appropriate. For example with the Homo_sapiens.GRCh37.73.dna.toplevel.fa.gz file (seems like a good candidate) I ran into the following gotchas: - The file contains only 287 sequences while the 'txdb' object obtained with makeTranscriptDbFromBiomart() reports transcripts on 648 distinct sequences. After some investigation I found that 393 of these sequences are LRG sequences which are not considered "top-level" and thus not included in the Homo_sapiens.GRCh37.73.dna.toplevel.fa.gz file. - One surprise is that even though the FASTA file uses the same sequence names as 'txdb', those names are just the prefixes of long and complicated record headers. For example, for chromosome 1, 2, and patch HG1007_PATCH: > 1 dna:chromosome chromosome:GRCh37:1:1:249250621:1 REF > 2 dna:chromosome chromosome:GRCh37:2:1:243199373:1 REF > HG1007_PATCH dna:chromosome chromosome:GRCh37:HG1007_PATCH:243059660:243125680:1 PATCH_FIX However this was just a surprise but not an issue at all when accessing this file thru a FaFile object: library(Rsamtools) file_path <- "Homo_sapiens.GRCh37.73.dna.toplevel.fa.gz" indexFa(file_path) # the file is huge (1GB compressed, 30GB uncompressed) # so creating the index takes a while fafile <- FaFile(file_path) Then: > seqlengths(fafile)[c(1, 12, 23)] 1 2 HG1007_PATCH 249250621 243199373 243135680 Everything looks good :-) - Finally, a show stopper is the following error I get while trying to load chromosome 6 or any part of it or anything located after chromosome 6 in the file: > scanFa(fafile, param=GRanges("6", IRanges(1, 20))) Error in value[[3L]](cond) : key 62 (char '>') not in lookup table file: Homo_sapiens.GRCh37.73.dna.toplevel.fa.gz > scanFa(fafile, param=GRanges("7", IRanges(1, 20))) Error in value[[3L]](cond) : key 62 (char '>') not in lookup table file: Homo_sapiens.GRCh37.73.dna.toplevel.fa.gz Loading the chromosomes located before chromosome 6 (e.g. chrom 1-5 and 10-22) seems to work. Anyway, here is what the "scanFa + extractTranscripts" road looks like: library(Rsamtools) file_path <- "Homo_sapiens.GRCh37.73.dna.toplevel.fa.gz" indexFa(file_path) fafile <- FaFile(file_path) ex_by_tx <- exonsBy(txdb, by="tx", use.names=TRUE) shared_seqlevels <- intersect(seqlevels(fafile), seqlevels(ex_by_tx)) ## The loop will work fine until it chokes on chromosome 6. txseqs_per_seqlevel <- lapply(shared_seqlevels, function(seqlevel) { ## Load the sequence from the FaFile object. param <- GRanges(seqlevel, IRanges(1, seqlengths(fafile)[[seqlevel]])) seq <- scanFa(fafile, param=param)[[1L]] ## Keep only genomic ranges of transcripts on the current sequence. ex_by_tx2 <- ex_by_tx seqlevels(ex_by_tx2, force=TRUE) <- seqlevel ## We need to filter out transcripts made of exons that come from both ## strands because extractTranscripts() doesn't support them. nrun <- elementLengths(runLength(strand(ex_by_tx2))) ex_by_tx2 <- ex_by_tx2[which(nrun == 1)] ## Call extractTranscripts(). exonStarts <- start(ex_by_tx2) exonEnds <- end(ex_by_tx2) strand <- unlist(runValue(strand(ex_by_tx2))) txseqs <- extractTranscripts(seq, exonStarts, exonEnds, strand) names(txseqs) <- names(ex_by_tx2) txseqs }) ## Put together all the transcript sequences. txseqs <- do.call("c", txseqs_per_seqlevel) Cheers, H. > > > >> >> If you want to extract all the transcripts reported in Ensembl 73 (i.e. >> not only the 195381 transcripts located on the 24 chromosomes but also >> the 18913 transcripts located on the alt loci and unplaced contigs), >> then you'll need to forge a BSGenome data package for GRCh37.p12. Look >> at the BSgenomeForge vignette in the BSgenome software package for how >> to do this. It looks like one extra difficulty with GRCh37.p12 (and >> previous GRCh37 patches) is that there doesn't seem to be an easy way >> to download the complete assembly from NCBI as FASTA file(s). Last >> time I checked (was around GRCh37.p7), it seemed that only the diff >> information was available in some kind of specialized file format so >> my impression was that one had to download the diff and then use it >> to "patch" the GRCh37 sequences in order to produce the GRCh37.p12 >> FASTA sequences. All this didn't seem completely straightforward to me >> but I didn't really spend enough time to know. >> >> Ideally, we should improve the BSgenome infrastructure so that >> we can also follow that diff model i.e. it would be great if we were > > Or use AnnotationHub to retrieve the fasta and gtf files and smooth out > the approach above? AnnotationHub is lagging a little as we try to > improve the behind-the-scenes infrastructure, but the indexed FaFile > (and gtf) for release 72 is available with > > library(AnnotationHub) > hub = AnnotationHub() > fa = > hub$ensembl.release.72.fasta.homo_sapiens.dna.Homo_sapiens.GRCh37.72 .dna.toplevel.fa.rz > > gtf = > hub$ensembl.release.72.gtf.homo_sapiens.Homo_sapiens.GRCh37.72.gtf_0 .0.1.RData > > > Martin > >> able to forge light-weight BSgenome data packages that only contain >> the diff against another BSgenome data package. For example >> BSgenome.Hsapiens.GRCh37.p12 could be forged as a light-weight >> package that only contains the diff against (and depends on) >> BSgenome.Hsapiens.GRCh37, which would be a heavy-weight package >> (i.e. it contains the full sequences). From a user point of view >> there would be no difference between light-weight and heavy-weight, >> that is, both would contain a BSgenome object that looks like it >> contains a full assembly. But from a data management point of view >> that would make a big difference as we would then be able to keep >> up with the pace of GRCh37 patches by providing one BSgenome data >> package for each of them at a low cost. >> >> Cheers, >> H. >> >> >>> >>> I see a few options here. One obvious one would be to stick with UCSC >>> hg19 and >>> use the UCSC ensGene table, but others in my working group are using >>> ens73 so >>> this is a suboptimal solution. Is there an updated BSGenome available >>> for >>> GRCh37.p12, or an easy way to forge one? Have others encountered this >>> issue? >>> >>> Thanks! >>> Tim Johnstone >>> >>> -- output of sessionInfo(): >>> >>> R version 3.0.2 (2013-09-25) >>> Platform: x86_64-apple-darwin10.8.0 (64-bit) >>> >>> locale: >>> [1] en_US.UTF-8/en_US.UTF-8/en_US.UTF-8/C/en_US.UTF-8/en_US.UTF-8 >>> >>> attached base packages: >>> [1] grDevices datasets splines tcltk utils parallel stats >>> graphics methods >>> [10] base >>> >>> other attached packages: >>> [1] BSgenome.Hsapiens.UCSC.hg19_1.3.19 BiocInstaller_1.12.0 >>> [3] data.table_1.8.10 Hmisc_3.12-2 >>> [5] Formula_1.1-1 survival_2.37-4 >>> [7] plyr_1.8 gdata_2.13.2 >>> [9] ShortRead_1.20.0 lattice_0.20-24 >>> [11] rtracklayer_1.22.0 Rsamtools_1.14.1 >>> [13] BSgenome.Drerio.UCSC.danRer7_1.3.17 BSgenome_1.30.0 >>> [15] Biostrings_2.30.0 lessR_2.9.7 >>> [17] GenomicFeatures_1.14.0 AnnotationDbi_1.24.0 >>> [19] Biobase_2.22.0 GenomicRanges_1.14.3 >>> [21] XVector_0.2.0 IRanges_1.20.4 >>> [23] BiocGenerics_0.8.0 >>> >>> loaded via a namespace (and not attached): >>> [1] biomaRt_2.18.0 bitops_1.0-6 car_2.0-19 >>> cluster_1.14.4 >>> [5] DBI_0.2-7 foreign_0.8-57 grid_3.0.2 >>> gtools_3.1.0 >>> [9] hwriter_1.3 latticeExtra_0.6-26 leaps_2.9 >>> MASS_7.3-29 >>> [13] MBESS_3.3.3 nnet_7.3-7 RColorBrewer_1.0-5 >>> RCurl_1.95-4.1 >>> [17] rpart_4.1-3 RSQLite_0.11.4 stats4_3.0.2 >>> tools_3.0.2 >>> [21] XML_3.95-0.2 zlibbioc_1.8.0 >>> >>> -- >>> Sent via the guest posting facility at bioconductor.org. >>> >> > > -- Hervé Pagès Program in Computational Biology Division of Public Health Sciences Fred Hutchinson Cancer Research Center 1100 Fairview Ave. N, M1-B514 P.O. Box 19024 Seattle, WA 98109-1024 E-mail: hpages at fhcrc.org Phone: (206) 667-5791 Fax: (206) 667-1319

ADD REPLY • link 11.7 years ago Hervé Pagès 16k