Get sequences for GRCh38 genome based on TxDb object locations
Entering edit mode
Diego Diez ▴ 760
Last seen 11 months ago

I am trying to obtain the promoter sequences of several genes using the GRCh38 genome and the information about transcripts locations from the TxDb package. Unlike hg19 assemblies, the genome package for the GRCh38 assembly is from NCBI and the transcript package from UCSF:



I have tried the following without success:

> library("BSgenome.Hsapiens.NCBI.GRCh38")
> library("TxDb.Hsapiens.UCSC.hg38.knownGene")
> Hsapiens <- BSgenome.Hsapiens.NCBI.GRCh38
> txdb <- TxDb.Hsapiens.UCSC.hg38.knownGene

# get transcript locations by gene:
>  chr_loc <- transcriptsBy(txdb, by = "gene")

# get sequence 300 bp upstream of TSS for 10 first genes.
>  pro_seq <- getPromoterSeq(chr_loc[1:10], Hsapiens, upstream=300, downstream=0)

Error in .getOneSeqFromBSgenomeMultipleSequences(x, name, start, NA, width,  :   sequence chr19 not found
In addition: Warning message:
In .Seqinfo.mergexy(x, y) :
The 2 combined objects have no sequence levels in common. (Use suppressWarnings() to suppress this warning.)

# Change chromosome naming style:
> seqlevelsStyle(Hsapiens) = "UCSC"

# Try again:
> pro_seq <- getPromoterSeq(chr_loc[1:10], Hsapiens, upstream=300, downstream=0)

Error in mergeNamedAtomicVectors(genome(x), genome(y), what = c("sequence",  :
  sequences chr1, chr2, chr3, chr4, chr5, chr6, chr7, chr8, chr9, chr10, chr11, chr12, chr13, chr14, chr15, chr16, chr17, chr18, chr19, chr20, chr21, chr22, chrX, chrY, chrM have incompatible genomes:
  - in 'x': GRCh38, GRCh38, GRCh38, GRCh38, GRCh38, GRCh38, GRCh38, GRCh38, GRCh38, GRCh38, GRCh38, GRCh38, GRCh38, GRCh38, GRCh38, GRCh38, GRCh38, GRCh38, GRCh38, GRCh38, GRCh38, GRCh38, GRCh38, GRCh38, GRCh38
  - in 'y': hg38, hg38, hg38, hg38, hg38, hg38, hg38, hg38, hg38, hg38, hg38, hg38, hg38, hg38, hg38, hg38, hg38, hg38, hg38, hg38, hg38, hg38, hg38, hg38, hg38

# Try change genome name:
> genome(Hsapiens) <- "hg38"

Error in `seqinfo<-`(`*tmp*`, value = <S4 object of class "Seqinfo">) :
  'new2old' must be specified when replacing the 'seqinfo' of a BSgenome object

The same occurs if I try to modify the genome information in the TxDb object. The help page for ?genome suggests this is the correct usage. Is there anything I am missing here? How can I get promoter sequences for the GRCh38 genome?

Incidentally, the description of the SNPlocs.Hsapiens.dbSNP141.GRCh38 package suggests there exists a `BSgenome.Hsapiens.UCSC.hg38` package that would solve this issue. However this package is not available:

> biocLite("BSgenome.Hsapiens.UCSC.hg38", type="source")
Using Bioconductor version 3.0 (BiocInstaller 1.16.1), R version 3.1.2.
Installing package(s) 'BSgenome.Hsapiens.UCSC.hg38'
Warning message:
package ‘BSgenome.Hsapiens.UCSC.hg38’ is not available (for R version 3.1.2)

The output of sessionInfo()

> sessionInfo()
R version 3.1.2 (2014-10-31)
Platform: x86_64-apple-darwin13.4.0 (64-bit)

[1] C/UTF-8/C/C/C/C

attached base packages:
[1] stats4    parallel  stats     graphics  grDevices utils     datasets 
[8] methods   base     

other attached packages:
 [1] TxDb.Hsapiens.UCSC.hg38.knownGene_3.0.0
 [2] GenomicFeatures_1.18.3                 
 [3] AnnotationDbi_1.28.1                   
 [4] Biobase_2.26.0                         
 [5] BSgenome.Hsapiens.NCBI.GRCh38_1.3.1000 
 [6] BSgenome_1.34.0                        
 [7] rtracklayer_1.26.2                     
 [8] Biostrings_2.34.1                      
 [9] XVector_0.6.0                          
[10] GenomicRanges_1.18.3                   
[11] GenomeInfoDb_1.2.3                     
[12] IRanges_2.0.1                          
[13] S4Vectors_0.4.0                        
[14] BiocGenerics_0.12.1                    

loaded via a namespace (and not attached):
 [1] BBmisc_1.8              BatchJobs_1.5           BiocParallel_1.0.0     
 [4] DBI_0.3.1               GenomicAlignments_1.2.1 RCurl_1.95-4.5         
 [7] RSQLite_1.0.0           Rsamtools_1.18.2        XML_3.98-1.1           
[10] base64enc_0.1-2         biomaRt_2.22.0          bitops_1.0-6           
[13] brew_1.0-6              checkmate_1.5.1         codetools_0.2-9        
[16] digest_0.6.6            fail_1.2                foreach_1.4.2          
[19] iterators_1.0.7         sendmailR_1.2-1         stringr_0.6.2          
[22] tools_3.1.2             zlibbioc_1.12.0        
bsgenome genomeinfodb transcriptdb • 3.7k views
Entering edit mode
Last seen 3 hours ago
Seattle, WA, United States

Hi Diego,

We will make a BSgenome package for hg38 soon. In the mean time you need to rename the sequences in the BSgenome object you get from the BSgenome.Hsapiens.NCBI.GRCh38 package. Look at the example in the man page for fetchExtendedChromInfoFromUCSC() (in the GenomeInfoDb package) for how to do this (just copy/paste the code). When you are done, do:

genome@seqinfo@genome[] <- "hg38"  # HACK!!!


# Seqinfo object with 455 sequences (1 circular) from hg38 genome:
#   seqnames             seqlengths isCircular genome
#   chr1                  248956422      FALSE   hg38
#   chr2                  242193529      FALSE   hg38
#   chr3                  198295559      FALSE   hg38
#   chr4                  190214555      FALSE   hg38
#   chr5                  181538259      FALSE   hg38
#   ...                         ...        ...    ...
#   chr19_KI270930v1_alt     200773      FALSE   hg38
#   chr19_KI270931v1_alt     170148      FALSE   hg38
#   chr19_KI270932v1_alt     215732      FALSE   hg38
#   chr19_KI270933v1_alt     170537      FALSE   hg38
#   chr19_GL000209v2_alt     177381      FALSE   hg38

Now you should be able to use BSgenome object genome with any TxDb object based on hg38.



Entering edit mode

Thanks Herve, this works.


Login before adding your answer.

Traffic: 293 users visited in the last hour
Help About
Access RSS

Use of this site constitutes acceptance of our User Agreement and Privacy Policy.

Powered by the version 2.3.6