Question

Extracting Sequences from Fasta Files with Rsamtools

2

Entering edit mode

jlee ▴ 20

@jlee-23653

Last seen 3.7 years ago

Hi,

I'm running into an error when trying to extract sequences from fasta files using Rsamtools.

I have a fasta file and a GRanges object, and I'd like to extract the sequences corresponding to the ranges from the fasta file. When using the fasta file for the whole genome, scanFa throws an error. Sequences up to chr 5: 32500021 can be retrieved, but trying to retrieve any sequence after chr 5: 32500021 results in an error.

However, when the fasta file for only chromosome 5 is used, the sequences corresponding to both ranges can be retrieved.

fasta files can be downloaded here: ftp://ftp.ensembl.org/pub/release-91/fasta/homo_sapiens/dna/

> library(Rsamtools)
> library(GenomicRanges)

## Create reference to fasta files - one for whole genome, another containing just chr 5
> allSeq <- Rsamtools::FaFile('./Homo_sapiens.GRCh38.dna.primary_assembly.fa')
> chr5Seq <- Rsamtools::FaFile('./Homo_sapiens.GRCh38.dna.chromosome.5.fa')

## Create GRanges object
> gr <- GRanges(seqnames = '5', 
+               ranges = IRanges(start = c(32500021, 32500022), width = 1))

## When using the whole fasta file, record 1 works but record 2 fails 
> Rsamtools::scanFa(allSeq, gr) 
Error in value[[3L]](cond) :  record 2 (5:32500022-32500022) failed
  file: ./Homo_sapiens.GRCh38.dna.primary_assembly.fa

## When using the chr 5 fasta file, both records can be retrieved
> Rsamtools::scanFa(chr5Seq, gr) 
DNAStringSet object of length 2:
    width seq                                                                                     names               
[1]     1 T                                                                                       5
[2]     1 A                                                                                       5

Hope this example helps with finding the issue and thanks in advance! Session Info:


> sessionInfo()
R version 4.0.0 (2020-04-24)
Platform: x86_64-w64-mingw32/x64 (64-bit)
Running under: Windows 10 x64 (build 18362)

Matrix products: default

locale:
[1] LC_COLLATE=English_Singapore.1252  LC_CTYPE=English_Singapore.1252    LC_MONETARY=English_Singapore.1252
[4] LC_NUMERIC=C                       LC_TIME=English_Singapore.1252    

attached base packages:
[1] stats4    parallel  stats     graphics  grDevices utils     datasets  methods   base     

other attached packages:
[1] Rsamtools_2.4.0      Biostrings_2.56.0    XVector_0.28.0       GenomicRanges_1.40.0 GenomeInfoDb_1.24.0 
[6] IRanges_2.22.2       S4Vectors_0.26.1     BiocGenerics_0.34.0 

loaded via a namespace (and not attached):
[1] crayon_1.3.4           bitops_1.0-6           zlibbioc_1.34.0        rstudioapi_0.11        BiocParallel_1.22.0   
[6] tools_4.0.0            RCurl_1.98-1.2         compiler_4.0.0         GenomeInfoDbData_1.2.3

Rsamtools • 3.4k views

ADD COMMENT • link updated 3.5 years ago by Martin Morgan 25k • written 3.9 years ago by jlee ▴ 20

2

Entering edit mode

Hello,

Strangely, I ran your exact example (after pulling the genome files you linked) with no issues. That is, when scanning in the full genome I get the expected same output as when scanning the 5th chromosome:

Rsamtools::scanFa(allSeq, gr)
#DNAStringSet object of length 2:
#    width seq                                               names              
#[1]     1 T                                                 5
#[2]     1 A                                                 5

We are using the same package versions for GenomicRanges and Rsamtools and a similar (but not precisely the same) version of R, though the OS is different. I'd recommend experimenting with different versions of R, and running:

BiocManager::valid()

to check for possible installation issues (related to dependencies of GenomicRanges and Rsamtools, for example). I've attached my full session info at the bottom of this comment.

Best,

-Nick

P.S. I'm learning to help others.

session_info()
#─ Session info ───────────────────────────────────────────────────────────────
# setting  value
# version  R version 4.0.2 Patched (2020-06-24 r78746)
# os       CentOS Linux 7 (Core)
# system   x86_64, linux-gnu
# ui       X11
# language (EN)
# collate  en_US.UTF-8
# ctype    en_US.UTF-8
# tz       US/Eastern
# date     2020-11-04
#
#─ Packages ───────────────────────────────────────────────────────────────────
# package          * version  date       lib source
# assertthat         0.2.1    2019-03-21 [2] CRAN (R 4.0.0)
# BiocGenerics     * 0.34.0   2020-04-27 [2] Bioconductor
# BiocParallel       1.22.0   2020-04-27 [2] Bioconductor
# Biostrings       * 2.56.0   2020-04-27 [2] Bioconductor
# bitops             1.0-6    2013-08-17 [2] CRAN (R 4.0.0)
# cli                2.1.0    2020-10-12 [2] CRAN (R 4.0.2)
# crayon             1.3.4    2017-09-16 [2] CRAN (R 4.0.0)
# fansi              0.4.1    2020-01-08 [2] CRAN (R 4.0.0)
# GenomeInfoDb     * 1.24.2   2020-06-15 [2] Bioconductor
# GenomeInfoDbData   1.2.3    2020-05-18 [2] Bioconductor
# GenomicRanges    * 1.40.0   2020-04-27 [2] Bioconductor
# glue               1.4.2    2020-08-27 [1] CRAN (R 4.0.2)
# IRanges          * 2.22.2   2020-05-21 [2] Bioconductor
# RCurl              1.98-1.2 2020-04-18 [2] CRAN (R 4.0.0)
# Rsamtools        * 2.4.0    2020-04-27 [2] Bioconductor
# S4Vectors        * 0.26.1   2020-05-16 [2] Bioconductor
# sessioninfo      * 1.1.1    2018-11-05 [2] CRAN (R 4.0.0)
# withr              2.3.0    2020-09-22 [2] CRAN (R 4.0.2)
# XVector          * 0.28.0   2020-04-27 [2] Bioconductor
# zlibbioc           1.34.0   2020-04-27 [2] Bioconductor
#
#[1] /users/neagles/R/4.0
#[2] /jhpce/shared/jhpce/core/conda/miniconda3-4.6.14/envs/svnR-4.0/R/4.0/lib64/R/site-library
#[3] /jhpce/shared/jhpce/core/conda/miniconda3-4.6.14/envs/svnR-4.0/R/4.0/lib64/R/library

ADD REPLY • link 3.5 years ago nick.eagles ▴ 100

score 1 · Answer 1 · 2020-11-04

Actually, I'd guess this is a bug / limitation of windows (see this issue on Github), and there is an integer overflow after 2^32-1 bases have been accessed. I believe this is a regression in Rsamtools (a solution was identified in a previous version) but I'm not sure when this will be addressed.

But check out the AnnotationHub resource

library(AnnotationHub)
hub = AnnotationHub()
query(hub, "Homo_sapiens.GRCh38.dna.primary_assembly")
dna = hub[["AH84628"]]

and then use getSeq()

> getSeq(dna, GRanges("5:32500021-32500029"))
DNAStringSet object of length 1:
    width seq
[1]     9 TAATCTTTT

(I'm not 100% sure that 2bit is supported on Windows, and I don't know whether it suffers from the same type of integer overflow problem).

score 1 · Answer 2 · 2020-11-04

Hi,

I suspect this is a Windows specific issue in the HTSlib code (Rsamtools::scanFa() is implemented on top of HTSlib). We'll look into this.

In the meantime an easy workaround is to use getSeq() on the BSgenome object from the BSgenome.Hsapiens.UCSC.hg38 package:

library(BSgenome.Hsapiens.UCSC.hg38)
gr <- GRanges(c("chr5:32500021", "chr5:32500022"))
getSeq(BSgenome.Hsapiens.UCSC.hg38, gr)
# DNAStringSet object of length 2:
#      width seq
#  [1]     1 T
#  [2]     1 A

Note that BSgenome.Hsapiens.UCSC.hg38 is based on the _original_ GRCh38 assembly whereas the Homo_sapiens.GRCh38.dna.primary_assembly.fa.gz file in Ensembl release 91 is based on a patched version of the GRCh38 assembly. Not clear what patch level they used though because Ensembl only started to report what exact assemblies they use for each release starting with release 96. For example release 96 is based on GRCh38.p12 and release 101 (the latest release) is based on GRCh38.p13.

BTW any reason you're not using the latest release i.e. release 101?

Anyways, patched assemblies only add new sequences so don't alter the sequences of the original (non-patched) release. In particular the sequences of chromosomes 1-22, X, Y, MT are the same in all patched assemblies.

If you want to change the names of the sequences in BSgenome.Hsapiens.UCSC.hg38 to use the NCBI names, do:

seqlevelsStyle(BSgenome.Hsapiens.UCSC.hg38) <- "NCBI"
seqinfo(BSgenome.Hsapiens.UCSC.hg38)
# Seqinfo object with 595 sequences (1 circular) from GRCh38.p12 genome:
#   seqnames       seqlengths isCircular     genome
#   1               248956422      FALSE GRCh38.p12
#   2               242193529      FALSE GRCh38.p12
#   3               198295559      FALSE GRCh38.p12
#   4               190214555      FALSE GRCh38.p12
#   5               181538259      FALSE GRCh38.p12
#  ...                   ...        ...        ...
#   HSCHR22_5_CTG1     153027      FALSE GRCh38.p12
#   HSCHR22_6_CTG1     155930      FALSE GRCh38.p12
#   HSCHR22_7_CTG1     174749      FALSE GRCh38.p12
#   HSCHR22_8_CTG1     145162      FALSE GRCh38.p12
#   HSCHRX_3_CTG7      188004      FALSE GRCh38.p12

Then:

gr <- GRanges(c("5:32500021", "5:32500022"))
getSeq(BSgenome.Hsapiens.UCSC.hg38, gr)
# DNAStringSet object of length 2:
#    width seq
# [1]     1 T
# [2]     1 A

Another advantage of using BSgenome.Hsapiens.UCSC.hg38 instead of a naked FASTA file is that the sequence data in BSgenome.Hsapiens.UCSC.hg38 is stored in a 2bit file which is generally much more efficient than FASTA for random access.

Cheers,

H.

score 0 · Answer 3 · 2020-11-04

Sequences up to chr 5: 32500021 can be retrieved, but trying to retrieve any sequence after chr 5: 32500021 results in an error.

This suggests that your fasta is botched. Can you very with a quick grep that you have sequence after that point in the genome? You should examine that exact point in your file, maybe something is wrong there.