Question

Annotating a sequence, new to bioconductor

Teeps • 0

New to BioC, just trying to learn. I've got a sequence file (FASTA) and an annotation file (BED) with the ranges of genes and other features. I would like to use the BED file to annotate my sequence, and then be able to pull out portions of the sequence not found in the BED file (e.g., 200 bp next to one of the genes listed in the BED file). I'm just starting out with BioC; right now I'm learning the GenomicRanges, Biostrings, and rtracklayer packages. I can import the BED file (bedfile <- import("bedfile.txt", format = "bed")) and the portion of the FASTA file that I want (sequenceICareAbout <- readDNAStringSet("myFile.fasta", nrec = 1, skip = 8, seek.first.rec = FALSE, use.names = FALSE)), but I have no real idea how to combine them. I looked a bit at the annotatr package and annotate_regions, but it asks for a GRanges object, and I only see how to build one by hand, as in the example on the "An Introduction to the Genomic Ranges Package" page. Thank you for all the help!

Updated 7.3 years ago by James W. MacDonald • written 7.3 years ago by Teeps

score 0 · Answer 1 · 2018-09-28

Perhaps you could be a bit clearer about what you are trying to do. It looks like you want to read in a FASTA file and then use a BED file to extract sequences from that? If so, then do note that reading in a BED file using import results in a GRanges object to start with:

> z <- import("http://hgdownload.soe.ucsc.edu/goldenPath/hg19/encodeDCC/wgEncodeBroadHmm/wgEncodeBroadHmmGm12878HMM.bed.gz")
> z
GRanges object with 571339 ranges and 4 metadata columns:
           seqnames              ranges strand |              name     score
              <Rle>           <IRanges>  <Rle> |       <character> <numeric>
       [1]     chr1         10001-10600      * | 15_Repetitive/CNV         0
       [2]     chr1         10601-11137      * | 13_Heterochrom/lo         0
       [3]     chr1         11138-11737      * |       8_Insulator         0
       [4]     chr1         11738-11937      * |       11_Weak_Txn         0
       [5]     chr1         11938-12137      * |   7_Weak_Enhancer         0
       ...      ...                 ...    ... .               ...       ...
  [571335]     chrX 155251807-155255406      * | 10_Txn_Elongation         0
  [571336]     chrX 155255407-155257806      * |       11_Weak_Txn         0
  [571337]     chrX 155257807-155258806      * |       8_Insulator         0
  [571338]     chrX 155258807-155259606      * | 13_Heterochrom/lo         0
  [571339]     chrX 155259607-155260406      * | 15_Repetitive/CNV         0
               itemRgb               thick
           <character>           <IRanges>
       [1]     #F5F5F5         10001-10600
       [2]     #F5F5F5         10601-11137
       [3]     #0ABEFE         11138-11737
       [4]     #99FF66         11738-11937
       [5]     #FFFC04         11938-12137
       ...         ...                 ...
  [571335]     #00B050 155251807-155255406
  [571336]     #99FF66 155255407-155257806
  [571337]     #0ABEFE 155257807-155258806
  [571338]     #F5F5F5 155258807-155259606
  [571339]     #F5F5F5 155259607-155260406
  -------
  seqinfo: 23 sequences from an unspecified genome; no seqlengths

And if you then want the sequences, you can use getSeq:

> library(BSgenome.Hsapiens.UCSC.hg19)

> getSeq(Hsapiens, z[1:5,])
  A DNAStringSet instance of length 5
    width seq
[1]   600 TAACCCTAACCCTAACCCTAACCCTAACCCTAAC...CTCCGCCCTCGCGGTGCTCTCCGGGTCTGTGCT
[2]   537 GAGGAGAACGCAACTCCGCCGTTGCAAAGGCGCG...CGTCACGGTGGCGCGGCGCAGAGACGGGTAGAA
[3]   600 CCTCAGTAATCCGAAAAGCCGGGATCGACCGCCC...GCTGGGGCCTGGCCATGTGTATTTTTTTAAATT
[4]   200 TCCACTGATGATTTTGCTGCATGGCCGGTGTTGA...TTCTGTTCATGTGTATTTGCTGTCTCTTAGCCC
[5]   200 AGACTTCCCGTGTCCTTTCCACCGGGCCTTTGAG...ATGGGCCATTGTTCATCTTCTGGCCCCTGTTGT
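
If the sequence comes from your own FASTA file rather than a BSgenome package, something along these lines might work for pulling out the bases next to each feature. This is only a rough sketch on toy data: chrToy, geneA and geneB are made-up names, and the flank width is 4 rather than 200 so the result is easy to check by eye. With real data, seqs would come from readDNAStringSet() and genes from import(); the assumption is that the names of the sequences match the seqnames in the BED file, which means reading the FASTA with use.names = TRUE.

```r
library(GenomicRanges)
library(Biostrings)

# Toy stand-in for the FASTA record (hypothetical name and sequence).
# The name must match the seqnames used in the BED file.
seqs <- DNAStringSet(c(chrToy = "ACGTACGTACGTTTGACCAGGATCCATGCA"))

# Toy stand-in for the GRanges returned by import("bedfile.txt", format = "bed")
genes <- GRanges("chrToy",
                 IRanges(start = c(11, 21), end = c(15, 26)),
                 name = c("geneA", "geneB"))

# Ranges covering the 4 bp immediately before the start of each feature.
# Unstranded ("*") ranges are treated like "+" by flank().
up <- flank(genes, width = 4, start = TRUE)

# Pick the matching sequence for each range, then cut out the flanking bases.
# Note: subseq() does not reverse-complement, so minus-strand features
# would need reverseComplement() applied afterwards.
upSeq <- subseq(seqs[as.character(seqnames(up))],
                start = start(up), end = end(up))
names(upSeq) <- up$name
upSeq
```

Using start = FALSE in flank() gives the bases after the end of each feature instead. Flanks that run off the end of the sequence will make subseq() fail, so trimming them first may be needed on real data.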