Annotating a sequence, new to bioconductor
1
0
Entering edit mode
Teeps • 0
@teeps-17582
Last seen 3.6 years ago

New to BioC, just trying to learn. I’ve got a sequence file (FASTA) and an annotation file (BED) with the various ranges of genes and other features. I would like to use the BED file to annotate my sequence, and then be able to pull out portions of the sequence not found in the BE file (eg, 200 bp next to one of the genes listed in the BED file). I’m just starting out with BioC, right now I’m learning the GenomicRanges, Biostrings, and rtracklayer packages. I can import the BED file (bedfile <- import(“bedfile.txt”, format = bed) and the portion of the FASTA file that I want (sequenceICareAbout <- readDNAStringSet(“myFile.fasta”, nrec = 1, skip = 8, seek.first.red = FALSE, use.names = FALSE), but I have no real idea how to combine them. I looked a bit at the annotatr package and annorate_regions, but it’s asking for a GRanges object and I only see how to manually make a fake GRanges object with example found in the An Introduction to the Genomic Ranges Package page. Thank you for all the help!

annotate • 562 views
0
Entering edit mode
@james-w-macdonald-5106
Last seen 16 hours ago
United States

Perhaps you could be a bit clearer about what you are trying to do. It looks like you want to read in a FASTA file and then use a BED file to extract sequences from that? If so, then do note that reading in a BED file using import results in a GRanges object to start with:

> z <- import("http://hgdownload.soe.ucsc.edu/goldenPath/hg19/encodeDCC/wgEncodeBroadHmm/wgEncodeBroadHmmGm12878HMM.bed.gz")
> z
GRanges object with 571339 ranges and 4 metadata columns:
seqnames              ranges strand |              name     score
<Rle>           <IRanges>  <Rle> |       <character> <numeric>
[1]     chr1         10001-10600      * | 15_Repetitive/CNV         0
[2]     chr1         10601-11137      * | 13_Heterochrom/lo         0
[3]     chr1         11138-11737      * |       8_Insulator         0
[4]     chr1         11738-11937      * |       11_Weak_Txn         0
[5]     chr1         11938-12137      * |   7_Weak_Enhancer         0
...      ...                 ...    ... .               ...       ...
[571335]     chrX 155251807-155255406      * | 10_Txn_Elongation         0
[571336]     chrX 155255407-155257806      * |       11_Weak_Txn         0
[571337]     chrX 155257807-155258806      * |       8_Insulator         0
[571338]     chrX 155258807-155259606      * | 13_Heterochrom/lo         0
[571339]     chrX 155259607-155260406      * | 15_Repetitive/CNV         0
itemRgb               thick
<character>           <IRanges>
[1]     #F5F5F5         10001-10600
[2]     #F5F5F5         10601-11137
[3]     #0ABEFE         11138-11737
[4]     #99FF66         11738-11937
[5]     #FFFC04         11938-12137
...         ...                 ...
[571335]     #00B050 155251807-155255406
[571336]     #99FF66 155255407-155257806
[571337]     #0ABEFE 155257807-155258806
[571338]     #F5F5F5 155258807-155259606
[571339]     #F5F5F5 155259607-155260406
-------
seqinfo: 23 sequences from an unspecified genome; no seqlengths

And if you then want the sequences, you can use getSeq

> library(BSgenome.Hsapiens.UCSC.hg19)

> getSeq(Hsapiens, z[1:5,])
A DNAStringSet instance of length 5
width seq
[1]   600 TAACCCTAACCCTAACCCTAACCCTAACCCTAAC...CTCCGCCCTCGCGGTGCTCTCCGGGTCTGTGCT
[2]   537 GAGGAGAACGCAACTCCGCCGTTGCAAAGGCGCG...CGTCACGGTGGCGCGGCGCAGAGACGGGTAGAA
[3]   600 CCTCAGTAATCCGAAAAGCCGGGATCGACCGCCC...GCTGGGGCCTGGCCATGTGTATTTTTTTAAATT
[4]   200 TCCACTGATGATTTTGCTGCATGGCCGGTGTTGA...TTCTGTTCATGTGTATTTGCTGTCTCTTAGCCC
[5]   200 AGACTTCCCGTGTCCTTTCCACCGGGCCTTTGAG...ATGGGCCATTGTTCATCTTCTGGCCCCTGTTGT
0
Entering edit mode

Hi, thank you for the help and sorry for the late reply, I had an emergency come up and didn’t have time for BioC this last week.

I’ll try to clarify a bit more. I can make a GRange object with the BED file like you showed, but it’s not referencing a genome. If I do seqinfo(x) under “genome” it’s “<NA>, so while my GRange object knows there is an IRange from 14010-14201 with a character name for the gene and which strand it’s on, it doesn’t actually have a sequence (ATGCG etc) for that range. My actual sequence is in a FASTA file on my computer, it’s not public ally available yet. There are several chromosomes in the file, I know how to pull out the portion I want as a string with readDNAStringSet, but I don’t know how to make that DNAStringSet with my sequence be the genome the GRanges object uses.

Thank you again for the help!

0
Entering edit mode

If you just have a FASTA file, you might need to forge a BSGenome object. You might also need to add the seqnames to your GRanges object. But conceptually if you have a BSGenome object and a GRanges object, you can get the sequences like I showed you.

0
Entering edit mode

Thank you for the help. I worked on making a BSGenome object for a bit but it was a bit beyond me right now. I found I could pull it up by, after making sure all the names were the same, using "getSeq(myDNAStringSet, myGRangesObject). I can pull out individual genes marked on the BED file I used for the GRanges object. Just working now on how to get chunks of DNA between them so I know what to put in my plasmids.

Thank you for all the help!