Question

Biostrings: retrieve 3'UTR sequence with Transcript

aspenaure • 0

@aspenaure-12732

Last seen 7.7 years ago

Hi,

I'm working with miRNA data and the Biostrings R package, and I have an issue: I want to retrieve the 3' UTR sequence for a set of gene IDs. This is my code:

> ensembl <- useMart("ensembl", dataset=as.character(data_sel[i]))
> seq_new <- biomaRt::getSequence(seqType='3utr', mart=ensembl, type=gen_id_ref, id=gen_id)

And the result (truncated):

3utr                      ensembl_gene_id   
1 ENSG00000139618         GCATTTGCAAAGGCGACAATAAA....

So far, so good, but I wonder whether there is a way to retrieve the transcript ID as well, in addition to "3utr" and "ensembl_gene_id", to get:

3utr                 ensembl_gene_id                ensembl_transcript_id

1 ENSG00000139618    GCATTTGCAAAGGCGACAATAAA....    ENST00000544455

Thank you in advance.

Fernando V.

R Biostrings sequence • 2.5k views

ADD COMMENT • link updated 7.7 years ago by Mike Smith ★ 6.6k • written 7.7 years ago by aspenaure • 0

Hi Fernando,

AFAICT this is a biomaRt question, not a Biostrings question. Please make sure to use an appropriate title and tags for your question. This will increase your chances of drawing attention from the right people and getting a useful answer.

Cheers,

H.

ADD REPLY • link 7.7 years ago Hervé Pagès 16k

score 0 · Answer 1 · 2017-03-31

The getSequence() function is a bit inflexible in what attributes it will return. Internally it's just calling getBM() with some preset values, so you can try using that function directly, e.g.:

library(biomaRt)
ensembl <- useMart("ensembl", dataset = "hsapiens_gene_ensembl")

results <- getBM(attributes = c('ensembl_gene_id',
                                'ensembl_transcript_id',
                                '3utr'),
                 filters = 'ensembl_gene_id',
                 values = 'ENSG00000139618',
                 mart = ensembl)

I'll trim the output to 20 characters to print it here:

> sapply(results, strtrim, 20)
     3utr                   ensembl_gene_id   ensembl_transcript_id
[1,] "GCATTTGCAAAGGCGACAAT" "ENSG00000139618" "ENST00000544455"    
[2,] "Sequence unavailable" "ENSG00000139618" "ENST00000533776"    
[3,] "AAACACAACAAAACCATATT" "ENSG00000139618" "ENST00000528762"    
[4,] "Sequence unavailable" "ENSG00000139618" "ENST00000530893"    
[5,] "CCTCCCAAGTAGCTGGGACT" "ENSG00000139618" "ENST00000470094"    
[6,] "GCATTTGCAAAGGCGACAAT" "ENSG00000139618" "ENST00000380152"    
[7,] "Sequence unavailable" "ENSG00000139618" "ENST00000614259"

You might also have noticed that the column headers are incorrect in your output. This has been patched in the development version of biomaRt, which is why they now match for me.
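
Several transcripts in the getBM() output above come back as "Sequence unavailable" rather than with a sequence. Below is a minimal sketch of one way to drop those rows before further analysis. It assumes the result is a data frame with the three columns shown above; the `results` object here is a hand-built stand-in that reuses the truncated 20-character sequences from this thread so that it runs without a network connection, and the names `has_utr`, `utr_only`, and `utr_seqs` are purely illustrative.

```r
# Stand-in for the getBM() result shown above (sequences truncated to 20
# characters); in practice this would be the data frame returned by getBM().
results <- data.frame(
  `3utr` = c("GCATTTGCAAAGGCGACAAT", "Sequence unavailable",
             "AAACACAACAAAACCATATT", "Sequence unavailable",
             "CCTCCCAAGTAGCTGGGACT", "GCATTTGCAAAGGCGACAAT",
             "Sequence unavailable"),
  ensembl_gene_id = rep("ENSG00000139618", 7),
  ensembl_transcript_id = c("ENST00000544455", "ENST00000533776",
                            "ENST00000528762", "ENST00000530893",
                            "ENST00000470094", "ENST00000380152",
                            "ENST00000614259"),
  check.names = FALSE,       # keep the column name "3utr" unchanged
  stringsAsFactors = FALSE
)

# Keep only the transcripts for which a 3' UTR sequence was returned.
has_utr <- results[["3utr"]] != "Sequence unavailable"
utr_only <- results[has_utr, ]

# Named character vector: transcript ID -> 3' UTR sequence.
utr_seqs <- setNames(utr_only[["3utr"]], utr_only[["ensembl_transcript_id"]])
```

If the sequences are needed in Biostrings, a named character vector like `utr_seqs` can be passed to `Biostrings::DNAStringSet()`.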