biomaRt: getSequence returns "Sequence unavailable" where I'd expect NA
Jan T Kim ▴ 50
@jan-t-kim-5650
Dear All,

I just noticed that the sequence columns of the data frame returned by biomaRt's getSequence function contain the string "Sequence unavailable" under certain conditions. Here's a demo:

    library("biomaRt")
    ggMart <- useDataset("ggallus_gene_ensembl", mart = useMart("ensembl"))
    getSequence(id = "ENSGALG00000017787", type = "ensembl_gene_id", seqType = "coding", mart = ggMart)

This gives me:

                    coding    ensembl_gene_id
    1 Sequence unavailable ENSGALG00000017787

The ENSEMBL gene in question is an RNA component of a telomerase [1], which explains why there is no (protein) coding sequence. Nonetheless, I was surprised that this is indicated by inserting a human-readable string rather than the machine-recognisable value NA.

In more detail: I didn't notice the few "Sequence unavailable" entries in a table of thousands of rows and wrote everything into a FASTA file. Only when something further down the pipeline tripped over the "e" (fortunately non-IUPAC) was my attention drawn to the problem.

So this post is to (1) alert others to this sometimes surprising feature and (2) suggest replacing the "Sequence unavailable" entries with NAs, should the biomaRt authors happen to read this.

Best regards, Jan

[1] http://www.ensembl.org/Gallus_gallus/Gene/Summary?db=core;g=ENSGALG00000017787;r=9:19428817-19428871;t=ENSGALT00000028494

--
Jan T. Kim
email: jttkim at gmail.com
WWW: http://www.jtkim.dreamhosters.com/
hierarchical systems are for files, not for humans
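For anyone hitting the same issue before it is changed upstream, one possible workaround is to convert the placeholder to NA yourself before writing any FASTA output. The following is only a sketch: `mask_unavailable` is a hypothetical helper (not part of biomaRt), the placeholder text is assumed to be exactly the string shown in the output above, and the data frame is a hand-made stand-in for a getSequence() result, with an invented second row, so that the example runs without querying Ensembl.

```r
# Hypothetical helper (not part of biomaRt): replace the placeholder
# string in one column of a data frame with NA.
mask_unavailable <- function(seqs, seq_col,
                             placeholder = "Sequence unavailable") {
  # %in% returns FALSE for entries that are already NA, so the
  # logical index never contains NA itself.
  hit <- seqs[[seq_col]] %in% placeholder
  seqs[[seq_col]][hit] <- NA_character_
  seqs
}

# Stand-in for a getSequence() result. The first row mirrors the output
# quoted in the post; the second row is made up for illustration.
seqs <- data.frame(
  coding = c("Sequence unavailable", "ATGGCCTAA"),
  ensembl_gene_id = c("ENSGALG00000017787", "GENE_WITH_CDS"),
  stringsAsFactors = FALSE
)

cleaned <- mask_unavailable(seqs, "coding")

# Keep only rows that actually carry a sequence before exporting.
with_seq <- cleaned[!is.na(cleaned$coding), ]
```

With a real getSequence() result, the column name passed as `seq_col` would be whatever was given as `seqType` ("coding" in the demo above), and filtering on `is.na()` keeps the placeholder out of the FASTA file.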