Jan T Kim
Dear All,
I just noticed that the sequence column in the data frame returned by
biomaRt's getSequence function contains the string "Sequence unavailable"
under certain conditions. Here's a demo:
library("biomaRt")
ggMart <- useDataset("ggallus_gene_ensembl", mart = useMart("ensembl"))
getSequence(id = "ENSGALG00000017787", type = "ensembl_gene_id", seqType = "coding", mart = ggMart)
This gives me:
                coding    ensembl_gene_id
1 Sequence unavailable ENSGALG00000017787
The Ensembl gene in question is an RNA component of a telomerase [1],
which explains why there is no (protein-)coding sequence.
Nonetheless, I was surprised that this is indicated by inserting a
human-readable string rather than the machine-recognisable value NA.
In more detail: I didn't notice the few "Sequence unavailable" entries
in a table of thousands of rows and wrote everything into a FASTA file.
Only when something further down the pipeline objected to the "e"
(fortunately not an IUPAC code) was my attention drawn to the problem.
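A simple sanity check on the sequence strings before writing the FASTA file would catch such entries early. The following is only a sketch: the helper name isValidNucleotide is made up for illustration, and the accepted character set (the IUPAC nucleotide codes plus the gap character "-") is an assumption that may need adjusting for other data.

```r
# Hypothetical helper: TRUE where a string consists only of IUPAC nucleotide
# codes (A, C, G, T, U and the ambiguity codes R Y S W K M B D H V N) or the
# gap character "-". NA and anything else yields FALSE.
isValidNucleotide <- function(x) {
  !is.na(x) & grepl("^[ACGTURYSWKMBDHVN-]+$", x, ignore.case = TRUE)
}

# The placeholder fails the check because of the space and letters such as "e".
isValidNucleotide(c("ATGGCCTAA", "Sequence unavailable", NA))
# TRUE FALSE FALSE
```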
So this post is to (1) alert others to this sometimes surprising feature
and (2) suggest replacing the "Sequence unavailable" entries with NAs,
should the biomaRt authors happen to read this.
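In the meantime, a workaround along the following lines may help. It is a minimal sketch that assumes the placeholder is exactly the string "Sequence unavailable" and that the sequence column is named after the seqType ("coding" here), as in the output above. The data frame is built by hand so that the example runs without querying Ensembl; its second row and the helper name markUnavailable are invented for illustration.

```r
# Hand-built stand-in for a getSequence() result (no network access needed).
# The second row is invented for illustration only.
seqs <- data.frame(
  coding = c("Sequence unavailable", "ATGGCCTAA"),
  ensembl_gene_id = c("ENSGALG00000017787", "HYPOTHETICAL_GENE_2"),
  stringsAsFactors = FALSE
)

# Hypothetical helper: replace the placeholder string with a proper NA.
# %in% is used instead of == so that existing NAs in the column are harmless.
markUnavailable <- function(df, seqColumn, placeholder = "Sequence unavailable") {
  df[[seqColumn]][df[[seqColumn]] %in% placeholder] <- NA_character_
  df
}

seqs <- markUnavailable(seqs, "coding")

# Rows without a sequence can now be dropped with the usual NA tools
# before anything is written to a FASTA file.
usable <- seqs[!is.na(seqs$coding), ]
```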
Best regards, Jan
[1] http://www.ensembl.org/Gallus_gallus/Gene/Summary?db=core;g=ENSGALG00000017787;r=9:19428817-19428871;t=ENSGALT00000028494
--
+- Jan T. Kim ------------------------------------------------------+
| email: jttkim at gmail.com                                        |
| WWW:   http://www.jtkim.dreamhosters.com/                         |
*-----=< hierarchical systems are for files, not for humans >=-----*