Question

Missing data after BiomaRt query

0

Entering edit mode

elodie.chapeaublanc • 0

@elodiechapeaublanc-7424

Last seen 9.1 years ago

France

Hi, 

I use often biomart r package in order to recover some informations about gene, transcript or exon.

I use an ensembl archive in order to recover exon info.

date="jan2013"
ensembl = useMart(host=paste(date,"archive.ensembl.org",sep="."), biomart="ENSEMBL_MART_ENSEMBL", dataset = "hsapiens_gene_ensembl")

When I do the following query, I obtain all ensembl_transcript_id associated to the ensembl_gene_id of interest : 

ENST_i=getBM(attributes=c("ensembl_gene_id","ensembl_transcript_id","cds_length"),filters="ensembl_gene_id",values="ENSG00000116171",mart=ensembl)

ENST_i
   ensembl_gene_id ensembl_transcript_id cds_length
1  ENSG00000116171       ENST00000371514       1644
2  ENSG00000116171       ENST00000478631       1101
3  ENSG00000116171       ENST00000528311       1401
4  ENSG00000116171       ENST00000371509       1512
5  ENSG00000116171       ENST00000407246       1572
6  ENSG00000116171       ENST00000371513        969
7  ENSG00000116171       ENST00000528809         NA
8  ENSG00000116171       ENST00000529363        671
9  ENSG00000116171       ENST00000473584         NA
10 ENSG00000116171       ENST00000430330        423
11 ENSG00000116171       ENST00000478274        421
12 ENSG00000116171       ENST00000484100        235
13 ENSG00000116171       ENST00000533119         97
14 ENSG00000116171       ENST00000435345        432
15 ENSG00000116171       ENST00000488965        180
16 ENSG00000116171       ENST00000408941        180


But when I do the same query with more ensembl_gene_id, I don't obtain the full informations and some data are missing 

ENST=getBM(attributes=c("ensembl_gene_id","ensembl_transcript_id","cds_length"),filters="ensembl_gene_id",values=unique(annot$ensembl_gene_id),mart=ensembl)


ENST[which(ENST$ensembl_gene_id=="ENSG00000116171"),]
      ensembl_gene_id ensembl_transcript_id cds_length
24456 ENSG00000116171       ENST00000407246       1572
24457 ENSG00000116171       ENST00000371513        969
24458 ENSG00000116171       ENST00000528809         NA
24459 ENSG00000116171       ENST00000529363        671
24460 ENSG00000116171       ENST00000473584         NA
24461 ENSG00000116171       ENST00000430330        423
24462 ENSG00000116171       ENST00000478274        421
24463 ENSG00000116171       ENST00000484100        235
24464 ENSG00000116171       ENST00000533119         97
24465 ENSG00000116171       ENST00000435345        432
24466 ENSG00000116171       ENST00000488965        180
24467 ENSG00000116171       ENST00000408941        180


Could you explain me why some data are missing ? Could you give the good way to retrieve full data  ? 

Thanks a lot.

Elodie

biomart bioconductor • 3.7k views

ADD COMMENT • link updated 9.1 years ago by Hervé Pagès 16k • written 9.1 years ago by elodie.chapeaublanc • 0

0

Entering edit mode

What is the value of unique(annot$ensembl_gene_id)?

ADD REPLY • link 9.1 years ago Julian Gehring ★ 1.3k

score 0 · Answer 1 · 2015-03-04

0

Entering edit mode

elodie.chapeaublanc • 0

@elodiechapeaublanc-7424

Last seen 9.1 years ago

France

The value of unique(annot$ensembl_gene_id) is a vector of ensembl_gene_id ( 43400 gene id )

ADD COMMENT • link 9.1 years ago elodie.chapeaublanc • 0

0

Entering edit mode

I cannot reproduce your example, in both cases the same result set is returned (equivalent with the results you showed for ENST). Please note that the query is very large, are always the same entries missing if you run this multiple times?

egid = getBM(attributes=c("ensembl_gene_id"), mart=ensembl)$ensembl_gene_id
length(egid) ## 62316

ENST_i=getBM(attributes=c("ensembl_gene_id","ensembl_transcript_id","cds_length"),filters="ensembl_gene_id",values="ENSG00000116171",mart=ensembl)
ENST=getBM(attributes=c("ensembl_gene_id","ensembl_transcript_id","cds_length"),filters="ensembl_gene_id",values=egid,mart=ensembl) ## takes very long
ENST_j = ENST[which(ENST$ensembl_gene_id=="ENSG00000116171"),]

## compare, ignore the rownames
rownames(ENST_j) = NULL
rownames(ENST_i) = NULL
all.equal(ENST_i, ENST_j) ## TRUE

sessionInfo()

R version 3.1.2 (2014-10-31)
Platform: x86_64-pc-linux-gnu (64-bit)

locale:
 [1] LC_CTYPE=en_US.UTF-8       LC_NUMERIC=C              
 [3] LC_TIME=de_DE.UTF-8        LC_COLLATE=en_US.UTF-8    
 [5] LC_MONETARY=de_DE.UTF-8    LC_MESSAGES=en_US.UTF-8   
 [7] LC_PAPER=de_DE.UTF-8       LC_NAME=C                 
 [9] LC_ADDRESS=C               LC_TELEPHONE=C            
[11] LC_MEASUREMENT=de_DE.UTF-8 LC_IDENTIFICATION=C       

attached base packages:
[1] stats     graphics  grDevices utils     datasets  methods   base     

other attached packages:
[1] biomaRt_2.22.0

loaded via a namespace (and not attached):
 [1] AnnotationDbi_1.28.1 Biobase_2.26.0       BiocGenerics_0.12.1 
 [4] bitops_1.0-6         DBI_0.3.1            GenomeInfoDb_1.2.4  
 [7] IRanges_2.0.1        parallel_3.1.2       RCurl_1.95-4.5      
[10] RSQLite_1.0.0        S4Vectors_0.4.0      stats4_3.1.2        
[13] tools_3.1.2          XML_3.98-1.1

ADD REPLY • link 9.1 years ago Julian Gehring ★ 1.3k

score 0 · Answer 2 · 2015-03-04

0

Entering edit mode

elodie.chapeaublanc • 0

@elodiechapeaublanc-7424

Last seen 9.1 years ago

France

Yes, my query take very long time.

Yes, if I run multi times my query, the same data are missing.

My sessioninfo is the following :

sessionInfo()

R version 3.1.2 (2014-10-31)
Platform: x86_64-pc-linux-gnu (64-bit)

locale:
[1] LC_CTYPE=fr_FR.UTF-8 LC_NUMERIC=C
[3] LC_TIME=fr_FR.UTF-8 LC_COLLATE=fr_FR.UTF-8
[5] LC_MONETARY=fr_FR.UTF-8 LC_MESSAGES=fr_FR.UTF-8
[7] LC_PAPER=fr_FR.UTF-8 LC_NAME=C
[9] LC_ADDRESS=C LC_TELEPHONE=C
[11] LC_MEASUREMENT=fr_FR.UTF-8 LC_IDENTIFICATION=C

attached base packages:
[1] stats graphics grDevices utils datasets methods base

other attached packages:
[1] biomaRt_2.22.0 snowfall_1.84-6 snow_0.3-13

loaded via a namespace (and not attached):
[1] AnnotationDbi_1.28.1 Biobase_2.26.0 BiocGenerics_0.12.1
[4] bitops_1.0-6 DBI_0.3.1 GenomeInfoDb_1.2.4
[7] IRanges_2.0.1 parallel_3.1.2 RCurl_1.95-4.5
[10] RSQLite_1.0.0 S4Vectors_0.4.0 stats4_3.1.2
[13] tools_3.1.2 XML_3.98-1.1

ADD COMMENT • link 9.1 years ago elodie.chapeaublanc • 0

0

Entering edit mode

Helping you on this will be hard as long we cannot reproduce this problem ourselves. Can you try to restrict the problem, for example:

Is data missing only for ENSG00000116171 or also other genes?
Is there a smaller example with the same behavior? Comparing against all genes is cumbersome.

ADD REPLY • link 9.1 years ago Julian Gehring ★ 1.3k

0

Entering edit mode

Data are missing also for gene ENSG00000100968

ADD REPLY • link 9.1 years ago elodie.chapeaublanc • 0

0

Entering edit mode

I see that the snow and snowfall packages are loaded. Did you use them for your queries?

ADD REPLY • link 9.1 years ago Julian Gehring ★ 1.3k

0

Entering edit mode

No package snowfall and snow are not use for this query but for another part of my script.

ADD REPLY • link 9.1 years ago elodie.chapeaublanc • 0

score 0 · Answer 3 · 2015-03-04

Hi Elodie,

Not that this will help you figure out why your call to biomaRt::getBM() doesn't return all the data you expect, but I would recommend a different approach. With the GenomicFeatures package you can import all the gene, transcript, exon, and CDS info into a TxDb object first. It's easy and takes only a few minutes (about 5) to make the object. The benefit is that querying the TxDb object is fast (now the data are local) and convenient, and you can save it and re-use it later, or share it with your collaborators:

library(GenomicFeatures)
date <- "jan2013"
host <- paste(date,"archive.ensembl.org",sep=".")
txdb <- makeTxDbFromBiomart("ENSEMBL_MART_ENSEMBL", dataset="hsapiens_gene_ensembl", host=host)

Save the object with saveDb() so you can re-use it later (load it with loadDb()).

If you use the devel version of Bioconductor (BioC 3.1, you'll need R-devel for that), it happens that I just added the transcriptLengths() function a couple of days ago that seems to be exactly what you need here:

tx_lengths <- transcriptLengths(txdb, with.cds_len=TRUE)
head(tx_lengths)
#   tx_id         tx_name         gene_id nexon tx_len cds_len
# 1     1 ENST00000456328 ENSG00000223972     3   1657       0
# 2     2 ENST00000515242 ENSG00000223972     3   1653       0
# 3     3 ENST00000518655 ENSG00000223972     4   1483       0
# 4     4 ENST00000450305 ENSG00000223972     6    632       0
# 5     5 ENST00000473358 ENSG00000243485     3    712       0
# 6     6 ENST00000469289 ENSG00000243485     2    535       0

To see the transcript and CDS lengths for gene ENSG00000116171:

tx_lengths[tx_lengths$gene_id == "ENSG00000116171", ]
#      tx_id         tx_name         gene_id nexon tx_len cds_len
# 2986  2986 ENST00000371514 ENSG00000116171    16   2811    1644
# 2987  2987 ENST00000478631 ENSG00000116171    17   5204    1101
# 2988  2988 ENST00000528311 ENSG00000116171    15   1896    1401
# 2989  2989 ENST00000371509 ENSG00000116171    15   1889    1512
# 2990  2990 ENST00000407246 ENSG00000116171    15   2239    1572
# 2991  2991 ENST00000371513 ENSG00000116171    11   2003     969
# 2992  2992 ENST00000528809 ENSG00000116171     7    570       0
# 2996  2996 ENST00000529363 ENSG00000116171     8    671     671
# 2997  2997 ENST00000473584 ENSG00000116171     6    582       0
# 2999  2999 ENST00000430330 ENSG00000116171     5    938     423
# 3000  3000 ENST00000408941 ENSG00000116171     4   1298     180
# 3001  3001 ENST00000478274 ENSG00000116171     4    969     421
# 3002  3002 ENST00000484100 ENSG00000116171     4    671     235
# 3003  3003 ENST00000533119 ENSG00000116171     4    742      97
# 3004  3004 ENST00000435345 ENSG00000116171     5    894     432
# 3005  3005 ENST00000488965 ENSG00000116171     3   3926     180

To get the transcript and CDS lengths for the list of genes you have in annot$ensembl_gene_id, you would do:

tx_lengths[tx_lengths$gene_id %in% annot$ensembl_gene_id, ]

See ?transcriptLengths in the GenomicFeatures package for more information.

Hope this helps,

H.