Hello,
I am trying to get the locations of the 3'UTRs of all protein-coding Ensembl transcripts, but I noticed that some transcripts are missing from the output and that different queries give inconsistent results.
library(biomaRt)
grch37 = useMart(biomart="ENSEMBL_MART_ENSEMBL", host="grch37.ensembl.org", path="/biomart/martservice", dataset="hsapiens_gene_ensembl")
I run the following query to get the 3'UTR regions of all protein-coding transcripts:
chromosomes=c(1:22, "X", "Y")
utr3.coding=getBM(attributes=c('ensembl_gene_id', 'ensembl_transcript_id', 'chromosome_name', '3_utr_start', '3_utr_end', "transcript_biotype"), filters=c("transcript_biotype","chromosome_name"),values=list(c("protein_coding"), chromosomes), mart=grch37)
This query returns 115988 regions:
dim(utr3.coding)
[1] 115988 6
When I query without the "transcript_biotype" filter and instead filter the output myself, I get 130984 regions:
utr3.all=getBM(attributes=c('ensembl_gene_id', 'ensembl_transcript_id', 'chromosome_name', '3_utr_start', '3_utr_end', 'transcript_biotype'), filters=c("chromosome_name"),values=chromosomes, mart=grch37)
utr3.all.coding=utr3.all[utr3.all$transcript_biotype=="protein_coding",]
dim(utr3.all.coding)
[1] 130984 6
When I query a single chromosome, I again get a different set of regions:
utr3.ch17.coding=getBM(attributes=c('ensembl_gene_id', 'ensembl_transcript_id', 'chromosome_name', '3_utr_start', '3_utr_end', "transcript_biotype"), filters=c("transcript_biotype","chromosome_name"),values=list(c("protein_coding"), 17), mart=grch37)
dim(utr3.ch17.coding)
[1] 9535 6
When I extract the chromosome 17 regions from the first two (genome-wide) queries, I find that many of the transcripts are missing:
utr3.coding.17=utr3.coding[utr3.coding$chromosome_name==17,]
dim(utr3.coding.17)
[1] 6834 6
utr3.all.coding.17=utr3.all.coding[utr3.all.coding$chromosome_name==17,]
dim(utr3.all.coding.17)
[1] 8538 6
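The counts above are row counts, and a transcript may occupy more than one row, so comparing the result sets by transcript ID might help narrow down which transcripts actually differ. Below is a minimal sketch of that comparison. It uses toy data frames in place of the real getBM() output: the names res.a and res.b and the IDs T1 to T3 are made up for illustration and are not real Ensembl data.

```r
# Toy stand-ins for two getBM() results (hypothetical IDs, not real Ensembl data).
# check.names = FALSE keeps the column name "3_utr_start" as biomaRt would return it.
res.a <- data.frame(
  ensembl_transcript_id = c("T1", "T1", "T2", "T3"),
  `3_utr_start` = c(100, 300, 500, NA),
  check.names = FALSE,
  stringsAsFactors = FALSE
)
res.b <- data.frame(
  ensembl_transcript_id = c("T1", "T2"),
  `3_utr_start` = c(100, 500),
  check.names = FALSE,
  stringsAsFactors = FALSE
)

# Count distinct transcripts rather than rows
n.tx.a <- length(unique(res.a$ensembl_transcript_id))
n.tx.b <- length(unique(res.b$ensembl_transcript_id))

# Transcripts present in the first result but absent from the second
missing.in.b <- setdiff(res.a$ensembl_transcript_id, res.b$ensembl_transcript_id)

# Rows that carry no 3'UTR start coordinate
n.na.a <- sum(is.na(res.a[["3_utr_start"]]))
```

With the real data, res.a and res.b would be replaced by, for example, utr3.all.coding.17 and utr3.coding.17.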
Any ideas on what is causing this discrepancy? Thank you.