getBM from biomaRt inconsistencies found with .gtf files across multiple species
1
0
Entering edit mode
Benjamin ▴ 20
@7e36dbc0
Last seen 5 hours ago
Canada

I work with biomaRt in R most of the time, but I do use .gtf files for alignment of our NGS data sets. While cross checking some items I noticed that the number of ensembl_gene_id do not match the number of ensembl_gene_id's found in the Homo_sapiens.GRCh38.113.gtf file. Specifically there are 86402 unique ensembl_gene_id's from biomaRt and 78932 unique ensembl_gene_id's in the .gtf file. As an example, Gene NPY4R2 has two different ensembl_gene_id's from biomaRt (ENSG00000288157, ENSG00000264717) and it only has one ensembl_gene_id from the .gtf file (ENSG00000264717).

Furthermore, I have examined rat, mouse, cow and sheep genomes. The rat, mouse, and cow are consistent between the .gtf files and their biomaRt alternative, but sheep actually has more unique ensembl_gene_ids in the .gtf (30722) file than in biomaRt (27054). Furthermore the sheep ensembl_gene_ids follow two different naming schemes. In the Ovis_aries_rambouillet.ARS-UI_Ramb_v2.0.113.gtf file, the COX1 gene comes up as ENSOARG00020000017, where as biomaRt comes up as ENSOARG00000000016.

Any clarification would be greatly appreciated on this matter. Below, I have attached the R code I have used from biomaRt, as well as the python script I used to parse the .gtf files.

#getBM method (Human)
 ensembl <- biomaRt::useMart('ensembl')
  mart <- biomaRt::useDataset("hsapiens_gene_ensembl", ensembl)


  #View(listAttributes(mart))

  BMattribHuman <- biomaRt::getBM(attributes = c('ensembl_gene_id', 'external_gene_name'),#, 'go_id'),
                             filters = "ensembl_gene_id",
                             values = "",
                             mart = mart)

#getBM method (sheep)
 ensembl <- biomaRt::useMart('ensembl')
  mart <- biomaRt::useDataset("oaries_gene_ensembl", ensembl)


  #View(listAttributes(mart))

  BMattribAries <- biomaRt::getBM(attributes = c('ensembl_gene_id', 'external_gene_name'),#, 'go_id'),
                             filters = "ensembl_gene_id",
                             values = "",
                             mart = mart)
sessionInfo()

R version 4.4.0 (2024-04-24)
Platform: aarch64-apple-darwin20
Running under: macOS 15.3.2

Matrix products: default
BLAS:   /System/Library/Frameworks/Accelerate.framework/Versions/A/Frameworks/vecLib.framework/Versions/A/libBLAS.dylib 
LAPACK: /Library/Frameworks/R.framework/Versions/4.4-arm64/Resources/lib/libRlapack.dylib;  LAPACK version 3.12.0

locale:
[1] en_US.UTF-8/en_US.UTF-8/en_US.UTF-8/C/en_US.UTF-8/en_US.UTF-8

time zone: America/Toronto
tzcode source: internal

attached base packages:
[1] stats     graphics  grDevices utils     datasets  methods   base     

loaded via a namespace (and not attached):
 [1] rappdirs_0.3.3          generics_0.1.3          xml2_1.3.6             
 [4] RSQLite_2.3.9           stringi_1.8.4           hms_1.1.3              
 [7] digest_0.6.37           magrittr_2.0.3          evaluate_1.0.1         
[10] fastmap_1.2.0           blob_1.2.4              jsonlite_1.8.9         
[13] progress_1.2.3          AnnotationDbi_1.66.0    GenomeInfoDb_1.40.1    
[16] DBI_1.2.3               httr_1.4.7              purrr_1.0.2            
[19] UCSC.utils_1.0.0        Biostrings_2.72.1       httr2_1.0.7            
[22] cli_3.6.3               rlang_1.1.4             crayon_1.5.3           
[25] dbplyr_2.5.0            XVector_0.44.0          Biobase_2.64.0         
[28] bit64_4.5.2             withr_3.0.2             cachem_1.1.0           
[31] yaml_2.3.10             tools_4.4.0             memoise_2.0.1          
[34] dplyr_1.1.4             filelock_1.0.3          GenomeInfoDbData_1.2.12
[37] BiocGenerics_0.50.0     curl_6.1.0              vctrs_0.6.5            
[40] R6_2.5.1                png_0.1-8               stats4_4.4.0           
[43] lifecycle_1.0.4         BiocFileCache_2.12.0    zlibbioc_1.50.0        
[46] KEGGREST_1.44.1         stringr_1.5.1           S4Vectors_0.42.1       
[49] IRanges_2.38.1          bit_4.5.0.1             pkgconfig_2.0.3        
[52] pillar_1.10.1           glue_1.8.0              xfun_0.50              
[55] tibble_3.2.1            tidyselect_1.2.1        rstudioapi_0.17.1      
[58] knitr_1.49              htmltools_0.5.8.1       rmarkdown_2.29         
[61] compiler_4.4.0          prettyunits_1.2.0       biomaRt_2.60.1

Python script below:

# For running on macOS
# python3 -m venv pythonEnv
# source pythonEnv/bin/activate
# pip install pandas



import pandas as pd
import os

# Ask user for file name and location, set to wd
file = input("Enter the file name and path: ")  #"/Users/username/..../Homo_sapiens.GRCh38.113.gtf"
savename = input("Enter the name that you would like to save the file as: ")
filePath = "/".join(file.split("/")[:-1])
print(filePath)
print(f"Does the Path exist? {os.path.exists(file)}")

# Initialize variables
data = []
genebuild_last_updated = None

# Read the file and process lines
with open(file) as f:
    for line in f:
        if line.startswith("#!genebuild-last-updated"):
            genebuild_last_updated = line.strip().split()[-1]
        if line.startswith("#"):  # Skip comments
            continue
        cols = line.strip().split("\t")
        if cols[2] != "gene":
            continue
        attributes = cols[8].split("; ")
        attr_dict = {}
        for attr in attributes:
            if " " in attr:
                key, value = attr.split(" ", 1)
                attr_dict[key] = value.strip('"')
        data.append([attr_dict.get("gene_id"), attr_dict.get("gene_name")])

# Create a DataFrame
df = pd.DataFrame(data, columns=["Ensembl_ID", "GeneName"])
df = df.sort_values(by="Ensembl_ID", ascending=True)
df.drop_duplicates().to_csv(os.path.join(filePath, f"{savename}_LastGeneBuild{genebuild_last_updated}.csv"), sep=",", index=False)

print("Done!")
.gtf biomaRt Ensembl • 244 views
ADD COMMENT
0
Entering edit mode
Mike Smith ★ 6.6k
@mike-smith
Last seen 2 hours ago
EMBL Heidelberg

I think this is because Ensembl can assign different Gene IDs to "the same gene" if it is found both on a standard chromosome and a scaffold that hasn't be placed in a chromosome. In your NPY4R2 example ENSG00000264717 is on chromosome 10 but ENSG00000288157 is on an unplaced scaffold. Presumably they're sufficiently similar / identical for the HGNC to consider them both NPY4R2, but Ensembl choose to assign separate IDs.

I haven't checked, but my guess would be that your GTF files don't contain these unplaced scaffolds, and you'd want Homo_sapiens.GRCh38.113.chr_patch_hapl_scaff.gtf for that.

ADD COMMENT
0
Entering edit mode

I have received a response from Ensembl regarding the items I have listed above. I wanted to post it here and will continue to post updates as I hear more.

1) The human data: They ran a count for genes on GRCh38 without patches (only chr and scaffolds) and it returns 78932 / 86402 Genes. So it is the patches that are causing the difference between biomaRt. It seems as though the .gtf does not include patches, but the biomaRt does.

2) The Sheep data: This is related to a bug regarding missing FTP files for sheep Texel. The ENSOARG00000000016 COX1 is from texel, the ENSOARG00020000017 is COX1 from rambouillet. The location https://ftp.ebi.ac.uk/ensemblorg/pub/release-113/gtf/ovis_aries/ should have the files for texel, but they are missing. They have a known bug up about that which can be found at https://www.ensembl.info/known-bugs/ensembl-113/. They have stated that this shoudl be fixed when they release 114, but for now to use 112.

ADD REPLY
0
Entering edit mode

Great, thanks for the update from Ensembl. So the most relevant part of 'Location: Scaffold HG1277_PATCH: 137,021-141,806 forward strand' for ENSG00000288157 for this issue is PATCH rather than Scaffold. Good to know.

ADD REPLY

Login before adding your answer.

Traffic: 883 users visited in the last hour
Help About
FAQ
Access RSS
API
Stats

Use of this site constitutes acceptance of our User Agreement and Privacy Policy.

Powered by the version 2.3.6