Counting the number of paralogues for mouse genes gives me the wrong frequency
Jack • 0
@jack-14823

I am trying to count the number of paralogues for the mouse homologues of the human protein-coding genes using BioMart. However, for the 'PLIN4' gene, for example, it is counting 35,000 paralogues instead of 4.

We think this is because some genes have one-to-many paralogues, which causes repeats. When I run a single gene, it gives me back the correct number of paralogues. Is there a way to either remove these repeats from the results, or a way around this so that BioMart doesn't output the repeats in the first place?

I have also thought about running one gene at a time and counting its paralogues, using some sort of loop so that all of the genes in the list are processed automatically.

The code I have written so far is:

    # Load the biomaRt package:

    library(biomaRt)
    ensembl_hsapiens <- useMart("ensembl", 
                              dataset = "hsapiens_gene_ensembl")
    ensembl_mouse <- useMart("ensembl", 
                           dataset = "mmusculus_gene_ensembl")

    # Get all human protein coding genes:
  
    hsapien_PC_genes <- getBM(attributes = c("ensembl_gene_id", 
                                             "external_gene_name"), 
                              filters = "biotype", 
                              values = "protein_coding", 
                              mart = ensembl_hsapiens)

    ensembl_gene_ID <- hsapien_PC_genes$ensembl_gene_id

    # Get mouse homologues

    mouse_homologues <- getBM(attributes = c("ensembl_gene_id", "external_gene_name", 
                                           "mmusculus_homolog_associated_gene_name"), 
                            filters = "ensembl_gene_id", 
                            values = c(ensembl_gene_ID), 
                            mart = ensembl_hsapiens)

    # Get mouse external gene name 
  
    mouse_homologues_external_gene_names <- mouse_homologues$mmusculus_homolog_associated_gene_name

    # Get paralogues of the mouse homologues

    mouse_paralogues <- getBM(attributes = c("hsapiens_homolog_associated_gene_name",
                                             "external_gene_name",
                                             "mmusculus_paralog_associated_gene_name"),
                              filters = "external_gene_name",
                              values = c(mouse_homologues_external_gene_names),
                              mart = ensembl_mouse)

    # Remove genes with no paralogues

    mouse_paralogs_data <- mouse_paralogues[!(is.na(mouse_paralogues$mmusculus_paralog_associated_gene_name) |
                                              mouse_paralogues$mmusculus_paralog_associated_gene_name == ""), ]

    # Count paralogues per gene
  
    library(plyr)
    count_mouse_paralogues <- count(mouse_paralogs_data, "external_gene_name")
    View(count_mouse_paralogues)


Hope someone can help

Thanks 

Jack

R bioconductor biomart bioinformatics • 956 views

I'm not sure I fully understand what the problem is. I ran all your example code down to the `# Count paralogues per gene` line. When I then use dplyr to filter the results for only those rows containing the human PLIN4 gene, I get 4 results, which seems to be what you want:

> library(dplyr)
> dplyr::filter(mouse_paralogs_data, hsapiens_homolog_associated_gene_name == "PLIN4" )

  hsapiens_homolog_associated_gene_name external_gene_name mmusculus_paralog_associated_gene_name
1                                 PLIN4              Plin4                              Plin3
2                                 PLIN4              Plin4                              Plin5
3                                 PLIN4              Plin4                              Plin2
4                                 PLIN4              Plin4                              Plin1
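
If repeated rows are what inflates the count on your side, one option is to collapse the results to unique gene/paralogue pairs before counting. Below is a minimal sketch in base R. It runs on a made-up data frame rather than a real `getBM()` result: the duplicated rows and the `GeneX` entry are invented purely for illustration, and only the column names are taken from your code.

```r
# Toy stand-in for mouse_paralogues; the repeated Plin4 rows and the
# hypothetical "GeneX" row are illustrative, not real BioMart output.
toy_paralogues <- data.frame(
  external_gene_name = c("Plin4", "Plin4", "Plin4", "Plin4",
                         "Plin4", "Plin4", "GeneX"),
  mmusculus_paralog_associated_gene_name = c("Plin3", "Plin5", "Plin2", "Plin1",
                                             "Plin3", "Plin5", ""),
  stringsAsFactors = FALSE
)

# Drop rows where no paralogue was returned (NA or empty string)
paralog <- toy_paralogues$mmusculus_paralog_associated_gene_name
toy_filtered <- toy_paralogues[!is.na(paralog) & paralog != "", ]

# Keep a single row per (gene, paralogue) pair so repeats are not counted twice
gene_paralog_pairs <- unique(
  toy_filtered[, c("external_gene_name",
                   "mmusculus_paralog_associated_gene_name")]
)

# Count the distinct paralogues for each gene
paralog_counts <- as.data.frame(table(gene_paralog_pairs$external_gene_name),
                                stringsAsFactors = FALSE)
names(paralog_counts) <- c("external_gene_name", "n_paralogues")

print(paralog_counts)
```

On the real data the same two steps (`unique()` on the two columns, then counting) would be applied to `mouse_paralogs_data`. It may also be worth passing `unique()` values to the `values` argument of the query, since the list of mouse homologue names can contain the same name more than once.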