Getting paralogous genes from BioMart for mouse
@hemantcnaik-23771
India

I am using biomaRt_2.48.2 and want the paralogous genes within the mouse genome. I tried the script below, but it gives me an error. I need the gene IDs together with the paralog percent identity attribute so that I can filter for putative paralogs. Please help, Mike Smith. Thank you.

**Error in curl::curl_fetch_memory(url, handle = handle) :
Timeout was reached: [www.ensembl.org:443] Operation timed out after 300000 milliseconds with 58240952 bytes received**

library(biomaRt)

mouse = useMart("ensembl", dataset = "mmusculus_gene_ensembl")
searchAttributes(mart = mouse, pattern = "ggallus")

attributes=searchAttributes(mart = mouse, pattern = "paralog")

hgid <- getBM(attributes = "ensembl_gene_id",
              filters    = "with_mmusculus_paralog",
              values     = TRUE,
              mart       = mouse)$ensembl_gene_id

para <- getBM(attributes = attributes$name,
              filters    = "ensembl_gene_id",
              values     = hgid,
              mart       = mouse)

biomart biomaRt Ensembl • 197 views
@james-w-macdonald-5106
United States
> mmart <- useEnsembl("ensembl","mmusculus_gene_ensembl")
> hmart <- useEnsembl("ensembl", "hsapiens_gene_ensembl")
> ensids <- c("ENSMUSG00000030359", "ENSMUSG00000020804", "ENSMUSG00000025375", "ENSMUSG00000015243", "ENSMUSG00000028125", "ENSMUSG00000026944")
> humanStuff <- c("mmusculus_homolog_ensembl_gene",  "mmusculus_homolog_perc_id",
"mmusculus_homolog_goc_score", "mmusculus_homolog_orthology_confidence", "ensembl_gene_id")
> getLDS(c("mgi_symbol", "ensembl_gene_id"), "ensembl_gene_id", ensids, mmart, humanStuff, martL = hmart)
MGI.symbol     Gene.stable.ID Mouse.gene.stable.ID
1      Abca1 ENSMUSG00000015243   ENSMUSG00000015243
2      Abca2 ENSMUSG00000026944   ENSMUSG00000026944
3      Abca4 ENSMUSG00000028125   ENSMUSG00000028125
4      Aanat ENSMUSG00000020804   ENSMUSG00000020804
5       Aatk ENSMUSG00000025375   ENSMUSG00000025375
X.id..target.Mouse.gene.identical.to.query.gene
1                                         95.3118
2                                         93.1445
3                                         88.0774
4                                         83.0918
5                                         73.1441
Mouse.Gene.order.conservation.score
1                                 100
2                                  75
3                                 100
4                                 100
5                                 100
Mouse.orthology.confidence..0.low..1.high. Gene.stable.ID.1
1                                          1  ENSG00000165029
2                                          1  ENSG00000107331
3                                          1  ENSG00000198691
4                                          1  ENSG00000129673
5                                          1  ENSG00000181409
>



@James W. MacDonald Thanks for the reply. The code above gives orthologous gene information, but I want paralogous genes within the species.

Mike Smith ★ 5.1k
@mike-smith
EMBL Heidelberg / de.NBI

First, let's address why you're seeing the error. You get the "Timeout was reached" error because BioMart limits queries to 5 minutes of run time, and the more data you ask for, the longer a query takes. In your case, attributes$name is actually 11 attributes. Combine that with ~25,000 genes and you're asking for a lot of data. BioMart isn't designed as a bulk data provider, so the query times out. You can improve the chances of your query completing by reducing either the number of attributes or the number of genes. You can also submit multiple smaller queries and then stitch the results back together.
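One way the "multiple smaller queries" idea could be sketched is shown here. The helper query_in_chunks() and the chunk size of 500 are illustrative assumptions, not part of biomaRt, and fake_query() with its made-up IDs only stands in for a real getBM() call so that the example runs without a network connection.

```r
# Hypothetical helper: run one query per chunk of IDs and row-bind the results.
query_in_chunks <- function(ids, query_fun, chunk_size = 500) {
  # Assign consecutive IDs to chunks of at most `chunk_size` elements
  chunk_index <- ceiling(seq_along(ids) / chunk_size)
  chunks <- split(ids, chunk_index)
  # One call to `query_fun` per chunk; each call returns a data frame
  results <- lapply(chunks, query_fun)
  # Stitch the per-chunk data frames back together
  do.call(rbind, unname(results))
}

# Stand-in for a real query (assumption). With biomaRt this could instead be
# something like:
#   function(x) getBM(attributes = c("ensembl_gene_id",
#                                    "mmusculus_paralog_ensembl_gene"),
#                     filters = "ensembl_gene_id", values = x, mart = mouse)
fake_query <- function(x) {
  data.frame(ensembl_gene_id = x, n_char = nchar(x), stringsAsFactors = FALSE)
}

# Made-up IDs, used only to exercise the chunking logic
ids <- sprintf("ENSMUSG%011d", 1:1200)
res_chunked <- query_in_chunks(ids, fake_query, chunk_size = 500)
dim(res_chunked)
```

Smaller chunks make each request less likely to hit the time limit, at the cost of more round trips to the server.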

Regarding finding the paralogs, I think what you've already tried is a reasonable strategy. However, we can combine your two queries into one. We'll use the with_mmusculus_paralog filter to restrict the results to genes that have paralogs, and then ask for the gene ID and the paralog information in the attributes argument. Here's an example:

library(biomaRt)

mouse <- useEnsembl("ensembl", dataset = "mmusculus_gene_ensembl")

res <- getBM(filters = "with_mmusculus_paralog",
             values = TRUE,
             attributes = c("ensembl_gene_id",
                            "mmusculus_paralog_ensembl_gene",
                            "mmusculus_paralog_orthology_type",
                            "mmusculus_paralog_perc_id"),
             mart = mouse)

dim(res)
#> [1] 2395592       4

#>      ensembl_gene_id mmusculus_paralog_ensembl_gene
#> 1 ENSMUSG00000064345             ENSMUSG00000064367
#> 2 ENSMUSG00000064345             ENSMUSG00000064363
#> 3 ENSMUSG00000064363             ENSMUSG00000064345
#> 4 ENSMUSG00000064363             ENSMUSG00000064367
#> 5 ENSMUSG00000064367             ENSMUSG00000064345
#> 6 ENSMUSG00000064367             ENSMUSG00000064363
#> 7 ENSMUSG00002074970             ENSMUSG00002075052
#> 8 ENSMUSG00002074970             ENSMUSG00002076483
#>   mmusculus_paralog_orthology_type mmusculus_paralog_perc_id
#> 1                    other_paralog                   20.5797
#> 2                    other_paralog                   18.5507
#> 3                    other_paralog                   13.9434
#> 4                    other_paralog                   16.9935
#> 5                    other_paralog                   11.6969
#> 6                    other_paralog                   12.8501
#> 7           within_species_paralog                   79.6040
#> 8           within_species_paralog                   75.6436


I've selected three of the paralog-related attributes; you can pick different ones if they're more appropriate for what you're trying to do. Note that the mmusculus_paralog_orthology_type column distinguishes between paralogs that appear only in mouse and those that are also homologous across species (that's the "other_paralog" type). There's also a lot of duplication here, because for every set of homologs all possible pairings are listed. You can see that in the first 6 lines of this output.
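As a rough sketch of how the reciprocal pairings could be collapsed and then filtered on percent identity, the example below rebuilds the eight rows shown above as a small data frame. The 50% cut-off is a purely hypothetical threshold, and keeping the first row of each pair is an arbitrary choice: the percent identity differs between the two directions of a pair, so one of the two values is discarded.

```r
# The eight example rows from the output above, rebuilt by hand
res <- data.frame(
  ensembl_gene_id = c("ENSMUSG00000064345", "ENSMUSG00000064345",
                      "ENSMUSG00000064363", "ENSMUSG00000064363",
                      "ENSMUSG00000064367", "ENSMUSG00000064367",
                      "ENSMUSG00002074970", "ENSMUSG00002074970"),
  mmusculus_paralog_ensembl_gene = c("ENSMUSG00000064367", "ENSMUSG00000064363",
                                     "ENSMUSG00000064345", "ENSMUSG00000064367",
                                     "ENSMUSG00000064345", "ENSMUSG00000064363",
                                     "ENSMUSG00002075052", "ENSMUSG00002076483"),
  mmusculus_paralog_perc_id = c(20.5797, 18.5507, 13.9434, 16.9935,
                                11.6969, 12.8501, 79.6040, 75.6436),
  stringsAsFactors = FALSE
)

# Build an order-independent key so that A-B and B-A share the same value
a <- res$ensembl_gene_id
b <- res$mmusculus_paralog_ensembl_gene
pair_key <- paste(pmin(a, b), pmax(a, b), sep = "_")

# Keep only the first occurrence of each unordered pair
unique_pairs <- res[!duplicated(pair_key), ]
nrow(unique_pairs)
#> [1] 5

# Hypothetical threshold (assumption): keep pairs with at least 50% identity
min_perc_id <- 50
high_id <- unique_pairs[unique_pairs$mmusculus_paralog_perc_id >= min_perc_id, ]
nrow(high_id)
#> [1] 2
```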


Mike Smith, thank you for the response. Of the given attributes ensembl_gene_id and mmusculus_paralog_ensembl_gene, which gene IDs should we take into consideration? I have RNA-seq data and want to remove these genes from my list. Do you have any ideas?


Every gene that Ensembl classes as having a paralog will appear in both the ensembl_gene_id and mmusculus_paralog_ensembl_gene columns. Indeed, the unique values in those two columns will be identical, and the same as the hgid variable in your original post. If you really want to exclude all genes that have paralogs, you can just use that set of IDs. However, that will remove ~50% of the mouse genes, which probably isn't what you want.
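If excluding every gene with a paralog really is the goal, a minimal sketch could look like the following. Both vectors are made-up placeholders: rnaseq_genes stands for the gene IDs in your expression data, and genes_with_paralogs for the set of IDs described above (for example hgid).

```r
# Placeholder IDs for illustration only (assumption, not real query results)
rnaseq_genes <- c("ENSMUSG00000064345", "ENSMUSG00000064363",
                  "ENSMUSG00000000001")
genes_with_paralogs <- c("ENSMUSG00000064345", "ENSMUSG00000064363",
                         "ENSMUSG00000064367")

# Keep only the genes that are not in the paralog set
genes_kept <- setdiff(rnaseq_genes, genes_with_paralogs)
genes_kept
#> [1] "ENSMUSG00000000001"
```

With a count matrix whose row names are Ensembl gene IDs, the same idea would be a row subset using !(rownames(counts) %in% genes_with_paralogs), where counts is a hypothetical object name.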

It would be possible to work through the ensembl_gene_id column one row at a time and remove any rows that appear in the mmusculus_paralog_ensembl_gene column, to be left with one gene ID for each "paralog cluster". However, that's pretty arbitrary, and I think you need to clarify why you want to remove paralogous genes and determine whether any other steps in your pipeline (like alignment) will have already taken this into account.
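For illustration only, the row-by-row idea could be sketched as a greedy loop like the one below. It uses a cut-down, hand-built version of the example output, and which gene ends up representing a cluster depends entirely on row order, which is exactly the arbitrariness mentioned above. The variable names representatives and dropped are assumptions made for this sketch.

```r
# Cut-down version of the example output, rebuilt by hand
res <- data.frame(
  ensembl_gene_id = c("ENSMUSG00000064345", "ENSMUSG00000064345",
                      "ENSMUSG00000064363", "ENSMUSG00000064363",
                      "ENSMUSG00000064367", "ENSMUSG00000064367",
                      "ENSMUSG00002074970", "ENSMUSG00002074970"),
  mmusculus_paralog_ensembl_gene = c("ENSMUSG00000064367", "ENSMUSG00000064363",
                                     "ENSMUSG00000064345", "ENSMUSG00000064367",
                                     "ENSMUSG00000064345", "ENSMUSG00000064363",
                                     "ENSMUSG00002075052", "ENSMUSG00002076483"),
  stringsAsFactors = FALSE
)

representatives <- character(0)  # one gene kept per cluster
dropped <- character(0)          # paralogs of genes that were already kept

for (gene in unique(res$ensembl_gene_id)) {
  # Skip genes already listed as a paralog of a kept gene
  if (gene %in% dropped) next
  representatives <- c(representatives, gene)
  paralogs <- res$mmusculus_paralog_ensembl_gene[res$ensembl_gene_id == gene]
  dropped <- union(dropped, paralogs)
}

representatives
#> [1] "ENSMUSG00000064345" "ENSMUSG00002074970"
```

This relies on all pairings within a cluster being listed, as in the output above; if a cluster were only partially connected, a proper connected-components approach would be needed instead.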