How to write a script for identifying associate gene id
Entering edit mode
abhisek001 • 0
Last seen 14 days ago

The following is a large text file I write few lines of it, I need to extract some genes from here according to my input please write a small script. I specified Input-output at the bottom of the post. This problem could seem very tough but in brief, I just want to recognize the associate gene id of one version from another, because online sites do not have the repository of this species.

SDRB02000004.1  Genbank gene    6018    10396   .   +   .   gene_id "TEA_012962"; transcript_id ""; gbkey "Gene"; gene_biotype "protein_coding"; locus_tag "TEA_012962"; 
SDRB02000004.1  Genbank transcript  6018    10396   .   +   .   gene_id "TEA_012962"; transcript_id "gnl|WGS:SDRB|TEA014503.1"; gbkey "mRNA"; locus_tag "TEA_012962"; orig_protein_id "gnl|WGS:SDRB|TEA014503.1:cds_7"; orig_transcript_id "gnl|WGS:SDRB|TEA014503.1"; product "hypothetical protein"; transcript_biotype "mRNA"; 
SDRB02000004.1  Genbank exon    6018    6864    .   +   .   gene_id "TEA_012962"; transcript_id "gnl|WGS:SDRB|TEA014503.1"; locus_tag "TEA_012962"; orig_protein_id "gnl|WGS:SDRB|TEA014503.1:cds_7"; orig_transcript_id "gnl|WGS:SDRB|TEA014503.1"; product "hypothetical protein"; transcript_biotype "mRNA"; exon_number "1"; 
SDRB02000232.1  Genbank stop_codon  994202  994204  .   +   0   gene_id "TEA_014895"; transcript_id "gnl|WGS:SDRB|TEA016705.1"; gbkey "CDS"; locus_tag "TEA_014895"; orig_transcript_id "gnl|WGS:SDRB|TEA016705.1"; product "hypothetical protein"; protein_id "THG23623.1"; exon_number "19";

My input and desire output like the following -

Input (common gene-name)                   Output (special gene name)
TEA_012962                                           TEA014503.1
TEA_014895                                           TEA016705.1
bash r bingo • 958 views
Entering edit mode
Last seen 1 day ago
United States

You are being unnecessarily mysterious about what that is and where you got it. If you are asking people to help you with a problem, it's up to you to add enough context that people want to help, and have the ability to do so. Also, you apparently asked the same question on multiple help forums. That is in general considered bad form. You should restrict to one site.

Anyway, what you have is apparently this GTF. Which you could read in using import from the rtracklayer package, and then easily process however you want.

> download.file("", "GCA_004153795.2_AHAU_CSS_2_genomic.gtf.gz")
> z <- gzfile("GCA_004153795.2_AHAU_CSS_2_genomic.gtf.gz")
> library(rtracklayer)
> library(GenomicFeatures)
> gr <- import(z)
> txdb <- makeTxDbFromGRanges(gr)
> tx <- transcripts(txdb, columns = c("gene_id","tx_name"))
> whatYouWant <- data.frame(Input = unlist(mcols(tx)$gene_id), Output = sapply(strsplit(mcols(tx)$tx_name, "\\|"), "[", 3))
> head(whatYouWant)
       Input      Output
1 TEA_012962 TEA014503.1
2 TEA_012972 TEA014513.1
3 TEA_012982 TEA014523.1
4 TEA_012984 TEA014525.1
5 TEA_012976 TEA014517.1
6 TEA_012980 TEA014521.1
Entering edit mode

You don't need these lines:

> z <- gzfile("GCA_004153795.2_AHAU_CSS_2_genomic.gtf.gz")
> gr <- import(z)

Instead you can import directly. Seems to me that didn't used to work? But whatever. This does work

> gr <- import("GCA_004153795.2_AHAU_CSS_2_genomic.gtf.gz")
Entering edit mode

Thanks for helping me out but I am sorry because at that time I was in a hurry so I visited several platforms and finally I did it manually with excel. I am a beginner and hope you will understand my concern.


Login before adding your answer.

Traffic: 526 users visited in the last hour
Help About
Access RSS

Use of this site constitutes acceptance of our User Agreement and Privacy Policy.

Powered by the version 2.3.6