Hi,
I am trying to generate a data.frame with columns: gene_id, gene_name, seqnames,strand,start,end,exon_starts,exon_ends using Danio rerio GTF file from ensembl.
I have written an R script as below, but it takes long time to create the data.frame that I am interested. Is there a quicker way?
library("rtracklayer")
get_gene_features <- function(filename, output_file){
ngtf <- import(filename)
ngtf_df <- as.data.frame(ngtf)
ngtf <- NA
len_ngtf_df <- sum(ngtf_df$type == "gene", na.rm = TRUE)
geneFeatureDf <- subset(ngtf_df[,c(10,12,1,5,2,3,7)], ngtf_df$type == "gene")
geneFeatureDf["exon_start"] <- NA
geneFeatureDf["exon_end"] <- NA
for (i in 1:length(geneFeatureDf$gene_id)){
geneid <- geneFeatureDf[i,]$gene_id
geneFeatureDf[i,8] <- paste(subset(ngtf_df$start[rownames(geneFeatureDf)[i]:rownames(geneFeatureDf)[i+1]],ngtf_df$gene_id[rownames(geneFeatureDf)[i]:rownames(geneFeatureDf)[i+1]] == geneid & ngtf_df$type[rownames(geneFeatureDf)[i]:rownames(geneFeatureDf)[i+1]] == "exon"),collapse = ",")
geneFeatureDf[i,9] <- paste(subset(ngtf_df$end[rownames(geneFeatureDf)[i]:rownames(geneFeatureDf)[i+1]],ngtf_df$gene_id[rownames(geneFeatureDf)[i]:rownames(geneFeatureDf)[i+1]] == geneid & ngtf_df$type[rownames(geneFeatureDf)[i]:rownames(geneFeatureDf)[i+1]] == "exon"),collapse = ",")
write.table(geneFeatureDf[i,],file = output_file, append = TRUE,sep = "\t", col.names = FALSE, row.names = TRUE, na = "NA")
}
j = length(geneFeatureDf)-1
geneid <- geneFeatureDf[j+1,]$gene_id
geneFeatureDf[j+1,8] <- paste(subset(ngtf_df$start[rownames(geneFeatureDf)[j],],ngtf_df$gene_id[rownames(geneFeatureDf)[j],] == geneid & ngtf_df$type[rownames(geneFeatureDf)[j],] == "exon"),collapse = ",")
geneFeatureDf[j+1,9] <- paste(subset(ngtf_df$end[rownames(geneFeatureDf)[j],],ngtf_df$gene_id[rownames(geneFeatureDf)[j],] == geneid & ngtf_df$type[rownames(geneFeatureDf)[j],] == "exon"),collapse = ",")
write.table(geneFeatureDf[j+1,],file = output_file, append = TRUE,sep = "\t", col.names = FALSE, row.names = TRUE, na = "NA")
return(geneFeatureDf)
}
geneFeatureDf <- get_gene_features("Danio_rerio.GRCz10.82.gtf","geneFeatureDf.tab")
an example of the output is below:
"," "gene_id" "gene_name" "chromosome" "strand" "start" "end" "type" "exon_start" "exon_end"
"1" "ENSDARG00000104632" "si:ch73-252i11.3" "4" "-" 6733 52120 "gene" "52002,48613,13755,9547,6733,52002,48613" "52120,48740,13811,9620,7380,52061,48740"
"11" "ENSDARG00000100660" "ZC3HAV1" "4" "+" 16217 28312 "gene" "16217,18669,19012,20698,20832,22332,22924,23103,23513,24516,25485,20691,20832,22332,22924,23103,23513,24516,25485" "16549,18804,19372,20748,20958,22618,23020,23178,23643,24667,26320,20748,20958,22618,23020,23178,23643,24667,28312"
thanks in advance.
Hi Mehmet,
Sounds like it would be easier to just get that data.frame out of a TxDb object that you could first make with the
makeTxDbFromGFF()
ormakeTxDbFromBiomart()
functions from the GenomicFeatures package. Any reason you want to extract the data.frame directly from the GRanges object you get by importing the GTF file withrtracklayer::import()
? This object tends to be hard to work with. Working with the TxDb object is generally much simpler.H.
Hi Herve,
thank you very much for the reply.
I tried makeTxDbFromGFF() and it worked !!!
Thanks again.
ilyas.