Hello,
I have a problem with the getBM function. I'm doing an RNA-seq experiment. The file I'm analyzing contains an ExpressionSet object with RNA-seq count data for 700 samples, as well as information about different phenotypes. The count data has 20532 genes (Entrez Gene identifiers).
I want to extract some annotation features for each gene (start position, end position and GC content). For this purpose I use the following code:
library(biomaRt)
mart <- useMart(biomart="ensembl", dataset="hsapiens_gene_ensembl")
annot <- getBM(attributes=c("entrezgene", "start_position", "end_position",
                            "percentage_gc_content"),
               filters="entrezgene",
               values=rownames(counts),
               mart=mart)
(counts is the object with the 20532 genes as row names and the counts for every sample.)
My problem starts when I get the annot result with these features. It has 21739 rows, more than the original 20532. I looked at it and realized that some genes are duplicated; that is to say, there are, for instance, two different entrezgene entries that correspond to the same gene.
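To make the duplication concrete, a check along these lines should list which identifiers come back on more than one row. It is only a sketch: the small data frame is a made-up stand-in for the real annot (all values are invented), and the names rows_per_id, dup_ids, dup_rows and n_unique are just for illustration.

```r
# Toy stand-in for the getBM() result; every value here is invented.
annot <- data.frame(
  entrezgene            = c("1", "2", "2", "3", "3", "3"),
  start_position        = c(100L, 500L, 520L, 900L, 905L, 2000L),
  end_position          = c(200L, 800L, 790L, 1500L, 1490L, 2600L),
  percentage_gc_content = c(41.2, 55.0, 54.7, 38.9, 39.1, 47.3),
  stringsAsFactors      = FALSE
)

# Number of rows returned for each identifier
rows_per_id <- table(annot$entrezgene)

# Identifiers that appear on more than one row
dup_ids <- names(rows_per_id)[rows_per_id > 1]

# All rows belonging to those identifiers, for manual inspection
dup_rows <- annot[annot$entrezgene %in% dup_ids, ]

# Number of distinct identifiers, to compare against nrow(counts)
n_unique <- length(unique(annot$entrezgene))
```

On the real data, comparing n_unique with nrow(counts) would also show whether any of the original genes got no annotation at all.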
How can I resolve this problem? I expected the annot file to have fewer genes than the original, but the opposite happened. Maybe the code is not correct, or the solution is easy to find, but I'm very new to R and Bioconductor. Any help?
Thanks in advance,
Jose