Hello,
I have a problem with the getBM function. I'm doing an RNA-seq experiment. The file that I'm analyzing includes an ExpressionSet object with RNA-seq count data for 700 samples, as well as information about different phenotypes. The data set contains 20532 genes, identified by Entrez Gene ID.
I want to extract some annotation for these genes (start position, end position and GC content). For this purpose I use this code:
library(biomaRt)
# Note: newer Ensembl releases call this attribute and filter "entrezgene_id"
mart <- useMart(biomart="ensembl", dataset="hsapiens_gene_ensembl")
annot <- getBM(attributes=c("entrezgene","start_position","end_position",
                            "percentage_gc_content"),
               filters="entrezgene",
               values=rownames(counts),
               mart=mart)
(counts is the object with the 20532 genes as row names and the counts for every sample.)
My problem starts when I get the annot table with these characteristics. It has 21739 rows, more than the original one. I looked through it and realized that some genes are duplicated; that is to say, there are, for instance, two different entrezgene identifiers that correspond to the same gene.
How can I resolve this problem? I expected the annot table to have fewer genes, if anything, but the opposite happened. Maybe the code is not correct, or maybe there is an easy solution, but I'm very new to R and Bioconductor. Any help?
Thanks in advance,
Jose
As to which method is 'best', that depends on how you want to define best. I could argue that the best way would be to retain the duplicates; if there are multiple regions of the genome that are thought to contain a given gene, then isn't that information relevant?
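Before choosing, it can help to look at which IDs come back more than once. The following is a minimal sketch in base R, assuming annot has the columns requested in the question; the toy data frame and its values are made up for illustration and stand in for the real getBM() result.

```r
# Toy stand-in for the getBM() result; all values are invented for illustration.
annot <- data.frame(
  entrezgene            = c(1, 2, 2, 3, 3, 3),
  start_position        = c(100, 500, 9000, 200, 7000, 15000),
  end_position          = c(400, 900, 9600, 600, 7800, 15500),
  percentage_gc_content = c(40, 50, 54, 45, 47, 49)
)

# Number of rows returned for each Entrez Gene ID
n_per_id <- table(annot$entrezgene)

# IDs that appear in more than one row
dup_ids <- names(n_per_id)[n_per_id > 1]

# All rows for those IDs, sorted so that the copies sit next to each other
dup_rows <- annot[annot$entrezgene %in% dup_ids, ]
dup_rows <- dup_rows[order(dup_rows$entrezgene, dup_rows$start_position), ]
```

Inspecting dup_rows shows whether the copies differ only slightly or point to quite different regions, which may inform which of the options is reasonable.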
Alternatively, I could argue that you seem to primarily want the GC content, and in that case you could compute the mean GC content over all the regions that contain the gene. That might be 'best' in some sense.
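As a rough sketch of the averaging option, base R's aggregate() can collapse the table to one row per Entrez Gene ID. The toy annot below is invented for illustration; only the column names are taken from the question. Note that this is a plain, unweighted mean over the rows, which is an assumption about what "mean GC content" should be here.

```r
# Toy stand-in for the getBM() result; all values are invented for illustration.
annot <- data.frame(
  entrezgene            = c(1, 2, 2, 3, 3, 3),
  start_position        = c(100, 500, 9000, 200, 7000, 15000),
  end_position          = c(400, 900, 9600, 600, 7800, 15500),
  percentage_gc_content = c(40, 50, 54, 45, 47, 49)
)

# One row per Entrez Gene ID, with the unweighted mean GC content of its rows
gc_mean <- aggregate(percentage_gc_content ~ entrezgene, data = annot, FUN = mean)
```

The start and end positions are dropped here on purpose, since an average position across different regions would not be meaningful.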
Or alternatively I could argue that the 'best' method is to simply remove the duplicates as fast as possible, because who has the time? In that case, just subsetting out the duplicated Entrez Gene IDs would be fastest and easiest, and hence best.
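The quick option could look something like the sketch below, using duplicated() and match() from base R. The toy annot and counts objects are hypothetical stand-ins for the real data, and gene "4" is included only to show what happens to an ID that has no annotation.

```r
# Toy stand-ins for the real objects; all values are invented for illustration.
annot <- data.frame(
  entrezgene            = c(1, 2, 2, 3, 3, 3),
  start_position        = c(100, 500, 9000, 200, 7000, 15000),
  end_position          = c(400, 900, 9600, 600, 7800, 15500),
  percentage_gc_content = c(40, 50, 54, 45, 47, 49)
)
counts <- matrix(0, nrow = 4, ncol = 2,
                 dimnames = list(c("1", "2", "3", "4"), c("s1", "s2")))

# Keep only the first row seen for each Entrez Gene ID
annot_unique <- annot[!duplicated(annot$entrezgene), ]

# Put the annotation in the same order as the rows of counts;
# genes without any annotation end up as a row of NA
annot_aligned <- annot_unique[match(rownames(counts), annot_unique$entrezgene), ]
```

Keep in mind that duplicated() keeps whichever copy happens to come first, so the choice among copies is arbitrary unless the table is sorted beforehand.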
In the end, as with most analyses, there are choices that have to be made. Your goal as an analyst is to decide what choice you want to or are willing to make, based on all the criteria of interest in your analysis (time available to do the analysis, what the goals are, what hypothesis you are trying to test, etc.), and have a cogent argument as to why your choice was optimal in some sense.