TCGA data Query (GDCquery): external_gene_name " are missing
1
0
Entering edit mode
@habibolla-24859
Last seen 12 months ago
Morgantown

Hi, I just got some weird output from TCGA dataset. As you can see in the below picture, some of the "external_gene_name " are missing. Would you please help me out with this issue? Thank you.

query.seq <- GDCquery(project = "TCGA-BRCA",
data.category = "Transcriptome Profiling",
data.type = "Gene Expression Quantification",
sample.type = c("Solid Tissue Normal", "Primary Tumor"),
workflow.type = "HTSeq - Counts")

seq.brca <- GDCprepare(query = query.seq, summarizedExperiment = TRUE)


0
Entering edit mode
@kevin
Last seen 10 minutes ago
Republic of Ireland

Hi,

The different annotation databases do not overlap perfectly. Each [database] includes transcripts and genes based on different rules. There are actually many previous questions on this topic, and even publications. It is not strictly an issue with the TCGAbiolinks package.

Let's see if we can find out more about the two circled, i.e., ENSG00000281904 and ENSG00000281920:

genes <- c('ENSG00000281904','ENSG00000281920')


# 1, via org.Hs.eg.db

require(org.Hs.eg.db)

mapIds(
org.Hs.eg.db,
keys = genes,
column = 'SYMBOL',
keytype = 'ENSEMBL')

Error in .testForValidKeys(x, keys, keytype, fks) :
None of the keys entered are valid keys for 'ENSEMBL'. Please use the keys method to see a listing of valid arguments.


Not there.

# 2, via biomaRt

require(biomaRt)

ensembl <- useMart('ensembl', dataset = 'hsapiens_gene_ensembl')

annot <- getBM(
attributes = c(
'hgnc_symbol',
'external_gene_name',
'ensembl_gene_id',
'entrezgene_id',
'gene_biotype'),
filters = 'ensembl_gene_id',
values = genes,
mart = ensembl)

annot <- merge(
x = as.data.frame(genes),
y =  annot,
by.y = 'ensembl_gene_id',
all.x = T,
by.x = 'genes')

annot
genes hgnc_symbol external_gene_name entrezgene_id gene_biotype
1 ENSG00000281904          NA                 NA            NA       lncRNA
2 ENSG00000281920          NA                 NA            NA       lncRNA


Nothing there either, but at least that we can see that these are long non-coding RNAs.

Kevin

0
Entering edit mode

Thank you Kevin, So do you mean that I can neglect them for my analysis? Actually, when I used the "gencode.gene.info.v22.csv" file from TCGA, it has assigned some name to them (highlighted part in the first picture attached).

But on the other hand, my friend get the exact name of the genes one year ago by "gencode.gene.info.v22.csv", but they are not the same, I mean they have aliases. for example;

RP11-418H16.1 = AC007389.5
CH17-132F21.5= AC233263.6
So I'm wondering how can I get the same gene names such as "AC007389.5" instead of "RP11-418H16.1" as I mentioned above. ?