TCGA data Query (GDCquery): external_gene_name " are missing
1
0
Entering edit mode
@habibolla-24859
Last seen 11 weeks ago
Morgantown

Hi, I just got some weird output from TCGA dataset. As you can see in the below picture, some of the "external_gene_name " are missing. Would you please help me out with this issue? Thank you.

query.seq <- GDCquery(project = "TCGA-BRCA", 
                      data.category = "Transcriptome Profiling", 
                      data.type = "Gene Expression Quantification",
                      sample.type = c("Solid Tissue Normal", "Primary Tumor"),
                      workflow.type = "HTSeq - Counts")


GDCdownload(query.seq)

seq.brca <- GDCprepare(query = query.seq, summarizedExperiment = TRUE)

enter image description here

TCGAbiolinks • 170 views
ADD COMMENT
0
Entering edit mode
@kevin
Last seen 52 minutes ago
Republic of Ireland

Hi,

You already received an answer on Biostars: https://www.biostars.org/p/9473246/#9473248

The different annotation databases do not overlap perfectly. Each [database] includes transcripts and genes based on different rules. There are actually many previous questions on this topic, and even publications. It is not strictly an issue with the TCGAbiolinks package.

Let's see if we can find out more about the two circled, i.e., ENSG00000281904 and ENSG00000281920:

genes <- c('ENSG00000281904','ENSG00000281920')

1, via org.Hs.eg.db

require(org.Hs.eg.db)

mapIds(
  org.Hs.eg.db,
  keys = genes,
  column = 'SYMBOL',
  keytype = 'ENSEMBL')

Error in .testForValidKeys(x, keys, keytype, fks) : 
  None of the keys entered are valid keys for 'ENSEMBL'. Please use the keys method to see a listing of valid arguments.

Not there.

2, via biomaRt

require(biomaRt)

ensembl <- useMart('ensembl', dataset = 'hsapiens_gene_ensembl')

annot <- getBM(
  attributes = c(
    'hgnc_symbol',
    'external_gene_name',
    'ensembl_gene_id',
    'entrezgene_id',
    'gene_biotype'),
  filters = 'ensembl_gene_id',
  values = genes,
  mart = ensembl)

annot <- merge(
  x = as.data.frame(genes),
  y =  annot,
  by.y = 'ensembl_gene_id',
  all.x = T,
  by.x = 'genes')

annot
            genes hgnc_symbol external_gene_name entrezgene_id gene_biotype
1 ENSG00000281904          NA                 NA            NA       lncRNA
2 ENSG00000281920          NA                 NA            NA       lncRNA

Nothing there either, but at least that we can see that these are long non-coding RNAs.

Kevin

ADD COMMENT
0
Entering edit mode

Thank you Kevin, So do you mean that I can neglect them for my analysis? Actually, when I used the "gencode.gene.info.v22.csv" file from TCGA, it has assigned some name to them (highlighted part in the first picture attached).

enter image description here

But on the other hand, my friend get the exact name of the genes one year ago by "gencode.gene.info.v22.csv", but they are not the same, I mean they have aliases. for example;

enter image description here

RP11-418H16.1 = AC007389.5
CH17-132F21.5= AC233263.6
So I'm wondering how can I get the same gene names such as "AC007389.5" instead of "RP11-418H16.1" as I mentioned above. ?

ADD REPLY

Login before adding your answer.

Traffic: 256 users visited in the last hour
Help About
FAQ
Access RSS
API
Stats

Use of this site constitutes acceptance of our User Agreement and Privacy Policy.

Powered by the version 2.3.6