svlachavas
Germany/Heidelberg/German Cancer Resear…
Dear Community,
I would like to annotate a DGEList object, created with the DGEList function from the edgeR package, with unique gene symbols for its Ensembl identifiers. My approach is the following:
y <- DGEList(counts=assay(coad_clear), group=colData(coad_clear)$definition)
head(y$counts[1:3,1:3])
TCGA-3L-AA1B-01A-11R-A37K-07 TCGA-DM-A1D8-01A-11R-A155-07
ENSG00000000003 7280 10395
ENSG00000000005 23 1
ENSG00000000419 2065 4158
TCGA-AU-6004-01A-11R-1723-07
ENSG00000000003 2547
ENSG00000000005 27
ENSG00000000419 1465
head(y$samples)
group lib.size norm.factors
TCGA-3L-AA1B-01A-11R-A37K-07 Primary solid Tumor 42553617 1
TCGA-DM-A1D8-01A-11R-A155-07 Primary solid Tumor 60377942 1
TCGA-AU-6004-01A-11R-1723-07 Primary solid Tumor 47402733 1
TCGA-T9-A92H-01A-11R-A37K-07 Primary solid Tumor 46429596 1
TCGA-AA-3663-11A-01R-1723-07 Solid Tissue Normal 35484802 1
TCGA-AA-A01T-01A-21R-A16W-07 Primary solid Tumor 15405325 1
# The approach I followed:
dim(y)
[1] 56963 497
gene.ids <- select(org.Hs.eg.db, keys=rownames(y), keytype="ENSEMBL", columns="SYMBOL")
'select()' returned 1:many mapping between keys and columns
dim(gene.ids)
[1] 57310 2
head(gene.ids)
ENSEMBL SYMBOL
1 ENSG00000000003 TSPAN6
2 ENSG00000000005 TNMD
3 ENSG00000000419 DPM1
4 ENSG00000000457 SCYL3
5 ENSG00000000460 C1orf112
6 ENSG00000000938 FGR
sum(duplicated(gene.ids$ENSEMBL))
[1] 347
gene.ids <- gene.ids[!duplicated(gene.ids$ENSEMBL),]
identical(gene.ids$ENSEMBL, rownames(y))
[1] TRUE
y$genes <- gene.ids
head(y$genes)
ENSEMBL SYMBOL
1 ENSG00000000003 TSPAN6
2 ENSG00000000005 TNMD
3 ENSG00000000419 DPM1
4 ENSG00000000457 SCYL3
5 ENSG00000000460 C1orf112
6 ENSG00000000938 FGR
y2 <- y[!duplicated(y$genes$SYMBOL),]
dim(y2)
[1] 25214 497
I wanted to ask whether there is a more straightforward or more accurate way to perform the above annotation, or whether my implementation has any pitfalls. I have also checked the alternative function mapIds, but this returns a vector, not a data frame. My aim is to perform downstream DE gene analysis.
Thank you in advance,
Efstathios
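
For illustration, here is a minimal base-R sketch of the two steps in the post: collapsing a 1:many mapping to one symbol per Ensembl ID, then filtering on duplicated symbols. The data are toy stand-ins (the IDs, symbols and the names `ids`, `map`, `genes`, `keep_naive` and `keep_explicit` are hypothetical, not taken from the post), so it runs without org.Hs.eg.db. A named vector with one value per key, such as the one mapIds returns, could be wrapped into a data frame in the same way.

```r
# Toy stand-ins (hypothetical, not from the post): four Ensembl-style row names
# and a 1:many mapping table shaped like the output of select().
ids <- c("ENSG01", "ENSG02", "ENSG03", "ENSG04")
map <- data.frame(
  ENSEMBL = c("ENSG01", "ENSG02", "ENSG02", "ENSG03", "ENSG04"),
  SYMBOL  = c("A", "B", "B2", NA, NA),
  stringsAsFactors = FALSE
)

# match() returns the first hit per key, in the order of `ids`, so the
# annotation is aligned with the row names by construction.
genes <- data.frame(
  ENSEMBL = ids,
  SYMBOL  = map$SYMBOL[match(ids, map$ENSEMBL)],
  row.names = ids,
  stringsAsFactors = FALSE
)

# duplicated() treats repeated NA values as duplicates of each other, so
# filtering on SYMBOL alone keeps exactly one unannotated row and drops the rest.
keep_naive <- !duplicated(genes$SYMBOL)

# Handling NA explicitly makes the choice visible; here unannotated genes
# are dropped altogether (one possible policy, not the only one).
keep_explicit <- !is.na(genes$SYMBOL) & !duplicated(genes$SYMBOL)

print(genes[keep_naive, ])
print(genes[keep_explicit, ])
```

In the post's code, the `y2 <- y[!duplicated(y$genes$SYMBOL),]` step behaves like `keep_naive` above, so any Ensembl IDs without a symbol are reduced to a single row with an NA symbol. Whether to keep or drop unannotated genes is a design choice that depends on the downstream analysis.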

Thanks Aaron for the update. The simpler the better.
What is "coad_clear" in this case? What data have you assigned to it? Could you please share the whole code? Or can anyone else answer this question?