I have a gene list and want to run a GO or KEGG enrichment analysis on the top differentially expressed genes. The problem I am facing is with the data set. It is a matrix with approximately 570 samples and 12000 genes. The sample names are in the standard format, e.g. "TCGA-3C-AALK-01A-11R-A41B-07", which I understand. The gene names, however, I don't understand. For example, the first five genes in the data set are named "1", "87769", "144568", "2", "53947". I don't know whether they are Entrez IDs or some other gene naming format. Could someone please clarify this? Furthermore, could someone provide R code to do enrichment analysis using this naming format? For reference, the first 100 gene names in the data set are listed below.
[1] "1" "87769" "144568" "2" "53947" "65985" "51166"
[8] "79719" "22848" "57505" "80755" "16" "60496" "132949"
[15] "10157" "26574" "9625" "18" "10349" "79963" "26154"
[22] "650655" "19" "20" "21" "24" "23461" "23460"
[29] "10347" "10351" "10350" "23456" "5243" "5244" "10058"
[36] "11194" "23457" "89845" "85320" "4363" "1244" "8714"
[43] "10257" "10057" "730013" "368" "6833" "10060" "215"
[50] "225" "5825" "5826" "6059" "9619" "9429" "83451"
[57] "26090" "84945" "25864" "84836" "116236" "84696" "11057"
[64] "171586" "63874" "51099" "57406" "79575" "10152" "25890"
[71] "51225" "27" "3983" "84448" "22885" "28" "26"
[78] "29" "80325" "25841" "30" "10449" "31" "32"
[85] "80724" "84129" "27034" "34" "36" "35" "37"
[92] "176" "9744" "23527" "116983" "38" "39" "64746"
[99] "79777" "91452"
Thanks.
They do indeed appear to be Entrez IDs. However, please state the exact source of your data (and check there yourself) to help confirm this.
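One quick way to check this yourself is sketched below. It assumes the identifiers are the row names of your matrix (replaced here by a hypothetical vector `ids` holding the first five values from the question) and, for the optional second step, that the Bioconductor packages AnnotationDbi and org.Hs.eg.db are installed. If these are Entrez IDs, "1" and "2" should come back as A1BG and A2M.

```r
# First few gene identifiers from the question; in practice this would be
# something like rownames(expr), where `expr` is a hypothetical name for
# the expression matrix.
ids <- c("1", "87769", "144568", "2", "53947")

# Entrez Gene IDs are plain positive integers, so every ID should be digits only.
all_numeric <- all(grepl("^[0-9]+$", ids))
print(all_numeric)

# Optional: translate to gene symbols if the annotation packages are available.
if (requireNamespace("AnnotationDbi", quietly = TRUE) &&
    requireNamespace("org.Hs.eg.db", quietly = TRUE)) {
  symbols <- AnnotationDbi::mapIds(org.Hs.eg.db::org.Hs.eg.db,
                                   keys = ids,
                                   column = "SYMBOL",
                                   keytype = "ENTREZID")
  print(symbols)
  # Fraction of IDs with no symbol; a high value would argue against Entrez IDs
  # (a few NAs are expected, since some IDs get retired over time).
  print(mean(is.na(symbols)))
}
```

A digits-only check alone does not prove anything, so the symbol lookup is the more convincing part.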
For the enrichment work itself, you can use:
Both of these accept Entrez IDs, and both are Bioconductor packages.
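A minimal sketch of what such a call can look like, using clusterProfiler purely as an assumed example of a Bioconductor package that takes Entrez IDs directly. The `de_genes` vector is a placeholder built from the first IDs in the question, not a real differential expression result, and `run_kegg` is a hypothetical switch added for illustration.

```r
# Placeholder "top DE genes": just the first IDs from the question,
# NOT a real differential expression result.
de_genes <- c("1", "87769", "144568", "2", "53947")

# enrichKEGG() fetches pathway data from the KEGG web service, so it is
# switched off by default in this sketch.
run_kegg <- FALSE

if (requireNamespace("clusterProfiler", quietly = TRUE) &&
    requireNamespace("org.Hs.eg.db", quietly = TRUE)) {
  # GO over-representation test on Entrez IDs.
  # In a real analysis, also pass universe = <all Entrez IDs in the matrix>.
  ego <- clusterProfiler::enrichGO(gene = de_genes,
                                   OrgDb = org.Hs.eg.db::org.Hs.eg.db,
                                   keyType = "ENTREZID",
                                   ont = "BP",
                                   pAdjustMethod = "BH")
  print(head(as.data.frame(ego)))

  if (run_kegg) {
    # KEGG over-representation test; "hsa" is the KEGG code for human.
    ekegg <- clusterProfiler::enrichKEGG(gene = de_genes, organism = "hsa")
    print(head(as.data.frame(ekegg)))
  }
}
```

With only five placeholder genes the result table will most likely be empty; the point is the shape of the call, not the output.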
Thanks for the answer.
Do the R libraries you listed require an internet connection?
I downloaded the data from the following link:
http://gdac.broadinstitute.org/runs/stddata_201504_02/data/BRCA/20150402/
The file name is:
gdac.broadinstitute.orgBRCA.MergernaseqilluminahiseqrnasequnceduLevel3geneexpression_data.Level3.2015040200.0.0.tar.gz
Thanks. Then yes, they are likely Entrez IDs. As for your other question, I believe they require an internet connection. Can you not check that yourself?
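For what it is worth, the statistical core of an over-representation analysis is a hypergeometric test, which base R can run with no connection at all, provided the gene sets are already on disk. The sketch below assumes that situation. The function name `ora_pvalue` and the toy `universe`, `gene_sets` and `de` objects are all hypothetical and only stand in for real Entrez ID vectors.

```r
# Over-representation p-value for one gene set, using only base R.
#   de       : Entrez IDs called differentially expressed
#   gene_set : Entrez IDs belonging to one GO term or pathway
#   universe : all Entrez IDs measured (e.g. the row names of the matrix)
ora_pvalue <- function(de, gene_set, universe) {
  universe <- unique(universe)
  de       <- intersect(unique(de), universe)
  gene_set <- intersect(unique(gene_set), universe)
  k <- length(intersect(de, gene_set))  # DE genes that fall in the set
  m <- length(gene_set)                 # set genes present in the universe
  n <- length(universe) - m             # universe genes outside the set
  # P(X >= k) when drawing length(de) genes without replacement
  phyper(k - 1, m, n, length(de), lower.tail = FALSE)
}

# Toy data (made up): 100 genes, two gene sets, 10 "DE" genes.
universe  <- as.character(1:100)
gene_sets <- list(setA = as.character(1:10),
                  setB = as.character(60:80))
de        <- as.character(c(1:5, 50:54))

# One p-value per gene set, then Benjamini-Hochberg adjustment.
pvals <- sapply(gene_sets, function(gs) ora_pvalue(de, gs, universe))
padj  <- p.adjust(pvals, method = "BH")
print(data.frame(p = pvals, p_adj = padj))
```

The p-value for each set is the same as that of a one-sided Fisher exact test on the corresponding 2x2 table, which is what most enrichment packages compute under the hood for this kind of analysis.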
I have very limited access to the internet, so I had to ask. Thanks for the answers.