How do I figure out what annotation package to use with a transcriptome?
I have reads counted with rna.fa.gz from:
ftp://ftp.ncbi.nih.gov/genomes/M_musculus/ARCHIVE/BUILD.37.1/RNA/
I can't find an NCBI annotation package for Mm, or any Mm annotation package that references build 37.
Names look like this:
> head(data$Name)
[1] "gi|126352347|ref|NM_028260.2|" "gi|142348699|ref|NM_010886.2|"
[3] "gi|23821034|ref|NM_017376.2|" "gi|33239323|ref|NM_146675.1|"
[5] "gi|118130860|ref|NM_133839.2|" "gi|29789242|ref|NM_024451.1|"
I thought that the numbers following 'gi|' would be entrez gene identifiers.
So I tried org.Mm.eg.db. But this doesn't seem to work:
> x <- org.Mm.egGENENAME
> mapped_genes <- mappedkeys(x)
> which(mapped_genes=="126352347")
integer(0)
> which(mapped_genes==126352347)
integer(0)
I'm confused and don't know what to try.
I figure the IDs after "ref|" are REFSEQ?
> head(as.list(org.Mm.egREFSEQ), n=2)
$`11287`
[1] "NM_007376" "NP_031402"
$`11298`
[1] "NM_009591" "NP_033721" "NR_033223" "XM_017314223" "XM_017314224"
[6] "XP_017169712" "XP_017169713"
Looks like it. So I need to chop up data$Name to get the REFSEQ out.
> IDs <- strsplit(data$Name, "\\|")
my.REFSEQ <- gsub("\\.*","",my.REFSEQ)
head(my.REFSEQ)> my.REFSEQ <- unlist(lapply(IDs, '[[', 4))
> my.REFSEQ <- gsub("\\.*","",my.REFSEQ)
> head(my.REFSEQ)
[1] "NM_0282602" "NM_0108862" "NM_0173762" "NM_1466751" "NM_1338392"
[6] "NM_0244511"
Now that I have REFSEQs, I can use select() to extract SYMBOLs right?
> select(org.Mm.eg.db, keys = my.REFSEQ, keytype = "REFSEQ", columns = "SYMBOL")
Error in .testForValidKeys(x, keys, keytype, fks) :
None of the keys entered are valid keys for 'REFSEQ'. Please use the keys method to see a listing of valid arguments.
Apparently not. What am I doing wrong?
Figured it out. My gsub is borked.
my.REFSEQ <- gsub("\\..*","",my.REFSEQ)
Anyway, is there a way to do this without the strsplit and gsub? Do any of the annotation packages parse IDs like "gi|126352347|ref|NM_028260.2|" ?