Mapping short reads to gene IDs in GPL9115
NS ▴ 60
@ns-7498
Last seen 5.7 years ago
United States

I have downloaded the GSE31617 data; its platform is GPL9115.

In mRNA files, there are three columns:

       Tag                            CopyNumber    TPM(Transcript Per Million)
CATGTGGATGGGCTTCTTGTA                    2                          0.23
CATGGGGCCTTCCAGACCCAC                    2                          0.23
CATGTGCATTTTCAAGTGGGT                    2                          0.23
CATGAGCCACCGCGCCCTGTC                    2                          0.23
CATGGAACTAATTCGCTGACC                    2                          0.23
CATGCTGCTTCGGCCCCAGCG                    2                          0.23
CATGGTCCTCACCCAAGCCTA                    2                          0.23
CATGTCTGTGTGTGGTGAGCA                    2                          0.23

How can I map each Tag to a RefSeq ID or gene symbol? My goal is to build an expression matrix.

Unfortunately, I did not find any useful information for GPL9115. The SOFT formatted family file(s) on this page contain only something like the columns above for all of the available samples in GEO.

I would appreciate it if anyone could help me.

rnaseq EST gene Id • 1.7k views

This question has also been posted to Biostars: https://www.biostars.org/p/165111


Yeah, but I could not find a specific answer, e.g. R code or something like that.

@steve-lianoglou-2771
Last seen 21 months ago
United States

I would guess these were generated from a SAGE-seq-like protocol, similar to what was used in this paper.

I worked with this type of data a bit in grad school. You can map these sequences to their host gene in much the same way you would analyze "normal" sequencing data, i.e., align the original data to the genome and count it as usual (via something like featureCounts). Alternatively, you could align just the first column of the data you've shown, then combine the resulting gene hits with the counts in the table. Note that several rows of that data file will most likely resolve to the same gene, so you'll have to decide whether to sum them up to the gene level or analyze at the unique tag level (I previously rolled up to the gene level).
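The roll-up step described above can be sketched in a few lines of plain Python. This is only an illustration: the `tag_to_gene` mapping stands in for whatever your alignment produces, and the gene names `GENE_A` and `GENE_B` are placeholders, not real annotations for these tags.

```python
from collections import defaultdict

# Tag counts as they appear in the sample file (Tag, CopyNumber).
tag_counts = {
    "CATGTGGATGGGCTTCTTGTA": 2,
    "CATGGGGCCTTCCAGACCCAC": 2,
    "CATGTGCATTTTCAAGTGGGT": 2,
}

# Hypothetical alignment result: tag -> gene. The gene names are
# placeholders chosen for the example, not real hits for these tags.
tag_to_gene = {
    "CATGTGGATGGGCTTCTTGTA": "GENE_A",
    "CATGGGGCCTTCCAGACCCAC": "GENE_A",
    "CATGTGCATTTTCAAGTGGGT": "GENE_B",
}


def roll_up_to_genes(tag_counts, tag_to_gene):
    """Sum tag counts per gene; tags without a gene hit are skipped."""
    gene_counts = defaultdict(int)
    for tag, count in tag_counts.items():
        gene = tag_to_gene.get(tag)
        if gene is not None:
            gene_counts[gene] += count
    return dict(gene_counts)


gene_counts = roll_up_to_genes(tag_counts, tag_to_gene)
print(gene_counts)
```

Running the same function on each sample and collecting the per-gene results as columns would give the genes-by-samples matrix the question asks for. Staying at the tag level instead just means skipping this step.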

 

 

@martin-morgan-1513
Last seen 3 months ago
United States

Use Biostrings::matchPDict(), e.g., following the example labelled "A. A SIMPLE EXAMPLE OF EXACT MATCHING". But if the 'tpm' column was computed from a different mapping, your results will not make any sense.
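To make the idea of exact matching concrete, here is a minimal sketch in plain Python. It is not the Biostrings API, only an illustration of the same technique: index every fixed-length substring of the reference sequences, then look each tag up. The transcript IDs and sequences (`TX_1`, `TX_2`) are made up for the example, and only the forward strand is searched, which is a simplifying assumption.

```python
from collections import defaultdict


def build_kmer_index(transcripts, k):
    """Map every k-mer to the set of transcript IDs that contain it."""
    index = defaultdict(set)
    for tx_id, seq in transcripts.items():
        for i in range(len(seq) - k + 1):
            index[seq[i:i + k]].add(tx_id)
    return index


def match_tags(tags, transcripts):
    """Return {tag: sorted list of transcript IDs with an exact match}."""
    if not tags:
        return {}
    k = len(tags[0])
    if any(len(tag) != k for tag in tags):
        raise ValueError("all tags must have the same length")
    index = build_kmer_index(transcripts, k)
    return {tag: sorted(index.get(tag, ())) for tag in tags}


# Made-up reference sequences (hypothetical IDs), built so that the
# first two tags from the question each occur in exactly one of them.
transcripts = {
    "TX_1": "GGA" + "CATGTGGATGGGCTTCTTGTA" + "AAAA",
    "TX_2": "TTC" + "CATGGGGCCTTCCAGACCCAC" + "AAAA",
}
tags = [
    "CATGTGGATGGGCTTCTTGTA",
    "CATGGGGCCTTCCAGACCCAC",
    "CATGTGCATTTTCAAGTGGGT",
]

hits = match_tags(tags, transcripts)
for tag, tx_ids in hits.items():
    print(tag, tx_ids)
```

The tags shown in the question are all 21 nt long, which is why a fixed-length lookup is enough here. A tag that matches more than one transcript, or none, still has to be handled explicitly before its count is assigned to a gene.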


Thank you so much @Martin, especially for mentioning the point about the 'tpm' column. My major is computer science and I did not know that.
